  • Pandas: Series and DataFrame

    Data science: the pandas library

    pandas has two main data structures: Series and DataFrame. With these two structures you can load, visualize, and analyze data.

    Installing pandas: pip install pandas

    import numpy as np
    import pandas as pd
    a = np.array([1,5,3,4,10,0,9])
    b = pd.Series([1,5,3,4,10,0,9])
    print(a)
    print(b)
    
    [ 1  5  3  4 10  0  9]
    0     1
    1     5
    2     3
    3     4
    4    10
    5     0
    6     9
    dtype: int64
    

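    A Series is like a list: it holds a sequence of values and behaves like a one-dimensional array in which every value has a corresponding index. Take the list [9, 3, 8], for example, written together with its index values.

    A Series has two attributes: values and index. Sometimes it helps to picture the data written vertically: a Series is an array "stood on end".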

    import pandas as pd
    b = pd.Series([1,5,3,4,10,0,9])
    print (b.values)
    print (b.index)
    print (type(b.values))
    
    [ 1  5  3  4 10  0  9]
    RangeIndex(start=0, stop=7, step=1)
    <class 'numpy.ndarray'>
    
    import pandas as pd
    s = pd.Series ([21,19,20,50], index = ['张三','李四','王五','赵六'])
    print (s)
    
    张三    21
    李四    19
    王五    20
    赵六    50
    dtype: int64
    
    s['赵六']
    
    50
    
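    • Build a Series from a list
    • A Series consists of data and an index
      1. The index is on the left, the data on the right
      2. The index is created automatically
    • Get the data and the index
      ser_obj.index, ser_obj.values
    • Preview the data
      ser_obj.head(n)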
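
    As a small illustration of the points above, the sketch below builds a Series from a list, reads its index and values, and previews it with head(n). The name ser_obj follows the list above; the sample numbers are made up for illustration.

```python
import pandas as pd

# Build a Series from a plain list; pandas creates the index 0..n-1 automatically
ser_obj = pd.Series([10, 20, 30, 40, 50, 60])

print(ser_obj.index)    # the index (left-hand column when the Series is printed)
print(ser_obj.values)   # the underlying data as a NumPy array

# head(n) returns the first n entries; with no argument it returns the first 5
first_three = ser_obj.head(3)
print(first_three)
```
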
    import pandas as pd
    countries = ['中国','美国','日本','德国']
    countries_s = pd.Series(countries)
    print (countries_s)
    
    0    中国
    1    美国
    2    日本
    3    德国
    dtype: object
    
    import pandas as pd
    country_dicts = {'CH': '中国', 'US': '美国', 'AU': '澳大利亚'}
    country_dict_s = pd.Series(country_dicts)
    country_dict_s.index.name = 'Code'
    country_dict_s.name = 'Country'
    print(country_dict_s)
    print(country_dict_s.values)
    print(country_dict_s.index)
    
    Code
    CH      中国
    US      美国
    AU    澳大利亚
    Name: Country, dtype: object
    ['中国' '美国' '澳大利亚']
    Index(['CH', 'US', 'AU'], dtype='object', name='Code')
    

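    Note: the dict keys are used as the index.

    A list can only be indexed by integers starting at 0, and by default the index of a Series works the same way. Unlike a list, however, a Series can be given a custom index.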

    import pandas as pd
    data = [1,2,3,4,5]
    ind = ['a','b','c','d','e']
    s = pd.Series (data, index = ind )
    print (s)
    
    a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64
    

    Converting a Series to a dict

    import pandas as pd
    s = pd.Series ([21,19,20,50], index = ['张三','李四','王五','赵六'])
    s1 = s.to_dict ()
    print (s1)
    
    {'张三': 21, '李四': 19, '王五': 20, '赵六': 50}
    

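    Vectorized operations

    Vectorized operations on a Series (and vectorized thinking in general) are very important in data analysis and artificial intelligence: instead of handling scalars one at a time, operate on whole vectors (arrays).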

    import numpy as np
    import pandas as pd
    s = range(11)
    s1 = pd.Series(s)
    
    total = np.sum(s1)
    print('total = ',total)
    
    total =  55
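
    The same idea extends beyond sums. A minimal sketch, reusing the values 0 to 10 from above, that contrasts an explicit Python loop with the equivalent vectorized expressions on a Series:

```python
import pandas as pd

s1 = pd.Series(range(11))

# Scalar thinking: handle one element at a time in a Python loop
doubled_loop = [x * 2 for x in s1]

# Vector thinking: one expression applies to every element at once
doubled = s1 * 2            # element-wise multiplication
squares = s1 ** 2           # element-wise power
is_even = s1 % 2 == 0       # element-wise comparison gives a boolean Series

print(doubled.tolist())
print(squares.sum())
print(s1[is_even].tolist())  # the boolean Series used as a filter
```
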
    

    DataFrame

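    A Series is similar to a one-dimensional array, while a DataFrame is a two-dimensional data structure, similar to a spreadsheet. It has both a row index (index) and a column index (label), and can be thought of as a dict of Series.

    Each column is a Series. The columns share the same rows, so there is a row index as well. A DataFrame is column-oriented, each column can hold a different data type, and because everything is labeled, data is easy to extract.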
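
    To make the "dict of Series" view concrete, here is a minimal sketch. The Series names, labels, and values are made up for illustration.

```python
import pandas as pd

# Two Series that share the same row index
names = pd.Series(['Joe', 'Cat', 'Mike'], index=['r1', 'r2', 'r3'])
points = pd.Series([4, 25, 6], index=['r1', 'r2', 'r3'])

# A dict of Series becomes a DataFrame: the keys turn into column labels
df = pd.DataFrame({'name': names, 'points': points})
print(df)

# Each column is itself a Series, and columns may have different dtypes
print(type(df['points']))
print(df.dtypes)
```
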

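    DataFrame objects and operations

    • Build a DataFrame from Series
    • Build a DataFrame from a dict
    • Get a column's data (as a Series) by its column label
      • df_obj[label] or df_obj.label
    • Add a column, much like adding a key-value pair to a dict
      • df_obj[new_label] = data
    • Delete a column
      • del df_obj[col_idx]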
    # Create a DataFrame from Series
    import pandas as pd
    country1 = pd.Series({'Name': '中国','Language': 'Chinese','Area': '9.597M km2','Happiness Rank': 79})
    country2 = pd.Series({'Name': '美国','Language': 'English (US)','Area': '9.834M km2','Happiness Rank': 14})
    country3 = pd.Series({'Name': '澳大利亚','Language': 'English (AU)', 'Area':'7.692M km2','Happiness Rank': 9})
    df = pd.DataFrame([country1, country2, country3], index=['CH', 'US', 'AU'])
    print(df)
    
        Name      Language        Area  Happiness Rank
    CH    中国       Chinese  9.597M km2              79
    US    美国  English (US)  9.834M km2              14
    AU  澳大利亚  English (AU)  7.692M km2               9
    
    # Add a column of data
    import pandas as pd
    country1 = pd.Series({'Name': '中国','Language': 'Chinese','Area': '9.597M km2','Happiness Rank': 79})
    country2 = pd.Series({'Name': '美国','Language': 'English (US)','Area': '9.834M km2','Happiness Rank': 14})
    df = pd.DataFrame([country1, country2], index=['CH', 'US'])
    df['Location'] = '地球'
    print(df)
    
       Name      Language        Area  Happiness Rank Location
    CH   中国       Chinese  9.597M km2              79       地球
    US   美国  English (US)  9.834M km2              14       地球
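
    The column operations listed at the start of this section (access by label, add a column, delete a column) can be sketched on a small frame as follows. The frame and its column names are made up for illustration.

```python
import pandas as pd

df_obj = pd.DataFrame({'name': ['Joe', 'Cat'], 'points': [4, 25]})

# Access a column by label; both forms return a Series
col_a = df_obj['points']
col_b = df_obj.points   # attribute access only works for labels that are valid Python names

# Add columns the same way a key is added to a dict
df_obj['bonus'] = [1, 2]                                # one value per row
df_obj['total'] = df_obj['points'] + df_obj['bonus']    # computed from other columns

# Delete a column in place
del df_obj['bonus']

print(df_obj)
```
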
    
    # Create a DataFrame from a dict
    import pandas as pd
    dt = {0: [9, 8, 7, 6], 1: [3, 2, 1, 0]}
    a = pd.DataFrame(dt)
    print (a)
    
       0  1
    0  9  3
    1  8  2
    2  7  1
    3  6  0
    
    import pandas as pd
    df1 =pd.DataFrame ([[1,2,3],[4,5,6]],index = ['A','B'],columns = ['C1','C2','C3'])
    print (df1)
    
       C1  C2  C3
    A   1   2   3
    B   4   5   6
    
    df1.T
    
        A  B
    C1  1  4
    C2  2  5
    C3  3  6

    df1.shape
    
    (2, 3)
    
    df1.size
    
    6
    
    df1.head(1)
    
       C1  C2  C3
    A   1   2   3

    df1.tail(1)
    
       C1  C2  C3
    B   4   5   6

    df1.describe()
    
                C1       C2       C3
    count  2.00000  2.00000  2.00000
    mean   2.50000  3.50000  4.50000
    std    2.12132  2.12132  2.12132
    min    1.00000  2.00000  3.00000
    25%    1.75000  2.75000  3.75000
    50%    2.50000  3.50000  4.50000
    75%    3.25000  4.25000  5.25000
    max    4.00000  5.00000  6.00000

    df1.loc['B']
    
    C1    4
    C2    5
    C3    6
    Name: B, dtype: int64
    
    df1.loc['B'].loc['C2']
    
    5
    
    df1.loc['B', 'C1']
    
    4
    
    df1.iloc[1, 2]
    
    6
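
    To make the difference explicit: loc selects by label, while iloc selects by integer position. The sketch below rebuilds the df1 frame from above and compares the two, including slices (a label slice includes its end label, a position slice excludes its end position).

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['A', 'B'], columns=['C1', 'C2', 'C3'])

# The same cell, addressed by label and by position
by_label = df1.loc['B', 'C3']
by_position = df1.iloc[1, 2]

# Label slices include the end label
label_slice = df1.loc['A', 'C1':'C2']      # columns C1 and C2

# Position slices exclude the end position, like ordinary Python slices
position_slice = df1.iloc[0, 0:2]          # columns at positions 0 and 1

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```
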
    
    import pandas as pd
    data = {'name':['Joe','Cat','Mike','Kim','Amy'],'year':[2014,2015,2016,2017,2018],'Points':[4,25,6,2,3]}
    # Specify the row index
    df = pd.DataFrame (data, index = ['Day1','Day2','Day3','Day4','Day5'])
    print (df)
    
          name  year  Points
    Day1   Joe  2014       4
    Day2   Cat  2015      25
    Day3  Mike  2016       6
    Day4   Kim  2017       2
    Day5   Amy  2018       3
    
    # Select a column
    print(df['Points'])
    
    Day1     4
    Day2    25
    Day3     6
    Day4     2
    Day5     3
    Name: Points, dtype: int64
    

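    Operations on a DataFrame

    • Listing distinct values
    • Grouping data
    • Merging data
    • Cleaning data

    Listing distinct values

    unique is a method that lists the distinct values in a pandas column.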

    import pandas as pd
    data = {'name':['Joe','Cat','Mike','Kim','Amy'],'year':[2012,2012,2013,2018,2018],'Points':[4,25,6,2,3]}
    df = pd.DataFrame (data, index = ['Day1','Day2','Day3','Day4','Day5'])
    print (df)
    
          name  year  Points
    Day1   Joe  2012       4
    Day2   Cat  2012      25
    Day3  Mike  2013       6
    Day4   Kim  2018       2
    Day5   Amy  2018       3
    

    First, get the column's data by indexing the DataFrame with the column label.

    Then call the unique method on that column to get its distinct values.

    df['year']
    
    Day1    2012
    Day2    2012
    Day3    2013
    Day4    2018
    Day5    2018
    Name: year, dtype: int64
    
    df['year'].unique()
    
    array([2012, 2013, 2018], dtype=int64)
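
    Two closely related Series methods are often used alongside unique: nunique counts the distinct values, and value_counts counts how often each one occurs. A sketch that rebuilds just the year column from above:

```python
import pandas as pd

years = pd.Series([2012, 2012, 2013, 2018, 2018],
                  index=['Day1', 'Day2', 'Day3', 'Day4', 'Day5'], name='year')

distinct = years.unique()        # distinct values, in order of first appearance
n_distinct = years.nunique()     # how many distinct values there are
counts = years.value_counts()    # number of occurrences of each distinct value

print(distinct)
print(n_distinct)
print(counts)
```
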
    

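    Grouping data

    • Split the data into groups according to some criterion
    • Apply a function (method) to each group separately
    • Combine the results into a single data structure

    groupby is the most commonly used and effective grouping function in pandas; the grouped object supports statistical functions such as sum(), count(), and mean().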

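    import numpy as np
    import pandas as pd
    # Two key columns to group by and two columns of random data
    df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
                       'key2':['one', 'two', 'one', 'two', 'one'],
                       'data1':np.random.randn(5),
                       'data2':np.random.randn(5)})
    print(df)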
    
      key1 key2     data1     data2
    0    a  one  1.600927 -0.876908
    1    a  two  0.159591  0.288545
    2    b  one  0.919900 -0.982536
    3    b  two  1.158895  1.787031
    4    a  one  0.116526  0.795206
    
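    grouped = df.groupby(df['key1'])
    # numeric_only=True skips the non-numeric key2 column; recent pandas versions raise an error without it
    print(grouped.mean(numeric_only=True))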
    
             data1     data2
    key1                    
    a     0.625681  0.068948
    b     1.039398  0.402248
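
    Because the example above uses random numbers, its output changes on every run. The sketch below uses fixed, made-up values so that the three steps (split, apply, combine) can be checked by hand.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1, 2, 3, 4, 5],
                   'data2': [10, 20, 30, 40, 50]})

# Split by key1, apply an aggregation to data1, combine into one Series per call
grouped = df.groupby('key1')['data1']
print(grouped.sum())     # a: 1 + 2 + 5 = 8,  b: 3 + 4 = 7
print(grouped.count())   # a: 3 rows,         b: 2 rows
print(grouped.mean())    # a: 8 / 3,          b: 3.5

# Grouping by two keys gives a result indexed by both
by_both = df.groupby(['key1', 'key2'])['data2'].sum()
print(by_both)
```
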
    

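    Merging data

    Merging combines columns from different DataFrames based on a shared column.

    Example: suppose there are two DataFrames:

    (1) one contains the student IDs and names
    (2) the other contains the student IDs and the grades for three courses: math, Python, and computational thinking


    Goal: create a new DataFrame containing the student ID, the name, and the grades for the three courses.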

    import pandas as pd
    df2 = pd.DataFrame({'Key':['2015308', '2016312', '2017301', '2017303'],
                    'Math':[91, 88, 75, 68],
                    'Python':[81, 82, 87, 76],
                    'Computational thinking':[94, 81, 85, 86]})
    print(df2)
    
           Key  Math  Python  Computational thinking
    0  2015308    91      81                      94
    1  2016312    88      82                      81
    2  2017301    75      87                      85
    3  2017303    68      76                      86
    
    df3 = pd.DataFrame({'Key':['2015308', '2016312', '2017301', '2017303'],
                    'Name':['张三', '李四', '王五', '赵六']})
    print(df3)
    
           Key Name
    0  2015308   张三
    1  2016312   李四
    2  2017301   王五
    3  2017303   赵六
    
    # Merge the names (df3) with the grades (df2) on the shared 'Key' column
    dfnew = pd.merge(df3, df2, on='Key')
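
    For reference, here is a self-contained sketch of the merge described above. It rebuilds the two frames under the assumed names names_df and grades_df (shortened to two courses) and also shows the how parameter, which controls what happens to keys that appear in only one frame. The extra student 2018305 is made up to illustrate that case.

```python
import pandas as pd

names_df = pd.DataFrame({'Key': ['2015308', '2016312', '2017301', '2018305'],
                         'Name': ['张三', '李四', '王五', '孙七']})
grades_df = pd.DataFrame({'Key': ['2015308', '2016312', '2017301', '2017303'],
                          'Math': [91, 88, 75, 68],
                          'Python': [81, 82, 87, 76]})

# The default is an inner join: only keys found in both frames are kept
inner = pd.merge(names_df, grades_df, on='Key')
print(inner)

# A left join keeps every row of the left frame; missing grades become NaN
left = pd.merge(names_df, grades_df, on='Key', how='left')
print(left)
```
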
    

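    Data cleaning

    • Handling missing data
      1. Detect missing data: ser_obj.isnull(), df_obj.isnull(); the inverse operation is notnull().
      2. Handle missing data
        1. df.fillna() fills in missing data; df.dropna() drops it.
        2. df.ffill() fills with the preceding value.
        3. df.bfill() fills with the following value.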
    df2
    
           Key  Math  Python  Computational thinking
    0  2015308    91      81                      94
    1  2016312    88      82                      81
    2  2017301    75      87                      85
    3  2017303    68      76                      86

    df2.drop([0, 3])
    
           Key  Math  Python  Computational thinking
    1  2016312    88      82                      81
    2  2017301    75      87                      85

    # axis selects the axis: 0 is rows, 1 is columns; the default is 0
    df2.drop('Math', axis=1)
    
           Key  Python  Computational thinking
    0  2015308      81                      94
    1  2016312      82                      81
    2  2017301      87                      85
    3  2017303      76                      86
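
    The missing-data methods listed at the start of this section are not exercised by the drop examples above. A minimal sketch on a made-up Series with gaps:

```python
import numpy as np
import pandas as pd

ser_obj = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(ser_obj.isnull().tolist())    # True where a value is missing
print(ser_obj.notnull().tolist())   # the inverse

filled_const = ser_obj.fillna(0)    # replace missing values with a constant
dropped = ser_obj.dropna()          # remove missing values
filled_fwd = ser_obj.ffill()        # copy the previous valid value forward
filled_bwd = ser_obj.bfill()        # copy the next valid value backward

print(filled_const.tolist())
print(dropped.tolist())
print(filled_fwd.tolist())
print(filled_bwd.tolist())
```

    None of these calls modifies ser_obj; each returns a new Series.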

    Quiz

    Q1 For the following code, which of the following statements will not return True?

    import pandas as pd
    
    sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
    obj1 = pd.Series(sdata)
    states = ['California', 'Ohio', 'Oregon', 'Texas']
    obj2 = pd.Series(sdata, index=states)
    obj3 = pd.isnull(obj2)
    
    import math
    
    math.isnan(obj2['California'])
    
    True
    
    obj2
    
    California        NaN
    Ohio          35000.0
    Oregon        16000.0
    Texas         71000.0
    dtype: float64
    
    obj2['California'] == None
    
    False
    
    x = obj2['California']
    obj2['California'] != x
    
    True
    
    obj3['California']
    
    True
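
    The results above follow from how NaN behaves: it is not equal to anything, including itself, so missing values have to be detected with pd.isnull (or math.isnan) rather than with ==. A short sketch with made-up values:

```python
import math
import numpy as np
import pandas as pd

x = np.nan

print(x == x)            # False: NaN never compares equal, even to itself
print(x != x)            # True
print(x == None)         # False: NaN is not None either
print(math.isnan(x))     # True
print(pd.isnull(x))      # True

# On a Series, == against NaN finds nothing, while isnull() finds the gap
s = pd.Series([1.0, np.nan])
eq_hits = (s == np.nan).sum()
null_hits = s.isnull().sum()
print(eq_hits, null_hits)
```
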
    

    Q2 In the Python code below, the keys of the dictionary d represent student ranks and the value for each key is a student name. Which of the following can be used to extract rows with student ranks that are lower than or equal to 3?

    import pandas as pd
    d = {
        '1': 'Alice',
        '2': 'Bob',
        '3': 'Rita',
        '4': 'Molly',
        '5': 'Ryan'
    }
    S = pd.Series(d)
    
    S.iloc[0:3]
    
    1    Alice
    2      Bob
    3     Rita
    dtype: object
    

    Q3 Suppose we have a DataFrame named df. We want to change the original DataFrame df in a way that all the column names are cast to upper case. Which of the following expressions is incorrect to perform the same?

    from pandas import DataFrame
    score = {'gre_score':[337, 324, 316, 322, 314], 'toefl_score':[118, 107, 104, 110, 103]}
    score_df = DataFrame(score, index = [1, 2, 3, 4, 5])
    print(score_df)
    
       gre_score  toefl_score
    1        337          118
    2        324          107
    3        316          104
    4        322          110
    5        314          103
    
    score_df.where(score_df['toefl_score'] > 105).dropna()
    
       gre_score  toefl_score
    1      337.0        118.0
    2      324.0        107.0
    4      322.0        110.0

    score_df[score_df['toefl_score'] > 105]
    
       gre_score  toefl_score
    1        337          118
    2        324          107
    4        322          110

    score_df.where(score_df['toefl_score'] > 105)
    
       gre_score  toefl_score
    1      337.0        118.0
    2      324.0        107.0
    3        NaN          NaN
    4      322.0        110.0
    5        NaN          NaN

    Q5 Which of the following can be used to create a DataFrame in Pandas?

    Python dict

    Pandas Series object

    2D ndarray

    Q6 Which of the following is an incorrect way to drop entries from the Pandas DataFrame named df shown below?

    city_dict = {'one':[0, 4, 8, 12], 'two':[1, 5, 9, 13], 'three':[2, 6, 10, 14], 'four':[3, 7, 11, 15]}
    city_df = DataFrame(city_dict, index=['Ohio', 'Colorado', 'Utah', 'New York'])
    print(city_df)
    
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    
    print(city_df.drop('two', axis=1))
    
              one  three  four
    Ohio        0      2     3
    Colorado    4      6     7
    Utah        8     10    11
    New York   12     14    15
    
    print(city_df.drop(['Utah', 'Colorado']))
    
              one  two  three  four
    Ohio        0    1      2     3
    New York   12   13     14    15
    

    Q7 For the Series s1 and s2 defined below, which of the following statements will give an error?

    import pandas as pd
    s1 = pd.Series({1: 'Alice', 2: 'Jack', 3: 'Molly'})
    s2 = pd.Series({'Alice': 1, 'Jack': 2, 'Molly': 3})
    print(s1)
    print(s2)
    
    1    Alice
    2     Jack
    3    Molly
    dtype: object
    Alice    1
    Jack     2
    Molly    3
    dtype: int64
    
    s2.iloc[1]
    
    2
    
    s1.loc[1]
    
    'Alice'
    
    s2[1]
    
    2
    
    s2.loc[1]
    

    Q8 Which of the following statements is incorrect?

    • We can use s.iteritems() on a pd.Series object s to iterate on it
    • If s and s1 are two pd.Series objects, we can't use s.append(s1) to directly append s1 to the existing series s.
    • If s is a pd.Series object, then we can use s.loc[label] to get all data where the index is equal to label.
    • loc and iloc are two useful and commonly used Pandas methods.
    s = pd.Series([1, 2, 3])
    s
    
    0    1
    1    2
    2    3
    dtype: int64
    
    s1 = pd.Series([4, 5, 6])
    s1
    
    0    4
    1    5
    2    6
    dtype: int64
    
    s.append(s1)
    s
    
    0    1
    1    2
    2    3
    dtype: int64
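
    The output above shows that append returned a new Series and left s unchanged. Series.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat does the same job in both older and newer versions. A sketch with the same two Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s1 = pd.Series([4, 5, 6])

# concat returns a new Series; s and s1 themselves are not modified
combined = pd.concat([s, s1])
print(combined)          # the index labels repeat: 0, 1, 2, 0, 1, 2

# ignore_index=True renumbers the result 0..n-1
renumbered = pd.concat([s, s1], ignore_index=True)
print(renumbered)
```
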
    

    Q9 For the given DataFrame df shown above, we want to get all records with a toefl score greater than 105 but smaller than 115. Which of the following expressions is incorrect to perform the same?

    print(score_df)
    
       gre_score  toefl_score
    1        337          118
    2        324          107
    3        316          104
    4        322          110
    5        314          103
    
    score_df[(score_df['toefl_score'] > 105) & (score_df['toefl_score'] < 115)]
    
       gre_score  toefl_score
    2        324          107
    4        322          110

    score_df[(score_df['toefl_score'].isin(range(106, 115)))]
    
       gre_score  toefl_score
    2        324          107
    4        322          110

    (score_df['toefl_score'] > 105) & (score_df['toefl_score'] < 115)
    
    1    False
    2     True
    3    False
    4     True
    5    False
    Name: toefl_score, dtype: bool
    
    score_df[score_df['toefl_score'].gt(105) & score_df['toefl_score'].lt(115)]
    
       gre_score  toefl_score
    2        324          107
    4        322          110

    stu_dict = {'Name':['Alice', 'Jack'], 'Age':[20, 22], 'Gender':['F', 'M']}
    stu_df = DataFrame(stu_dict, index=['Mathematics', 'Sociology'])
    print(stu_df)
    
                  Name  Age Gender
    Mathematics  Alice   20      F
    Sociology     Jack   22      M
    
    stu_df.loc['Mathematics']
    
    Name      Alice
    Age          20
    Gender        F
    Name: Mathematics, dtype: object
    
    
    
  • Original article: https://www.cnblogs.com/h-hkai/p/14381843.html