zoukankan      html  css  js  c++  java
  • pandas百题笔记

    Pandas百题笔记

    1.导入 Pandas:

    import pandas as pd

    2.查看 Pandas 版本信息:

    print(pd.__version__)  ==>1.0.1

    Pandas 的数据结构:Pandas 主要有 Series(一维数组),DataFrame(二维数组),Panel(三维数组),Panel4D(四维数组),PanelND(更多维数组)等数据结构。其中 Series 和 DataFrame 应用的最为广泛。

    #Series 是一维带标签的数组,它可以包含任何数据类型。包括整数,字符串,浮点数,Python 对象等。Series 可以通过标签来定位。 #DataFrame 是二维的带标签的数据结构。我们可以通过标签来定位数据。这是 NumPy 所没有的。

    创建 Series 数据类型

    创建 Series 语法:s = pd.Series(data, index=index),可以通过多种方式进行创建,以下介绍了 3 个常用方法。

    3.从列表创建 Series:

    arr = [0, 1, 2, 3, 4]
    s1 = pd.Series(arr)  # 如果不指定索引,则默认从 0 开始
    s1  
    ==> 
    0    0
    1    1
    2    2
    3    3
    4    4
    dtype: int64

    4.从 Ndarray 创建 Series:

    import numpy as np
    
    n = np.random.randn(5)  # 创建一个随机 Ndarray 数组
    
    index = ['a', 'b', 'c', 'd', 'e']
    s2 = pd.Series(n, index=index)
    s2
    ==>
    a   -0.766282
    b    0.134975
    c    0.175090
    d    0.298047
    e    0.171916
    dtype: float64

    5.从字典创建 Series:

    d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # 定义示例字典
    s3 = pd.Series(d)
    s3
    ==>
    a    1
    b    2
    c    3
    d    4
    e    5
    dtype: int64

     

    Series 基本操作

    6.修改 Series 索引:

    print(s1)  # 以 s1 为例
    
    s1.index = ['A', 'B', 'C', 'D', 'E']  # 修改后的索引
    s1
    ==>
    0    0
    1    1
    2    2
    3    3
    4    4
    dtype: int64
    A    0
    B    1
    C    2
    D    3
    E    4
    dtype: int64

    7.Series 纵向拼接:

    s4 = s3.append(s1)  # 将 s1 拼接到 s3
    s4
    ==>
    a    1
    b    2
    c    3
    d    4
    e    5
    A    0
    B    1
    C    2
    D    3
    E    4
    dtype: int64

    8.Series 按指定索引删除元素:

    print(s4)
    s4 = s4.drop('e')  # 删除索引为 e 的值
    s4
    ==>
    a    1
    b    2
    c    3
    d    4
    e    5
    A    0
    B    1
    C    2
    D    3
    E    4
    dtype: int64
    a    1
    b    2
    c    3
    d    4
    A    0
    B    1
    C    2
    D    3
    E    4
    dtype: int64

    9.Series 修改指定索引元素:

    s4['A'] = 6  # 修改索引为 A 的值 = 6
    s4
    ==>
    a    1
    b    2
    c    3
    d    4
    A    6
    B    1
    C    2
    D    3
    E    4
    dtype: int64

    10.Series 按指定索引查找元素:

    s4['B']
    ==>  1

    11.Series 切片操作:

    例如对s4的前 3 个数据访问
    s4[:3]
    ==>
    a    1
    b    2
    c    3
    dtype: int64

    Series 运算

    12.Series 加法运算:

    Series 的加法运算是按照索引计算,如果索引不同则填充为 NaN(空值)。

    s4.add(s3)
    ==>
    A    NaN
    B    NaN
    C    NaN
    D    NaN
    E    NaN
    a    2.0
    b    4.0
    c    6.0
    d    8.0
    e    NaN
    dtype: float64

    13.Series 减法运算:

    Series的减法运算是按照索引对应计算,如果不同则填充为 NaN(空值)。

    s4.sub(s3)
    ==>
    A    NaN
    B    NaN
    C    NaN
    D    NaN
    E    NaN
    a    0.0
    b    0.0
    c    0.0
    d    0.0
    e    NaN
    dtype: float64

    14.Series 乘法运算:

    Series 的乘法运算是按照索引对应计算,如果索引不同则填充为 NaN(空值)。

    s4.mul(s3)
    ==>
    A     NaN
    B     NaN
    C     NaN
    D     NaN
    E     NaN
    a     1.0
    b     4.0
    c     9.0
    d    16.0
    e     NaN
    dtype: float64

    15.Series 除法运算:

    Series 的除法运算是按照索引对应计算,如果索引不同则填充为 NaN(空值)。

    s4.div(s3)
    ==>
    A    NaN
    B    NaN
    C    NaN
    D    NaN
    E    NaN
    a    1.0
    b    1.0
    c    1.0
    d    1.0
    e    NaN
    dtype: float64

    16.Series 求中位数:

    s4.median() ==> 3.0

    17.Series 求和:

    s4.sum()  ==> 26

    18.Series 求最大值:

    s4.max()  ==>  6

    19.Series 求最小值:

    s4.min()  ==> 1

     

    创建 DataFrame 数据类型

    与 Sereis 不同,DataFrame 可以存在多列数据。一般情况下,DataFrame 也更加常用。

    20.通过 NumPy 数组创建 DataFrame:

    dates = pd.date_range('today', periods=6)  # 定义时间序列作为 index
    num_arr = np.random.randn(6, 4)  # 传入 numpy 随机数组
    columns = ['A', 'B', 'C', 'D']  # 将列表作为列名
    df1 = pd.DataFrame(num_arr, index=dates, columns=columns)
    df1
    ==>
                                       A         B         C         D
    2020-07-05 13:58:34.723797 -0.820141  0.205872 -0.928024 -1.828410
    2020-07-06 13:58:34.723797  0.750014 -0.340494  1.190786 -0.204266
    2020-07-07 13:58:34.723797 -2.062106 -1.520711  1.414341  1.057326
    2020-07-08 13:58:34.723797 -0.821653  0.564271 -1.274913  2.340385
    2020-07-09 13:58:34.723797 -1.936687  0.447897 -0.108420  0.133166
    2020-07-10 13:58:34.723797  0.707222 -1.251812 -0.235982  0.340147

    21.通过字典数组创建 DataFrame:

    data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
            'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
            'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
            'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
    
    labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
    df2 = pd.DataFrame(data, index=labels)
    df2
    ==>
      animal  age  visits priority
    a    cat  2.5       1      yes
    b    cat  3.0       3      yes
    c  snake  0.5       2       no
    d    dog  NaN       3      yes
    e    dog  5.0       2       no
    f    cat  2.0       3       no
    g  snake  4.5       1       no
    h    cat  NaN       1      yes
    i    dog  7.0       2       no
    j    dog  3.0       1       no
    #字典中的键值直接变为列名

    22.查看 DataFrame 的数据类型:

    df2.dtypes
    ==>
    animal       object
    age         float64
    visits        int64
    priority     object
    dtype: object

    DataFrame 基本操作

    23.预览 DataFrame 的前 5 行数据:

    此方法对快速了解陌生数据集结构十分有用。

    df2.head()  # 默认为显示 5 行,可根据需要在括号中填入希望预览的行数
    ==>
        animal  age    visits priority
    a    cat    2.5       1      yes
    b    cat    3.0       3      yes
    c  snake    0.5       2       no
    d    dog    NaN       3      yes
    e    dog    5.0       2       no

    24.查看 DataFrame 的后 3 行数据:

    df2.tail(3)
    ==>
        animal  age  visits priority
    h    cat  NaN       1      yes
    i    dog  7.0       2       no
    j    dog  3.0       1       no

    25.查看 DataFrame 的索引:

    df2.index
    ==>  Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

    26.查看 DataFrame 的列名:

    df2.columns  
    ==>Index(['animal', 'age', 'visits', 'priority'], dtype='object')

    27.查看 DataFrame 的数值:

    df2.values
    ==>
    array([['cat', 2.5, 1, 'yes'],
           ['cat', 3.0, 3, 'yes'],
           ['snake', 0.5, 2, 'no'],
           ['dog', nan, 3, 'yes'],
           ['dog', 5.0, 2, 'no'],
           ['cat', 2.0, 3, 'no'],
           ['snake', 4.5, 1, 'no'],
           ['cat', nan, 1, 'yes'],
           ['dog', 7.0, 2, 'no'],
           ['dog', 3.0, 1, 'no']], dtype=object)

    28.查看 DataFrame 的统计数据:

    df2.describe()
    ==>
                age     visits
    count  8.000000  10.000000
    mean   3.437500   1.900000
    std    2.007797   0.875595
    min    0.500000   1.000000
    25%    2.375000   1.000000
    50%    3.000000   2.000000
    75%    4.625000   2.750000
    max    7.000000   3.000000

    29.DataFrame 转置操作:

       df2.T
        ==>
                a    b      c    d    e    f      g    h    i    j
        animal    cat  cat  snake  dog  dog  cat  snake  cat  dog  dog
        age       2.5    3    0.5  NaN    5    2    4.5  NaN    7    3
        visits      1    3      2    3    2    3      1    1    2    1
        priority  yes  yes     no  yes   no   no     no  yes   no   no

    30.对 DataFrame 进行按列排序:

    df2.sort_values(by='age')  # 按 age 升序排列
        ==>
          animal  age  visits priority
        c  snake  0.5       2       no
        f    cat  2.0       3       no
        a    cat  2.5       1      yes
        b    cat  3.0       3      yes
        j    dog  3.0       1       no
        g  snake  4.5       1       no
        e    dog  5.0       2       no
        i    dog  7.0       2       no
        d    dog  NaN       3      yes
        h    cat  NaN       1      yes

    31.对 DataFrame 数据切片:

    df2[1:3]
    ==>
        animal  age  visits priority
    b    cat  3.0       3      yes
    c  snake  0.5       2       no

    32.对 DataFrame 通过标签查询(单列):

    df2['age']
    ==>
    a    2.5
    b    3.0
    c    0.5
    d    NaN
    e    5.0
    f    2.0
    g    4.5
    h    NaN
    i    7.0
    j    3.0
    Name: age, dtype: float64
    
    df2.age  # 等价于 df2['age']

    33.对 DataFrame 通过标签查询(多列):

    df2[['age', 'animal']]  # 传入一个列名组成的列表
    ==>
       age   animal
    a  2.5    cat
    b  3.0    cat
    c  0.5  snake
    d  NaN    dog
    e  5.0    dog
    f  2.0    cat
    g  4.5  snake
    h  NaN    cat
    i  7.0    dog
    j  3.0    dog

    34.对 DataFrame 通过位置查询:

    df2.iloc[1:3]  # 查询 2,3 行
    ==>
        animal  age  visits priority
    b    cat  3.0       3      yes
    c  snake  0.5       2       no

    35.DataFrame 副本拷贝:

    生成 DataFrame 副本,方便数据集被多个不同流程使用

    df3 = df2.copy()
    df3
    ==>
      animal  age  visits priority
    a    cat  2.5       1      yes
    b    cat  3.0       3      yes
    c  snake  0.5       2       no
    d    dog  NaN       3      yes
    e    dog  5.0       2       no
    f    cat  2.0       3       no
    g  snake  4.5       1       no
    h    cat  NaN       1      yes
    i    dog  7.0       2       no
    j    dog  3.0       1       no

    36.判断 DataFrame 元素是否为空:

    df3.isnull()  # 如果为空则返回为 True
    ==>
          animal    age  visits  priority
    a   False  False   False     False
    b   False  False   False     False
    c   False  False   False     False
    d   False   True   False     False
    e   False  False   False     False
    f   False  False   False     False
    g   False  False   False     False
    h   False   True   False     False
    i   False  False   False     False
    j   False  False   False     False

    37.添加列数据:

    num = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=df3.index)
    
    df3['No.'] = num  # 添加以 'No.' 为列名的新数据列
    df3
    ==>
      animal  age  visits priority  No.
    a    cat  2.5       1      yes    0
    b    cat  3.0       3      yes    1
    c  snake  0.5       2       no    2
    d    dog  NaN       3      yes    3
    e    dog  5.0       2       no    4
    f    cat  2.0       3       no    5
    g  snake  4.5       1       no    6
    h    cat  NaN       1      yes    7
    i    dog  7.0       2       no    8
    j    dog  3.0       1       no    9

    38.根据 DataFrame 的下标值进行更改:

    修改第 2 行与第 2 列对应的值 3.0 → 2.0

    df3.iat[1, 1] = 2  # 索引序号从 0 开始,这里为 1, 1
    df3
    ==>
      animal  age  visits priority  No.
    a    cat  2.5       1      yes    0
    b    cat  2.0       3      yes    1
    c  snake  0.5       2       no    2
    d    dog  NaN       3      yes    3
    e    dog  5.0       2       no    4
    f    cat  2.0       3       no    5
    g  snake  4.5       1       no    6
    h    cat  NaN       1      yes    7
    i    dog  7.0       2       no    8
    j    dog  3.0       1       no    9

    39.根据 DataFrame 的标签对数据进行修改:

    df3.loc['f', 'age'] = 1.5
    df3
    ==>
        animal  age  visits priority  No.
    a    cat  2.5       1      yes    0
    b    cat  2.0       3      yes    1
    c  snake  0.5       2       no    2
    d    dog  NaN       3      yes    3
    e    dog  5.0       2       no    4
    f    cat  1.5       3       no    5
    g  snake  4.5       1       no    6
    h    cat  NaN       1      yes    7
    i    dog  7.0       2       no    8
    j    dog  3.0       1       no    9

    40.DataFrame 求平均值操作:

    df3.mean()
    ==>
    age       3.25
    visits    1.90
    No.       4.50
    dtype: float64

    41.对 DataFrame 中任意列做求和操作:

    df3['visits'].sum()  
    ==> 19

    字符串操作

    42.将字符串转化为小写字母:

    string = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
                        np.nan, 'CABA', 'dog', 'cat'])
    print(string)
    string.str.lower()
    ==>
    0       A
    1       B
    2       C
    3    Aaba
    4    Baca
    5     NaN
    6    CABA
    7     dog
    8     cat
    dtype: object
    0       a
    1       b
    2       c
    3    aaba
    4    baca
    5     NaN
    6    caba
    7     dog
    8     cat
    dtype: object

    43.将字符串转化为大写字母:

    string.str.upper()
    ==>
    0       A
    1       B
    2       C
    3    AABA
    4    BACA
    5     NaN
    6    CABA
    7     DOG
    8     CAT
    dtype: object

    DataFrame 缺失值操作

    44.对缺失值进行填充:

    df4 = df3.copy()
    print(df4)
    df4.fillna(value=3)
    ==>
        animal  age  visits priority  No.
    a    cat  2.5       1      yes    0
    b    cat  2.0       3      yes    1
    c  snake  0.5       2       no    2
    d    dog  NaN       3      yes    3
    e    dog  5.0       2       no    4
    f    cat  1.5       3       no    5
    g  snake  4.5       1       no    6
    h    cat  NaN       1      yes    7
    i    dog  7.0       2       no    8
    j    dog  3.0       1       no    9
    
      animal  age  visits priority  No.
    a    cat  2.5       1      yes    0
    b    cat  2.0       3      yes    1
    c  snake  0.5       2       no    2
    d    dog  3.0       3      yes    3
    e    dog  5.0       2       no    4
    f    cat  1.5       3       no    5
    g  snake  4.5       1       no    6
    h    cat  3.0       1      yes    7
    i    dog  7.0       2       no    8
    j    dog  3.0       1       no    9

    45.删除存在缺失值的行:

    df5 = df3.copy()
    print(df5)
    df5.dropna(how='any')  # 任何存在 NaN 的行都将被删除
    ==>
        animal  age  visits priority  No.
    a    cat  2.5       1      yes    0
    b    cat  2.0       3      yes    1
    c  snake  0.5       2       no    2
    d    dog  NaN       3      yes    3
    e    dog  5.0       2       no    4
    f    cat  1.5       3       no    5
    g  snake  4.5       1       no    6
    h    cat  NaN       1      yes    7
    i    dog  7.0       2       no    8
    j    dog  3.0       1       no    9
    
      animal  age  visits priority  No.
    a    cat  2.5       1      yes    0
    b    cat  2.0       3      yes    1
    c  snake  0.5       2       no    2
    e    dog  5.0       2       no    4
    f    cat  1.5       3       no    5
    g  snake  4.5       1       no    6
    i    dog  7.0       2       no    8
    j    dog  3.0       1       no    9

    46.DataFrame 按指定列对齐:

    left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
    right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})
    
    print(left)
    print(right)

    按照 key 列对齐连接,只存在 foo2 相同,所以最后变成一行

    pd.merge(left, right, on='key')
    ==>
        key  one
    0  foo1    1
    1  foo2    2
    
    ​     key  two
    
    0  foo2    4
    1  foo3    5
    
    ​      key    one   two
    
    0      foo2    2     4

    DataFrame 文件操作

    47.CSV 文件写入:

    df3.to_csv('animal.csv')
    print("写入成功.")
    ==>  写入成功.

    48.CSV 文件读取:

    df_animal = pd.read_csv('animal.csv')
    df_animal
    ==>
        Unnamed: 0 animal  age  visits priority  No.
    0          a    cat  2.5       1      yes    0
    1          b    cat  2.0       3      yes    1
    2          c  snake  0.5       2       no    2
    3          d    dog  NaN       3      yes    3
    4          e    dog  5.0       2       no    4
    5          f    cat  1.5       3       no    5
    6          g  snake  4.5       1       no    6
    7          h    cat  NaN       1      yes    7
    8          i    dog  7.0       2       no    8
    9          j    dog  3.0       1       no    9

    49.Excel 写入操作:

    df3.to_excel('animal.xlsx', sheet_name='Sheet1')
    print("写入成功.")
    ==>  写入成功.

    50.Excel 读取操作:

    pd.read_excel('animal.xlsx', 'Sheet1', index_col=None, na_values=['NA'])  
    ==>
        Unnamed: 0 animal  age  visits priority  No.
    0          a    cat  2.5       1      yes    0
    1          b    cat  2.0       3      yes    1
    2          c  snake  0.5       2       no    2
    3          d    dog  NaN       3      yes    3
    4          e    dog  5.0       2       no    4
    5          f    cat  1.5       3       no    5
    6          g  snake  4.5       1       no    6
    7          h    cat  NaN       1      yes    7
    8          i    dog  7.0       2       no    8
    9          j    dog  3.0       1       no    9

    进阶部分

    时间序列索引

    51.建立一个以 2018 年每一天为索引,值为随机数的 Series:

    dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
    s = pd.Series(np.random.rand(len(dti)), index=dti)
    s 
    ==>
    2018-01-01    0.441330
    2018-01-02    0.182571
    2018-01-03    0.141348
    2018-01-04    0.604700
    2018-01-05    0.300351
                    ...   
    2018-12-27    0.499318
    2018-12-28    0.530867
    2018-12-29    0.183895
    2018-12-30    0.163899
    2018-12-31    0.173812
    Freq: D, Length: 365, dtype: float64

    52.统计s 中每一个周三对应值的和:

    周一从 0 开始

    s[s.index.weekday == 2].sum()
    ==>  22.592391213957054

    53.统计s中每个月值的平均值:

    s.resample('M').mean()
    ==>
    2018-01-31    0.441100
    2018-02-28    0.506476
    2018-03-31    0.501672
    2018-04-30    0.510073
    2018-05-31    0.416773
    2018-06-30    0.525039
    2018-07-31    0.433221
    2018-08-31    0.472530
    2018-09-30    0.388529
    2018-10-31    0.550011
    2018-11-30    0.486513
    2018-12-31    0.443012
    Freq: M, dtype: float64

    54.将 Series 中的时间进行转换(秒转分钟):

    s = pd.date_range('today', periods=100, freq='S')
    
    ts = pd.Series(np.random.randint(0, 500, len(s)), index=s)
    
    ts.resample('Min').sum()
    ==>
    2020-07-05 14:48:00    15836
    2020-07-05 14:49:00     9298
    Freq: T, dtype: int64

    55.UTC 世界时间标准:

    s = pd.date_range('today', periods=1, freq='D')  # 获取当前时间
    ts = pd.Series(np.random.randn(len(s)), s)  # 随机数值
    ts_utc = ts.tz_localize('UTC')  # 转换为 UTC 时间
    ts_utc
    ==>
    2020-07-05 14:48:38.609382+00:00   -0.348899
    Freq: D, dtype: float64

    56.转换为上海所在时区:

    ts_utc.tz_convert('Asia/Shanghai')
    ==>
    2020-07-05 22:48:38.609382+08:00   -0.348899
    Freq: D, dtype: float64

    57.不同时间表示方式的转换:

    rng = pd.date_range('1/1/2018', periods=5, freq='M')
    ts = pd.Series(np.random.randn(len(rng)), index=rng)
    print(ts)
    ps = ts.to_period()
    print(ps)
    ps.to_timestamp()
    ==>
    2018-01-31    0.621688
    2018-02-28   -1.937715
    2018-03-31    0.081314
    2018-04-30   -1.308769
    2018-05-31   -0.075345
    Freq: M, dtype: float64
    2018-01    0.621688
    2018-02   -1.937715
    2018-03    0.081314
    2018-04   -1.308769
    2018-05   -0.075345
    Freq: M, dtype: float64
    2018-01-01    0.621688
    2018-02-01   -1.937715
    2018-03-01    0.081314
    2018-04-01   -1.308769
    2018-05-01   -0.075345
    Freq: MS, dtype: float64

    Series 多重索引

    58.创建多重索引 Series:

    构建一个 letters = ['A', 'B', 'C'] 和 numbers = list(range(10))为索引,值为随机数的多重索引 Series。

    letters = ['A', 'B', 'C']
    numbers = list(range(10))
    
    mi = pd.MultiIndex.from_product([letters, numbers])  # 设置多重索引
    s = pd.Series(np.random.rand(30), index=mi)  # 随机数
    s
    ==>
    A  0    0.698046
       1    0.380276
       2    0.873395
       3    0.628864
       4    0.528025
       5    0.677856
       6    0.194495
       7    0.164484
       8    0.018238
       9    0.747468
    B  0    0.623616
       1    0.560504
       2    0.731296
       3    0.760307
       4    0.807663
       5    0.347980
       6    0.005892
       7    0.807262
       8    0.650353
       9    0.803976
    C  0    0.387503
       1    0.943305
       2    0.215817
       3    0.128086
       4    0.252103
       5    0.048908
       6    0.779633
       7    0.825234
       8    0.624257
       9    0.263373
    dtype: float64

    59.多重索引 Series 查询:

    查询索引为 1,3,6 的值

    s.loc[:, [1, 3, 6]]
    ==>
    A  1    0.380276
       3    0.628864
       6    0.194495
    B  1    0.560504
       3    0.760307
       6    0.005892
    C  1    0.943305
       3    0.128086
       6    0.779633
    dtype: float64

    60.多重索引 Series 切片:

    s.loc[pd.IndexSlice[:'B', 5:]]
    ==>
    A  5    0.677856
          6    0.194495
          7    0.164484
          8    0.018238
          9    0.747468
    B  5    0.347980
          6    0.005892
          7    0.807262
          8    0.650353
          9    0.803976
    dtype: float64

    DataFrame 多重索引

    61.根据多重索引创建 DataFrame:

    创建一个以 letters = ['A', 'B'] 和 numbers = list(range(6))为索引,值为随机数据的多重索引 DataFrame。

    frame = pd.DataFrame(np.arange(12).reshape(6, 2),
                         index=[list('AAABBB'), list('123123')],
                         columns=['hello', 'shiyanlou'])
    frame
    ==>
         hello  shiyanlou
    A 1      0          1
        2      2          3
        3      4          5
    B 1      6          7
        2      8          9
        3     10         11

    62.多重索引设置列名称:

    frame.index.names = ['first', 'second']
    frame
    ==>
                     hello    shiyanlou
    first second                  
    A       1           0          1
            2           2          3
            3           4          5
    B       1           6          7
            2           8          9
            3          10         11

    63.DataFrame 多重索引分组求和:

    frame.groupby('first').sum()
    ==>
           hello  shiyanlou
    first                  
    A          6          9
    B         24         27

    64.DataFrame 行列名称转换:

    print(frame)
    frame.stack()
    ==>
                  hello  shiyanlou
    first second                  
    A     1           0          1
          2           2          3
          3           4          5
    B     1           6          7
          2           8          9
          3          10         11
    first  second           
    A      1       hello         0
                   shiyanlou     1
           2       hello         2
                   shiyanlou     3
           3       hello         4
                   shiyanlou     5
    B      1       hello         6
                   shiyanlou     7
           2       hello         8
                   shiyanlou     9
           3       hello        10
                   shiyanlou    11
    dtype: int64

    65.DataFrame 索引转换:

    print(frame)
    frame.unstack()
    ==>
                  hello  shiyanlou
    first second                  
    A     1           0          1
          2           2          3
          3           4          5
    B     1           6          7
          2           8          9
          3          10         11
           hello        shiyanlou       
    second     1  2   3         1  2   3
    first                               
    A          0  2   4         1  3   5
    B          6  8  10         7  9  11

    66.DataFrame 条件查找:

    示例数据

    data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
            'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
            'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
            'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
    
    labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
    df = pd.DataFrame(data, index=labels)

    查找 age 大于 3 的全部信息

    df[df['age'] > 3]
    ==>
      animal  age  visits priority
    e    dog  5.0       2       no
    g  snake  4.5       1       no
    i    dog  7.0       2       no

     

    67.根据行列索引切片:

    df.iloc[2:4, 1:3]
    ==>
          age  visits
    c  0.5       2
    d  NaN       3

    68.DataFrame 多重条件查询:

    查找 age<3 且为 cat 的全部数据。

    df = pd.DataFrame(data, index=labels)
    
    df[(df['animal'] == 'cat') & (df['age'] < 3)]
    ==>
      animal  age  visits priority
    a    cat  2.5       1      yes
    f    cat  2.0       3       no

    69.DataFrame 按关键字查询:

    df3[df3['animal'].isin(['cat', 'dog'])]
    ==>
        animal  age  visits priority  No.
    a    cat  2.5       1      yes    0
    b    cat  2.0       3      yes    1
    d    dog  NaN       3      yes    3
    e    dog  5.0       2       no    4
    f    cat  1.5       3       no    5
    h    cat  NaN       1      yes    7
    i    dog  7.0       2       no    8
    j    dog  3.0       1       no    9

    70.DataFrame 按标签及列名查询:

    df.loc[df2.index[[3, 4, 8]], ['animal', 'age']]
    ==>
        animal  age
    d    dog  NaN
    e    dog  5.0
    i    dog  7.0

    71.DataFrame 多条件排序:

    按照 age 降序,visits 升序排列

    df.sort_values(by=['age', 'visits'], ascending=[False, True])
    ==>
        animal  age  visits priority
    i    dog  7.0       2       no
    e    dog  5.0       2       no
    g  snake  4.5       1       no
    j    dog  3.0       1       no
    b    cat  3.0       3      yes
    a    cat  2.5       1      yes
    f    cat  2.0       3       no
    c  snake  0.5       2       no
    h    cat  NaN       1      yes
    d    dog  NaN       3      yes

    72.DataFrame 多值替换:

    将 priority 列的 yes 值替换为 True,no 值替换为 False。

    df['priority'].map({'yes': True, 'no': False})
    ==>
    a     True
    b     True
    c    False
    d     True
    e    False
    f    False
    g    False
    h     True
    i    False
    j    False
    Name: priority, dtype: bool

    73.DataFrame 分组求和:

    df4.groupby('animal').sum()
    ==>
             age  visits  No.
    animal                   
    cat      6.0       8   13
    dog     15.0       8   24
    snake    5.0       3    8

    74.使用列表拼接多个 DataFrame:

    temp_df1 = pd.DataFrame(np.random.randn(5, 4))  # 生成由随机数组成的 DataFrame 1
    temp_df2 = pd.DataFrame(np.random.randn(5, 4))  # 生成由随机数组成的 DataFrame 2
    temp_df3 = pd.DataFrame(np.random.randn(5, 4))  # 生成由随机数组成的 DataFrame 3
    
    print(temp_df1)
    print(temp_df2)
    print(temp_df3)
    
    pieces = [temp_df1, temp_df2, temp_df3]
    pd.concat(pieces)
    ==>
           0         1         2         3
    0  1.061349  0.927805 -0.270724  0.232218
    1 -2.049875 -0.896899 -0.738298  0.547709
    2  0.084709 -1.801844  0.610220 -1.304246
    3  1.384591  0.872657 -0.829547 -0.332316
    4 -0.255004  2.177881  0.615079  0.767592
    
    ​     0          1         2        3
    
    0  0.009016  1.181569 -1.403829 -0.745604
    1 -0.270313 -0.258377 -1.067346  1.465726
    2 -1.619676 -0.324374 -0.433600  0.211323
    3  0.163223  0.144191  0.717129 -0.555298
    4 -0.718321  1.688866 -0.607994  1.731248
    
    ​      0         1         2         3
    
    0 -1.178622  0.415409  0.496004  1.368869
    1  0.724433 -0.262059  0.514689 -1.666051
    2 -0.325606  0.013015  1.010961  2.075196
    3  2.212960 -0.132432 -1.603347 -1.182487
    4  0.102536  1.384535  0.411434 -0.175592
    
    ​    0          1          2         3
    
    0  1.061349  0.927805 -0.270724  0.232218
    1 -2.049875 -0.896899 -0.738298  0.547709
    2  0.084709 -1.801844  0.610220 -1.304246
    3  1.384591  0.872657 -0.829547 -0.332316
    4 -0.255004  2.177881  0.615079  0.767592
    0  0.009016  1.181569 -1.403829 -0.745604
    1 -0.270313 -0.258377 -1.067346  1.465726
    2 -1.619676 -0.324374 -0.433600  0.211323
    3  0.163223  0.144191  0.717129 -0.555298
    4 -0.718321  1.688866 -0.607994  1.731248
    0 -1.178622  0.415409  0.496004  1.368869
    1  0.724433 -0.262059  0.514689 -1.666051
    2 -0.325606  0.013015  1.010961  2.075196
    3  2.212960 -0.132432 -1.603347 -1.182487
    4  0.102536  1.384535  0.411434 -0.175592

    75.找出 DataFrame 表中和最小的列:

    df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
    print(df)
    df.sum().idxmin()  # idxmax(), idxmin() 为 Series 函数返回最大最小值的索引值
    ==>
              a         b         c         d         e         f         g  
    0  0.931149  0.641776  0.758608  0.630512  0.170375  0.211306  0.973363   
    1  0.730186  0.682949  0.554609  0.356089  0.399012  0.939087  0.908047   
    2  0.261405  0.434525  0.490395  0.368307  0.832568  0.571115  0.936016   
    3  0.161993  0.132176  0.852158  0.140710  0.165902  0.564976  0.656718   
    4  0.810233  0.385639  0.127849  0.166585  0.302643  0.947498  0.164274   
    
    ​            h                 i                j  
    
    0  0.223378  0.115285  0.161207  
    1  0.765946  0.206518  0.951096  
    2  0.891956  0.430530  0.045640  
    3  0.955571  0.962989  0.123037  
    4  0.391810  0.696404  0.561719  
    'd'

    76.DataFrame 中每个元素减去每一行的平均值:

    df = pd.DataFrame(np.random.random(size=(5, 3)))
    print(df)
    df.sub(df.mean(axis=1), axis=0)
    ==>
              0         1         2
    0  0.028539  0.555065  0.166588
    1  0.781335  0.086089  0.616780
    2  0.022462  0.047383  0.476410
    3  0.796853  0.850955  0.765398
    4  0.208298  0.858031  0.264920
    
    ​      0         1         2
    
    0 -0.221525  0.305001 -0.083476
    1  0.286600 -0.408646  0.122046
    2 -0.159623 -0.134702  0.294325
    3 -0.007549  0.046553 -0.039004
    4 -0.235452  0.414281 -0.178830
    

      

    77.DataFrame 分组,并得到每一组中最大三个数之和:

    df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
                       'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
    print(df)
    df.groupby('A')['B'].nlargest(3).sum(level=0)
    ==>
        A    B
    0   a   12
    1   a  345
    2   a    3
    3   b    1
    4   b   45
    5   c   14
    6   a    4
    7   a   52
    8   b   54
    9   c   23
    10  c  235
    11  c   21
    12  b   57
    13  b    3
    14  c   87
    A
    a    409
    b    156
    c    345
    Name: B, dtype: int64
    

      

    透视表

    当分析庞大的数据时,为了更好的发掘数据特征之间的关系,且不破坏原数据,就可以利用透视表 pivot_table 进行操作。

    78.透视表的创建:

    新建表将 A, B, C 列作为索引进行聚合。

    df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                       'B': ['A', 'B', 'C'] * 4,
                       'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                       'D': np.random.randn(12),
                       'E': np.random.randn(12)})
    
    print(df)
    
    print(pd.pivot_table(df, index=['A', 'B']))
    ==>
            A  B    C         D         E
    0     one  A  foo -2.718717  1.749056
    1     one  B  foo -0.710776  0.442023
    2     two  C  foo -0.824951  2.244523
    3   three  A  bar  0.300916  1.709200
    4     one  B  bar -2.590790  0.292709
    5     one  C  bar  0.908543 -0.598258
    6     two  A  foo -0.521278  0.204491
    7   three  B  foo -3.302320 -1.762640
    8     one  C  foo -1.311013 -0.722187
    9     one  A  bar  0.785471 -0.231635
    10    two  B  bar -1.758329 -0.031603
    11  three  C  bar  1.236829  1.235032
    
    ​                     D         E
    
    A     B                    
    one   A -0.966623  0.758711
          B -1.650783  0.367366
          C -0.201235 -0.660222
    three A  0.300916  1.709200
          B -3.302320 -1.762640
          C  1.236829  1.235032
    two   A -0.521278  0.204491
          B -1.758329 -0.031603
          C -0.824951  2.244523

    79.透视表按指定行进行聚合:

    将该 DataFrame 的 D 列聚合,按照 A,B 列为索引进行聚合,聚合的方式为默认求均值。

    pd.pivot_table(df, values=['D'], index=['A', 'B'])
    ==>
                    D
    A     B          
    one   A -0.966623
          B -1.650783
          C -0.201235
    three A  0.300916
          B -3.302320
          C  1.236829
    two   A -0.521278
          B -1.758329
          C -0.824951

    80.透视表聚合方式定义:

    上一题中 D 列聚合时,采用默认求均值的方法,若想使用更多的方式可以在 aggfunc 中实现。

    pd.pivot_table(df, values=['D'], index=['A', 'B'], aggfunc=[np.sum, len])
    ==>
                  sum  len
                    D    D
    A     B               
    one   A -1.933246  2.0
          B -3.301567  2.0
          C -0.402470  2.0
    three A  0.300916  1.0
          B -3.302320  1.0
          C  1.236829  1.0
    two   A -0.521278  1.0
          B -1.758329  1.0
          C -0.824951  1.0

    81.透视表利用额外列进行辅助分割:

    D 列按照 A,B 列进行聚合时,若关心 C 列对 D 列的影响,可以加入 columns 值进行分析。

    pd.pivot_table(df, values=['D'], index=['A', 'B'],
                   columns=['C'], aggfunc=np.sum)
    ==>
                    D          
    C             bar       foo
    A     B                    
    one   A  0.785471 -2.718717
          B -2.590790 -0.710776
          C  0.908543 -1.311013
    three A  0.300916       NaN
          B       NaN -3.302320
          C  1.236829       NaN
    two   A       NaN -0.521278
          B -1.758329       NaN
          C       NaN -0.824951

    82.透视表的缺省值处理:

    在透视表中由于不同的聚合方式,相应缺少的组合将为缺省值,可以加入 fill_value 对缺省值处理。

    pd.pivot_table(df, values=['D'], index=['A', 'B'],
                   columns=['C'], aggfunc=np.sum, fill_value=0)
    ==>
                    D          
    C             bar       foo
    A     B                    
    one   A  0.785471 -2.718717
          B -2.590790 -0.710776
          C  0.908543 -1.311013
    three A  0.300916  0.000000
          B  0.000000 -3.302320
          C  1.236829  0.000000
    two   A  0.000000 -0.521278
          B -1.758329  0.000000
          C  0.000000 -0.824951
    

    绝对类型¶

    在数据的形式上主要包括数量型和性质型,数量型表示着数据可数范围可变,而性质型表示范围已经确定不可改变,绝对型数据就是性质型数据的一种。

    83.绝对型数据定义:

    df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": [
                      'a', 'b', 'b', 'a', 'a', 'e']})
    df["grade"] = df["raw_grade"].astype("category")
    df
    ==>
          id raw_grade grade
    0   1         a     a
    1   2         b     b
    2   3         b     b
    3   4         a     a
    4   5         a     a
    5   6         e     e

    84.对绝对型数据重命名:

    df["grade"].cat.categories = ["very good", "good", "very bad"]
    df
    ==>
          id raw_grade      grade
    0   1         a  very good
    1   2         b       good
    2   3         b       good
    3   4         a  very good
    4   5         a  very good
    5   6         e   very bad

    85.重新排列绝对型数据并补充相应的缺省值:

    df["grade"] = df["grade"].cat.set_categories(
        ["very bad", "bad", "medium", "good", "very good"])
    df
    ==>
          id raw_grade      grade
    0   1         a  very good
    1   2         b       good
    2   3         b       good
    3   4         a  very good
    4   5         a  very good
    5   6         e   very bad

    86.对绝对型数据进行排序:

    df.sort_values(by="grade")
    ==>
          id raw_grade      grade
    5   6         e   very bad
    1   2         b       good
    2   3         b       good
    0   1         a  very good
    3   4         a  very good
    4   5         a  very good

    87.对绝对型数据进行分组:

    df.groupby("grade").size()
    ==>
    grade
    very bad     1
    bad          0
    medium       0
    good         2
    very good    3
    dtype: int64

     

    数据清洗

    常常我们得到的数据是不符合我们最终处理的数据要求,包括许多缺省值以及坏的数据,需要我们对数据进行清洗。

    88.缺失值拟合:

    在FilghtNumber中有数值缺失,其中数值为按 10 增长,补充相应的缺省值使得数据完整,并让数据为 int 类型。

    df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                                   'Budapest_PaRis', 'Brussels_londOn'],
                       'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                       'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                       'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                                   '12. Air France', '"Swiss Air"']})
    df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
    df
    ==>
                From_To  FlightNumber  RecentDelays              Airline
    0      LoNDon_paris         10045      [23, 47]               KLM(!)
    1      MAdrid_miLAN         10055            []    <Air France> (12)
    2  londON_StockhOlm         10065  [24, 43, 87]  (British Airways. )
    3    Budapest_PaRis         10075          [13]       12. Air France
    4   Brussels_londOn         10085      [67, 32]          "Swiss Air"

    89.数据列拆分:

    其中From_to应该为两独立的两列From和To,将From_to依照_拆分为独立两列建立为一个新表。

    temp = df.From_To.str.split('_', expand=True)
    temp.columns = ['From', 'To']
    temp
    ==>
           From         To
    0    LoNDon      paris
    1    MAdrid      miLAN
    2    londON  StockhOlm
    3  Budapest      PaRis
    4  Brussels     londOn

    90.字符标准化:

    其中注意到地点的名字都不规范(如:londON应该为London)需要对数据进行标准化处理。

    temp['From'] = temp['From'].str.capitalize()
    temp['To'] = temp['To'].str.capitalize()

    91.删除坏数据加入整理好的数据:

    将最开始的 From_to 列删除,加入整理好的 From 和 to 列。

    df = df.drop('From_To', axis=1)
    df = df.join(temp)
    print(df)
    ==>
          FlightNumber  RecentDelays              Airline      From         To
    0         10045      [23, 47]               KLM(!)    London      Paris
    1         10055            []    <Air France> (12)    Madrid      Milan
    2         10065  [24, 43, 87]  (British Airways. )    London  Stockholm
    3         10075          [13]       12. Air France  Budapest      Paris
    4         10085      [67, 32]          "Swiss Air"  Brussels     London

    92.去除多余字符:

    如同 airline 列中许多数据有许多其他字符,会对后期的数据分析有较大影响,需要对这类数据进行修正。

    df['Airline'] = df['Airline'].str.extract(
        '([a-zA-Zs]+)', expand=False).str.strip()
    df
    ==>
          FlightNumber          Airline      From         To  delay_1  delay_2  
    0         10045              KLM    London      Paris     23.0     47.0   
    1         10055       Air France    Madrid      Milan      NaN      NaN   
    2         10065  British Airways    London  Stockholm     24.0     43.0   
    3         10075       Air France  Budapest      Paris     13.0      NaN   
    4         10085        Swiss Air  Brussels     London     67.0     32.0   
    
       delay_3  
    0      NaN  
    1      NaN  
    2     87.0  
    3      NaN  
    4      NaN  

    93.格式规范:

    在 RecentDelays 中记录的方式为列表类型,由于其长度不一,这会为后期数据分析造成很大麻烦。这里将 RecentDelays 的列表拆开,取出列表中的相同位置元素作为一列,若为空值即用 NaN 代替。

    delays = df['RecentDelays'].apply(pd.Series)
    
    delays.columns = ['delay_{}'.format(n)
                      for n in range(1, len(delays.columns)+1)]
    
    df = df.drop('RecentDelays', axis=1).join(delays)
    df
    ==>
        FlightNumber	Airline	     From	         To	    delay_1	delay_2	delay_3
    0       10045         KLM	    London          Paris     23.0    47.0    NaN
    1       10055   Air France      Madrid          Milan     NaN     NaN     NaN
    2       10065   British Airways London        Stockholm   24.0    43.0    87.0
    3       10075   Air France      Budapest        Paris     13.0    NaN     NaN
    4       10085   Swiss Air       Brussels        London    67.0    32.0    NaN

     

    数据预处理

    94.信息区间划分:

    班级一部分同学的数学成绩表,如下图所示

    df=pd.DataFrame({'name':['Alice','Bob','Candy','Dany','Ella','Frank','Grace','Jenny'],
                     'grades':[58,83,79,65,93,45,61,88]})

    但我们更加关心的是该同学是否及格,将该数学成绩按照是否>60来进行划分。

    df = pd.DataFrame({'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella',
                                'Frank', 'Grace', 'Jenny'],
                       'grades': [58, 83, 79, 65, 93, 45, 61, 88]})
    
    
    def choice(x):
        if x > 60:
            return 1
        else:
            return 0
    
    
    df.grades = pd.Series(map(lambda x: choice(x), df.grades))
    df
    ==>
        name  grades
    0  Alice       0
    1    Bob       1
    2  Candy       1
    3   Dany       1
    4   Ella       1
    5  Frank       0
    6  Grace       1
    7  Jenny       1

    95.数据去重:

    一个列为A的 DataFrame 数据,如下图所示

    df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

    尝试将 A 列中连续重复的数据清除。

    df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
    df.loc[df['A'].shift() != df['A']]
    ==>
          A
    0  1
    1  2
    3  3
    4  4
    5  5
    8  6
    9  7

    96.数据归一化:

    有时候,DataFrame 中不同列之间的数据差距太大,需要对其进行归一化处理。 其中,Max-Min 归一化是简单而常见的一种方式,公式如下:

     
    def normalization(df):
        numerator = df.sub(df.min())
        denominator = (df.max()).sub(df.min())
        Y = numerator.div(denominator)
        return Y
    
    df = pd.DataFrame(np.random.random(size=(5, 3)))
    print(df)
    normalization(df)
    ==>
              0         1         2
    0  0.923325  0.925392  0.203170
    1  0.770389  0.050410  0.605788
    2  0.146447  0.542584  0.056240
    3  0.161917  0.841527  0.547914
    4  0.948175  0.814426  0.980268
    
    ​      0         1         2
    
    0  0.969004  1.000000  0.159009
    1  0.778247  0.000000  0.594731
    2  0.000000  0.562496  0.000000
    3  0.019297  0.904153  0.532098
    4  1.000000  0.873179  1.000000
    

      

    
    

    Pandas 绘图操作

    为了更好的了解数据包含的信息,最直观的方法就是将其绘制成图。

    97.Series 可视化:

    %matplotlib inline
    ts = pd.Series(np.random.randn(100), index=pd.date_range('today', periods=100))
    ts = ts.cumsum()
    ts.plot()

    ==>输出图像:

    98.DataFrame 折线图:

    df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,
                      columns=['A', 'B', 'C', 'D'])
    df = df.cumsum()
    df.plot()

    ==>输出图像:

    99.DataFrame 散点图:

    df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]})
    df = df.cumsum()
    df.plot.scatter("xs", "ys", color='red', marker="*")

    ==>输出图像:

    100.DataFrame 柱形图:

    df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52],
                       "advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9],
                       "month": range(12)
                       })
    
    ax = df.plot.bar("month", "revenue", color="yellow")
    df.plot("month", "advertising", secondary_y=True, ax=ax)

    ==>输出图像:

     

  • 相关阅读:
    php socket 客户端代码
    linux crontab定时执行
    加载 pcntl 多进程
    Xdebug 配置
    Zend Debugger 配置
    windows SVN搭建
    深度学习笔记:优化方法总结(BGD,SGD,Momentum,AdaGrad,RMSProp,Adam)
    操作系统-分段机制
    C++中的new、operator new与placement new
    线程安全的概念
  • 原文地址:https://www.cnblogs.com/zmz-zero/p/13921836.html
Copyright © 2011-2022 走看看