  • Pandas alignment operations and functions

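    Alignment operations in Pandas

    Alignment is an important part of data cleaning. Operations are aligned on index labels; any position that has no match in the other operand is set to NaN, and the NaN values can be filled afterwards.

    Series alignment operations

    1. A Series aligns on its row index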

    import pandas as pd
    
    s1 = pd.Series(range(10, 20), index=range(10))
    s2 = pd.Series(range(20, 25), index=range(5))
    
    print('s1: ')
    print(s1)
    
    print('')
    
    print('s2: ')
    print(s2)

    Output:

    s1: 
    0    10
    1    11
    2    12
    3    13
    4    14
    5    15
    6    16
    7    17
    8    18
    9    19
    dtype: int64
    
    s2: 
    0    20
    1    21
    2    22
    3    23
    4    24
    dtype: int64

    2. Arithmetic on aligned Series

    s1 = pd.Series(range(10, 20), index=range(10))
    s2 = pd.Series(range(20, 25), index=range(5))
    print(s1)
    print(s2)
    print(s1+s2)

    Output:

    0    10
    1    11
    2    12
    3    13
    4    14
    5    15
    6    16
    7    17
    8    18
    9    19
    dtype: int64
    0    20
    1    21
    2    22
    3    23
    4    24
    dtype: int64
    0    30.0
    1    32.0
    2    34.0
    3    36.0
    4    38.0
    5     NaN
    6     NaN
    7     NaN
    8     NaN
    9     NaN
    dtype: float64
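
    The result above follows from matching by label rather than by position. A minimal sketch of that point, using made-up Series named left and right (these names and values are illustrative assumptions, not from the original post):

```python
import pandas as pd

# Same labels, stored in a different order
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "b", "a"])

# Addition pairs values by label, not by position
total = left + right

# Labels present on only one side produce NaN, so the result becomes float64
partial = left + pd.Series([100], index=["a"])

print(total)
print(partial)
```

    Because NaN cannot be stored in an integer column, any unmatched label turns an int64 result into float64, which is why the sum of s1 and s2 above prints 30.0 instead of 30.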

    DataFrame alignment operations

    1. A DataFrame aligns on both its row index and its column index

    import numpy as np
    import pandas as pd
    
    df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
    df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
    
    print('df1: ')
    print(df1)
    
    print('')
    print('df2: ')
    print(df2)

    Output:

    df1: 
         a    b
    0  1.0  1.0
    1  1.0  1.0
    
    df2: 
         a    b    c
    0  1.0  1.0  1.0
    1  1.0  1.0  1.0
    2  1.0  1.0  1.0

    2. Arithmetic on aligned DataFrames

    df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
    df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
    
    print('df1: ')
    print(df1)
    
    print('')
    print('df2: ')
    print(df2)
    print('df1+df2: ')
    print(df1 + df2)

    Output:

    df1: 
         a    b
    0  1.0  1.0
    1  1.0  1.0
    
    df2: 
         a    b    c
    0  1.0  1.0  1.0
    1  1.0  1.0  1.0
    2  1.0  1.0  1.0
    df1+df2: 
         a    b   c
    0  2.0  2.0 NaN
    1  2.0  2.0 NaN
    2  NaN  NaN NaN
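
    To see the intermediate step that the + operator performs, the alignment can be made explicit with DataFrame.align. This is only an illustrative sketch; the variable names left, right and result are assumptions introduced here:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 2)), columns=["a", "b"])
df2 = pd.DataFrame(np.ones((3, 3)), columns=["a", "b", "c"])

# align() reindexes both frames to the union of their row and column labels
left, right = df1.align(df2, join="outer")

# Adding the aligned frames gives the same result as df1 + df2
result = left + right

print(left)
print(result.equals(df1 + df2))
```

    After alignment, left has the same 3x3 shape as df2, with NaN in column c and in row 2, which is exactly where the NaN values in df1 + df2 come from.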

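    Filling unaligned data during the operation

    1. fill_value

    When calling add, sub, div, or mul, pass fill_value to specify a fill value. Where a label is missing from one operand, the fill value takes its place in the calculation.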

    import numpy as np
    import pandas as pd
    
    # df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
    # # Build a Series from a dict
    # ser_data = {"a": 17.8, "b": 20.1, "c": 16.5,"d":12}
    # ser_obj = pd.Series(ser_data)
    s1 = pd.Series(range(10, 20), index = range(10))
    s2 = pd.Series(range(20, 25), index = range(5))
    print(s1)
    print(s2)
    
    print(s1.add(s2, fill_value = -1))
    df1 = pd.DataFrame(np.ones((2,2)), columns = ['a', 'b'])
    df2 = pd.DataFrame(np.ones((3,3)), columns = ['a', 'b', 'c'])
    print(df1)
    print(df2)
    
    print(df1.sub(df2, fill_value = 2.))

    Output:

    0    10
    1    11
    2    12
    3    13
    4    14
    5    15
    6    16
    7    17
    8    18
    9    19
    dtype: int64
    0    20
    1    21
    2    22
    3    23
    4    24
    dtype: int64
    0    30.0
    1    32.0
    2    34.0
    3    36.0
    4    38.0
    5    14.0
    6    15.0
    7    16.0
    8    17.0
    9    18.0
    dtype: float64
         a    b
    0  1.0  1.0
    1  1.0  1.0
         a    b    c
    0  1.0  1.0  1.0
    1  1.0  1.0  1.0
    2  1.0  1.0  1.0
         a    b    c
    0  0.0  0.0  1.0
    1  0.0  0.0  1.0
    2  1.0  1.0  1.0
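
    Passing fill_value is not the same as filling the NaN values after the operation. The following sketch contrasts the two on the same s1 and s2; the names filled_before and filled_after are assumptions chosen for illustration:

```python
import pandas as pd

s1 = pd.Series(range(10, 20), index=range(10))
s2 = pd.Series(range(20, 25), index=range(5))

# fill_value stands in for the missing operand before adding: 15 + (-1) = 14
filled_before = s1.add(s2, fill_value=-1)

# fillna replaces the NaN result after adding: 15 + NaN = NaN, then -1
filled_after = (s1 + s2).fillna(-1)

print(filled_before[5], filled_after[5])
```

    Which form is appropriate depends on whether the fill value should take part in the arithmetic or simply mark the positions that could not be computed.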

    Applying functions in Pandas

    apply and applymap

    1. NumPy functions can be used directly

    df = pd.DataFrame(np.random.randn(5,4) - 1)
    print(df)
    
    print(np.abs(df))

    Output:

              0         1         2         3
    0 -0.638228 -0.615340 -2.416771 -0.521187
    1 -0.978901 -0.765940 -0.821583 -0.109666
    2 -0.182581 -0.820414 -0.497785  1.638130
    3 -1.398201  0.893015 -1.109652 -1.740068
    4 -0.079365 -0.750413  0.847062 -1.175580
              0         1         2         3
    0  0.638228  0.615340  2.416771  0.521187
    1  0.978901  0.765940  0.821583  0.109666
    2  0.182581  0.820414  0.497785  1.638130
    3  1.398201  0.893015  1.109652  1.740068
    4  0.079365  0.750413  0.847062  1.175580
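
    Random data makes it hard to check the result by eye, so here is a small sketch with fixed values (the frame and its labels are made up for illustration). It suggests that NumPy universal functions work element by element and keep the row and column labels:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[-1.0, 4.0], [9.0, -16.0]], index=["x", "y"], columns=["a", "b"])

# A NumPy ufunc is applied element by element and the labels are preserved
root = np.sqrt(np.abs(df))

print(type(root).__name__)
print(root)
```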

    2. Use apply to apply a function to each column or each row

    df = pd.DataFrame(np.random.randn(5, 4) - 1)
    print(df)
    
    print(df.apply(lambda x: x.max()))

    Output:

              0         1         2         3
    0 -0.672592 -0.917094 -1.698291 -2.683744
    1 -1.593442  0.308978 -0.668113 -0.867197
    2 -1.023184 -0.406812 -1.993301 -0.516704
    3 -0.666674 -0.524327 -2.032358  0.192416
    4 -0.466286 -1.319539 -1.643544 -1.137968
    0   -0.466286
    1    0.308978
    2   -0.668113
    3    0.192416
    dtype: float64

    Pay attention to the axis: with the default axis=0, the function is applied to each column.

    df = pd.DataFrame(np.random.randn(5, 4) - 1)
    print(df)
    
    print(df.apply(lambda x: x.max()))
    # With axis=1, the function is applied to each row
    print(df.apply(lambda x : x.max(), axis=1))

    Output:

              0         1         2         3
    0 -1.053992 -0.627906 -2.195281 -0.433810
    1 -1.838847  0.821711  0.005306 -0.485479
    2 -0.194641 -0.608357  0.476059 -0.989364
    3 -0.935286  0.370543 -0.316234 -0.482919
    4 -0.142188 -2.685907 -0.757193 -0.150942
    0   -0.142188
    1    0.821711
    2    0.476059
    3   -0.150942
    dtype: float64
    0   -0.433810
    1    0.821711
    2    0.476059
    3    0.370543
    4   -0.142188
    dtype: float64
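
    The function given to apply does not have to return a single number. As a hedged sketch, the hypothetical helper value_range below (not part of the original post) returns a labeled Series, so apply assembles the results into a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 6]})


def value_range(values):
    """Return the minimum and maximum of one column or row as a labeled Series."""
    return pd.Series({"min": values.min(), "max": values.max()})


per_column = df.apply(value_range)        # axis=0: one call per column
per_row = df.apply(value_range, axis=1)   # axis=1: one call per row

print(per_column)
print(per_row)
```

    With axis=0 the labels min and max become the rows of the result; with axis=1 they become its columns.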

    3. Use applymap to apply a function to every element

    df = pd.DataFrame(np.random.randn(5, 4) - 1)
    print(df)
    
    # Apply the function to every element (from pandas 2.1 on, DataFrame.map replaces applymap)
    f2 = lambda x : '%.2f' % x
    print(df.applymap(f2))

    Output:

              0         1         2         3
    0 -1.477573 -2.256976 -1.665249  0.381750
    1 -1.748229 -0.457566 -1.138169 -1.741856
    2 -1.456192 -0.596993 -1.293459  1.057294
    3 -0.845528 -0.725874 -2.720255  0.472505
    4 -0.927104 -1.748213 -0.382931  0.046957
           0      1      2      3
    0  -1.48  -2.26  -1.67   0.38
    1  -1.75  -0.46  -1.14  -1.74
    2  -1.46  -0.60  -1.29   1.06
    3  -0.85  -0.73  -2.72   0.47
    4  -0.93  -1.75  -0.38   0.05
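
    Since applymap is deprecated in pandas 2.1 and later in favor of DataFrame.map, a version-tolerant variant might look like the sketch below. The name elementwise and the sample values are assumptions made for illustration:

```python
import pandas as pd

df = pd.DataFrame({"x": [1.234, -5.678], "y": [0.5, 2.0]})

# DataFrame.map replaces applymap from pandas 2.1 on; fall back on older versions
elementwise = df.map if hasattr(df, "map") else df.applymap

# Format every element as a string with two decimal places
formatted = elementwise(lambda value: f"{value:.2f}")
print(formatted)
```

    Note that the result holds strings, not numbers, so this kind of formatting is best left to the final display step.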

    Sorting

    1. Sorting by index

    sort_index()

    Sorting is ascending by default; pass ascending=False for descending order.

    s4 = pd.Series(range(10, 15), index = np.random.randint(5, size=5))
    print(s4)
    
    # Sort by index; sort_index() returns a new Series and leaves s4 unchanged
    print(s4.sort_index())

    Output:

    0    10
    2    11
    3    12
    4    13
    3    14
    dtype: int64
    0    10
    2    11
    3    12
    3    14
    4    13
    dtype: int64

    When sorting a DataFrame, pay attention to the axis

    df4 = pd.DataFrame(np.random.randn(3, 5),
                       index=np.random.randint(3, size=3),
                       columns=np.random.randint(5, size=5))
    print(df4)
    
    df4_isort = df4.sort_index(axis=1, ascending=False)
    print(df4_isort)  # column labels in descending order, here 4 2 1 1 0

    Output:

              1         1         4         2         0
    0  0.661257 -1.022631  0.337867 -0.680210  0.018720
    2  0.486521 -0.617665 -1.566189  1.484633  0.284891
    2 -0.902534  2.621820 -0.278090 -0.807439  1.121617
              4         2         1         1         0
    0  0.337867 -0.680210  0.661257 -1.022631  0.018720
    2 -1.566189  1.484633  0.486521 -0.617665  0.284891
    2 -0.278090 -0.807439 -0.902534  2.621820  1.121617
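
    The examples above use random labels, so the output changes on every run. A reproducible sketch of the same calls, with fixed labels invented for illustration, could look like this:

```python
import pandas as pd

s = pd.Series([10, 11, 12], index=[2, 0, 1])
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  index=["r2", "r1"], columns=["c", "a", "b"])

by_label_desc = s.sort_index(ascending=False)  # index order: 2, 1, 0
rows_sorted = df.sort_index()                  # axis=0: sort the row labels
cols_sorted = df.sort_index(axis=1)            # axis=1: sort the column labels

print(by_label_desc)
print(rows_sorted)
print(cols_sorted)
```

    In each case the values travel with their labels; only the order of the labels changes.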

    2. Sorting by value

    sort_values(by='column name')

    Sorts by a single column whose name must be unique; if another column has the same name, an error is raised.

    df4 = pd.DataFrame(np.random.randn(3, 5))
    print(df4)
    # Sort by the values in column 0, largest first
    df4_vsort = df4.sort_values(by=0, ascending=False)
    print(df4_vsort)

    Output:

              0         1         2         3         4
    0 -0.579405  1.055458 -2.274356 -1.215769  1.582240
    1  2.081478 -0.687347  0.854755 -0.011375 -2.779123
    2  1.824004 -1.294691  0.940245  1.626087 -0.539030
              0         1         2         3         4
    1  2.081478 -0.687347  0.854755 -0.011375 -2.779123
    2  1.824004 -1.294691  0.940245  1.626087 -0.539030
    0 -0.579405  1.055458 -2.274356 -1.215769  1.582240
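
    The uniqueness requirement can be checked with a small sketch. The frames df and dup and the flag raised are illustrative assumptions; the sketch relies on pandas raising ValueError for a non-unique column label:

```python
import pandas as pd

df = pd.DataFrame([[3, 1], [1, 2], [2, 0]], columns=["a", "b"])
sorted_desc = df.sort_values(by="a", ascending=False)
print(sorted_desc)

# Two columns share the name "a", so by="a" is ambiguous
dup = pd.DataFrame([[3, 1], [1, 2]], columns=["a", "a"])
try:
    dup.sort_values(by="a")
    raised = False
except ValueError as exc:
    raised = True
    print("sort_values failed:", exc)
```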

    Handling missing data

    df_data = pd.DataFrame([np.random.randn(3), [1., 2., np.nan],
                           [np.nan, 4., np.nan], [1., 2., 3.]])
    print(df_data.head())

    Output:

              0         1         2
    0 -3.094288 -0.914912  2.419605
    1  1.000000  2.000000       NaN
    2       NaN  4.000000       NaN
    3  1.000000  2.000000  3.000000

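    1. Detect missing values: isnull()

    2. Drop missing data: dropna()

    Depending on axis, this drops the rows (axis=0, the default) or the columns (axis=1) that contain NaN.

    3. Fill missing data: fillna()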

    df_data = pd.DataFrame([np.random.randn(3), [1., 2., np.nan],
                           [np.nan, 4., np.nan], [1., 2., 3.]])
    print(df_data.head())
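    # isnull: boolean mask that is True where a value is missing
    print(df_data.isnull())
    # dropna: drop every row that contains NaN (axis=0 is the default)
    print(df_data.dropna())
    
    # dropna with axis=1: drop every column that contains NaN
    print(df_data.dropna(axis=1))
    # fillna: replace every NaN with -100
    print(df_data.fillna(-100.))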

    Output:

              0         1         2
    0 -0.390745  1.712754 -0.156704
    1  1.000000  2.000000       NaN
    2       NaN  4.000000       NaN
    3  1.000000  2.000000  3.000000
           0      1      2
    0  False  False  False
    1  False  False   True
    2   True  False   True
    3  False  False  False
              0         1         2
    0 -0.390745  1.712754 -0.156704
    3  1.000000  2.000000  3.000000
              1
    0  1.712754
    1  2.000000
    2  4.000000
    3  2.000000
                0         1           2
    0   -0.390745  1.712754   -0.156704
    1    1.000000  2.000000 -100.000000
    2 -100.000000  4.000000 -100.000000
    3    1.000000  2.000000    3.000000
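
    Dropping every row with a NaN or filling with one constant is often too coarse. As a hedged sketch of some common variations (the data and the names nan_per_column, rows_kept and mean_filled are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, np.nan],
                   [np.nan, 4.0, np.nan],
                   [1.0, 2.0, 3.0]])

nan_per_column = df.isnull().sum()   # count the missing values in each column
rows_kept = df.dropna(thresh=2)      # keep rows with at least 2 non-NaN values
mean_filled = df.fillna(df.mean())   # fill each column with its own mean

print(nan_per_column)
print(rows_kept)
print(mean_filled)
```

    Here thresh=2 keeps rows 0 and 2, and the mean fill uses a different value per column because df.mean() is itself aligned on the column labels.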
  • Original article: https://www.cnblogs.com/loaderman/p/11967210.html