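  • Pandas alignment operations and functions

    Alignment operations in Pandas

    Alignment is an important step in data cleaning. Arithmetic between objects is aligned by index; positions that exist in only one operand are filled with NaN, and those NaN values can be filled afterwards.

    Series alignment operations

    1. Series are aligned by row index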

    # Imports used by all the examples below
    import numpy as np
    import pandas as pd

    s1 = pd.Series(range(10, 20), index=range(10))
    s2 = pd.Series(range(20, 25), index=range(5))
    
    print('s1: ')
    print(s1)
    
    print('')
    
    print('s2: ')
    print(s2)

    Output:

    s1: 
    0    10
    1    11
    2    12
    3    13
    4    14
    5    15
    6    16
    7    17
    8    18
    9    19
    dtype: int64
    
    s2: 
    0    20
    1    21
    2    22
    3    23
    4    24
    dtype: int64

    2. Arithmetic on aligned Series

    s1 = pd.Series(range(10, 20), index=range(10))
    s2 = pd.Series(range(20, 25), index=range(5))
    print(s1)
    print(s2)
    print(s1+s2)

    Output:

    0    10
    1    11
    2    12
    3    13
    4    14
    5    15
    6    16
    7    17
    8    18
    9    19
    dtype: int64
    0    20
    1    21
    2    22
    3    23
    4    24
    dtype: int64
    0    30.0
    1    32.0
    2    34.0
    3    36.0
    4    38.0
    5     NaN
    6     NaN
    7     NaN
    8     NaN
    9     NaN
    dtype: float64
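
    The result above also shows two details worth spelling out: values are matched by index label rather than by position, and the result becomes float64 because an int64 Series cannot hold NaN. A minimal sketch with made-up string labels (the names left, right and total are only for illustration):

```python
import pandas as pd

# The two Series share labels "b" and "c", but in different positions.
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "b", "d"])

# Addition pairs values by label: "b" -> 2 + 20, "c" -> 3 + 10.
total = left + right
print(total)

# Labels found on only one side ("a" and "d") come out as NaN,
# which forces the result dtype to float64.
print(total.dtype)
```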

    DataFrame alignment operations

    1. DataFrames are aligned by both row and column index

    df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
    df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
    
    print('df1: ')
    print(df1)
    
    print('')
    print('df2: ')
    print(df2)

    Output:

    df1: 
         a    b
    0  1.0  1.0
    1  1.0  1.0
    
    df2: 
         a    b    c
    0  1.0  1.0  1.0
    1  1.0  1.0  1.0
    2  1.0  1.0  1.0

    2. Arithmetic on aligned DataFrames

    df1 = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])
    df2 = pd.DataFrame(np.ones((3, 3)), columns=['a', 'b', 'c'])
    
    print('df1: ')
    print(df1)
    
    print('')
    print('df2: ')
    print(df2)
    print('df1+df2: ')
    print(df1 + df2)

    Output:

    df1: 
         a    b
    0  1.0  1.0
    1  1.0  1.0
    
    df2: 
         a    b    c
    0  1.0  1.0  1.0
    1  1.0  1.0  1.0
    2  1.0  1.0  1.0
    df1+df2: 
         a    b   c
    0  2.0  2.0 NaN
    1  2.0  2.0 NaN
    2  NaN  NaN NaN

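    Filling unaligned data during arithmetic

    1. fill_value

    When calling add, sub, div, or mul,

    pass fill_value to specify a fill value; a position that is missing in one operand is replaced by the fill value before the operation is applied.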

    import pandas as pd
    
    import numpy as np
    
    # df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
    # # Build a Series from a dict
    # ser_data = {"a": 17.8, "b": 20.1, "c": 16.5,"d":12}
    # ser_obj = pd.Series(ser_data)
    s1 = pd.Series(range(10, 20), index = range(10))
    s2 = pd.Series(range(20, 25), index = range(5))
    print(s1)
    print(s2)
    
    print(s1.add(s2, fill_value = -1))
    df1 = pd.DataFrame(np.ones((2,2)), columns = ['a', 'b'])
    df2 = pd.DataFrame(np.ones((3,3)), columns = ['a', 'b', 'c'])
    print(df1)
    print(df2)
    
    print(df1.sub(df2, fill_value = 2.))

    Output:

    0    10
    1    11
    2    12
    3    13
    4    14
    5    15
    6    16
    7    17
    8    18
    9    19
    dtype: int64
    0    20
    1    21
    2    22
    3    23
    4    24
    dtype: int64
    0    30.0
    1    32.0
    2    34.0
    3    36.0
    4    38.0
    5    14.0
    6    15.0
    7    16.0
    8    17.0
    9    18.0
    dtype: float64
         a    b
    0  1.0  1.0
    1  1.0  1.0
         a    b    c
    0  1.0  1.0  1.0
    1  1.0  1.0  1.0
    2  1.0  1.0  1.0
         a    b    c
    0  0.0  0.0  1.0
    1  0.0  0.0  1.0
    2  1.0  1.0  1.0
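
    Passing fill_value is not the same as filling NaN after the operation. The sketch below contrasts the two on the same s1 and s2, and also shows that a position missing on both sides stays NaN. The names before, after, a, b and both_missing are only for illustration.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(range(10, 20), index=range(10))
s2 = pd.Series(range(20, 25), index=range(5))

# fill_value is substituted before the addition: label 5 gives 15 + (-1).
before = s1.add(s2, fill_value=-1)

# fillna runs after the addition: label 5 is NaN first, then becomes -1.
after = (s1 + s2).fillna(-1)

print(before[5], after[5])  # 14.0 -1.0

# When a value is missing on both sides, fill_value is not used
# and the result stays NaN (label "y" here).
a = pd.Series([1.0, np.nan], index=["x", "y"])
b = pd.Series([np.nan, np.nan], index=["x", "y"])
both_missing = a.add(b, fill_value=0)
print(both_missing)
```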

    Applying functions in Pandas

    apply and applymap

    1. NumPy functions can be used directly

    df = pd.DataFrame(np.random.randn(5,4) - 1)
    print(df)
    
    print(np.abs(df))

    Output:

              0         1         2         3
    0 -0.638228 -0.615340 -2.416771 -0.521187
    1 -0.978901 -0.765940 -0.821583 -0.109666
    2 -0.182581 -0.820414 -0.497785  1.638130
    3 -1.398201  0.893015 -1.109652 -1.740068
    4 -0.079365 -0.750413  0.847062 -1.175580
              0         1         2         3
    0  0.638228  0.615340  2.416771  0.521187
    1  0.978901  0.765940  0.821583  0.109666
    2  0.182581  0.820414  0.497785  1.638130
    3  1.398201  0.893015  1.109652  1.740068
    4  0.079365  0.750413  0.847062  1.175580
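
    Because the example above uses random data, here is a small fixed-value sketch of the same idea. It also checks that the NumPy function returns a DataFrame with the original labels intact. The frame and its labels are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [-1.5, 2.0], "y": [3.0, -4.0]}, index=["r1", "r2"])

# The element-wise NumPy function is applied to every cell;
# the result is still a DataFrame with the same index and columns.
abs_df = np.abs(df)
print(type(abs_df).__name__)
print(abs_df)
```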

    2. Use apply to apply a function to each column or row

    df = pd.DataFrame(np.random.randn(5, 4) - 1)
    print(df)
    
    print(df.apply(lambda x: x.max()))

    Output:

              0         1         2         3
    0 -0.672592 -0.917094 -1.698291 -2.683744
    1 -1.593442  0.308978 -0.668113 -0.867197
    2 -1.023184 -0.406812 -1.993301 -0.516704
    3 -0.666674 -0.524327 -2.032358  0.192416
    4 -0.466286 -1.319539 -1.643544 -1.137968
    0   -0.466286
    1    0.308978
    2   -0.668113
    3    0.192416
    dtype: float64

    Mind the axis: the default is axis=0, which applies the function to each column.

    df = pd.DataFrame(np.random.randn(5, 4) - 1)
    print(df)
    
    print(df.apply(lambda x: x.max()))
    # Set axis=1 to apply the function to each row
    print(df.apply(lambda x : x.max(), axis=1))

    Output:

              0         1         2         3
    0 -1.053992 -0.627906 -2.195281 -0.433810
    1 -1.838847  0.821711  0.005306 -0.485479
    2 -0.194641 -0.608357  0.476059 -0.989364
    3 -0.935286  0.370543 -0.316234 -0.482919
    4 -0.142188 -2.685907 -0.757193 -0.150942
    0   -0.142188
    1    0.821711
    2    0.476059
    3   -0.150942
    dtype: float64
    0   -0.433810
    1    0.821711
    2    0.476059
    3    0.370543
    4   -0.142188
    dtype: float64
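
    The function passed to apply receives a whole column (axis=0) or a whole row (axis=1) as a Series. A minimal sketch with fixed data, where value_range and min_max are hypothetical helper names: returning a scalar gives a Series, while returning a Series gives a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 6]})

def value_range(values):
    # `values` is one column (axis=0) or one row (axis=1) as a Series.
    return values.max() - values.min()

per_column = df.apply(value_range)       # a -> 4, b -> 4
per_row = df.apply(value_range, axis=1)  # 3 for every row
print(per_column)
print(per_row)

def min_max(values):
    # Returning a Series makes apply build a DataFrame from the results.
    return pd.Series({"min": values.min(), "max": values.max()})

summary = df.apply(min_max)
print(summary)
```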

    3. Use applymap to apply a function to every element

    df = pd.DataFrame(np.random.randn(5, 4) - 1)
    print(df)
    
    # Apply the function to every element with applymap
    f2 = lambda x : '%.2f' % x
    print(df.applymap(f2))

    Output:

              0         1         2         3
    0 -1.477573 -2.256976 -1.665249  0.381750
    1 -1.748229 -0.457566 -1.138169 -1.741856
    2 -1.456192 -0.596993 -1.293459  1.057294
    3 -0.845528 -0.725874 -2.720255  0.472505
    4 -0.927104 -1.748213 -0.382931  0.046957
           0      1      2      3
    0  -1.48  -2.26  -1.67   0.38
    1  -1.75  -0.46  -1.14  -1.74
    2  -1.46  -0.60  -1.29   1.06
    3  -0.85  -0.73  -2.72   0.47
    4  -0.93  -1.75  -0.38   0.05
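
    In recent pandas releases (2.1 and later, as far as I know), applymap is deprecated in favor of DataFrame.map. The sketch below picks whichever method the installed version offers; two_decimals and elementwise are hypothetical names, and the version check by hasattr is an assumption rather than an official recipe.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.23456, -0.5], "b": [10.0, 2.71828]})

def two_decimals(x):
    # Format a single number as a string with two decimal places.
    return "%.2f" % x

# Use DataFrame.map when it exists, otherwise fall back to applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
formatted = elementwise(two_decimals)
print(formatted)
```

    Note that the result holds strings, not numbers, so it is suited to display rather than further arithmetic.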

    Sorting

    1. Sorting by index

    sort_index()

    Sorting is ascending by default; pass ascending=False for descending order.

    s4 = pd.Series(range(10, 15), index = np.random.randint(5, size=5))
    print(s4)
    
    # Sort by index (the labels differ from run to run because the index is random)
    print(s4.sort_index())

    Output:

    0    10
    2    11
    3    12
    4    13
    3    14
    dtype: int64
    0    10
    2    11
    3    12
    3    14
    4    13
    dtype: int64

    Mind the axis when sorting a DataFrame

    df4 = pd.DataFrame(np.random.randn(3, 5),
                       index=np.random.randint(3, size=3),
                       columns=np.random.randint(5, size=5))
    print(df4)
    
    df4_isort = df4.sort_index(axis=1, ascending=False)
    print(df4_isort) # 4 2 1 1 0

    Output:

              1         1         4         2         0
    0  0.661257 -1.022631  0.337867 -0.680210  0.018720
    2  0.486521 -0.617665 -1.566189  1.484633  0.284891
    2 -0.902534  2.621820 -0.278090 -0.807439  1.121617
              4         2         1         1         0
    0  0.337867 -0.680210  0.661257 -1.022631  0.018720
    2 -1.566189  1.484633  0.486521 -0.617665  0.284891
    2 -0.278090 -0.807439 -0.902534  2.621820  1.121617
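
    Since the labels above are random, a fixed-label sketch may make the two axes easier to compare. The frame below is made up for illustration: axis=0 (the default) reorders rows by row label, axis=1 reorders columns by column label.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    index=[2, 0, 1],
    columns=["c", "a", "b"],
)

# axis=0 (default): rows are reordered by row label, ascending.
by_rows = df.sort_index()

# axis=1: columns are reordered by column label, here descending.
by_cols = df.sort_index(axis=1, ascending=False)

print(by_rows)
print(by_cols)
```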

    2. Sorting by value

    sort_values(by='column name')

    Sorts by a column whose name is unique; if several columns share that name, sort_values raises an error.

    df4 = pd.DataFrame(np.random.randn(3, 5))
    print(df4)
    # Sort by value
    df4_vsort = df4.sort_values(by=0, ascending=False)
    print(df4_vsort)

    Output:

              0         1         2         3         4
    0 -0.579405  1.055458 -2.274356 -1.215769  1.582240
    1  2.081478 -0.687347  0.854755 -0.011375 -2.779123
    2  1.824004 -1.294691  0.940245  1.626087 -0.539030
              0         1         2         3         4
    1  2.081478 -0.687347  0.854755 -0.011375 -2.779123
    2  1.824004 -1.294691  0.940245  1.626087 -0.539030
    0 -0.579405  1.055458 -2.274356 -1.215769  1.582240
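
    Two points from this section can be sketched with fixed data: sorting by several columns at once, and the error raised when the column name is not unique. The frames and the raised flag are made up for illustration, and the error is assumed to be a ValueError, which is what current pandas versions raise to my knowledge.

```python
import pandas as pd

df = pd.DataFrame({"group": ["b", "a", "b", "a"], "score": [1, 3, 2, 4]})

# Sort by group ascending, then by score descending within each group.
ordered = df.sort_values(by=["group", "score"], ascending=[True, False])
print(ordered)

# Two columns are both named "x", so sorting by "x" is ambiguous.
dup = pd.DataFrame([[2, 1], [1, 2]], columns=["x", "x"])
raised = False
try:
    dup.sort_values(by="x")
except ValueError as err:
    raised = True
    print("sort_values failed:", err)
```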

    Handling missing data

    df_data = pd.DataFrame([np.random.randn(3), [1., 2., np.nan],
                           [np.nan, 4., np.nan], [1., 2., 3.]])
    print(df_data.head())

    Output:

              0         1         2
    0 -3.094288 -0.914912  2.419605
    1  1.000000  2.000000       NaN
    2       NaN  4.000000       NaN
    3  1.000000  2.000000  3.000000

    1. Detect missing values: isnull()

    2. Drop missing data: dropna()

    Drops rows (axis=0, the default) or columns (axis=1) that contain NaN.

    3. Fill missing data: fillna()

    df_data = pd.DataFrame([np.random.randn(3), [1., 2., np.nan],
                           [np.nan, 4., np.nan], [1., 2., 3.]])
    print(df_data.head())
    # isnull
    print(df_data.isnull())
    # dropna
    print(df_data.dropna())
    
    print(df_data.dropna(axis=1))
    # fillna
    print(df_data.fillna(-100.))

    Output:

              0         1         2
    0 -0.390745  1.712754 -0.156704
    1  1.000000  2.000000       NaN
    2       NaN  4.000000       NaN
    3  1.000000  2.000000  3.000000
           0      1      2
    0  False  False  False
    1  False  False   True
    2   True  False   True
    3  False  False  False
              0         1         2
    0 -0.390745  1.712754 -0.156704
    3  1.000000  2.000000  3.000000
              1
    0  1.712754
    1  2.000000
    2  4.000000
    3  2.000000
                0         1           2
    0   -0.390745  1.712754   -0.156704
    1    1.000000  2.000000 -100.000000
    2 -100.000000  4.000000 -100.000000
    3    1.000000  2.000000    3.000000
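
    Beyond the defaults shown above, dropna and fillna accept a few options that are often useful in cleaning. The sketch below uses a made-up frame with fixed values; the column names and the fill choices are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [1.0, 2.0, np.nan],
        [np.nan, 4.0, np.nan],
        [1.0, 2.0, 3.0],
        [np.nan, np.nan, np.nan],
    ],
    columns=["a", "b", "c"],
)

# Count the NaN values in each column: a -> 2, b -> 1, c -> 3.
nan_counts = df.isnull().sum()
print(nan_counts)

# how="all" drops only the rows in which every value is NaN (row 3).
not_all_nan = df.dropna(how="all")

# thresh=2 keeps the rows with at least two non-NaN values (rows 0 and 2).
at_least_two = df.dropna(thresh=2)

# A dict fills each column differently: a constant for "a",
# the column mean for "b"; column "c" is left untouched.
filled = df.fillna({"a": 0.0, "b": df["b"].mean()})
print(filled)
```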
  • Original article: https://www.cnblogs.com/loaderman/p/11967210.html