zoukankan      html  css  js  c++  java
  • 数据聚合与分组运算学习笔记2

    2 数据聚合

    从数组产生标量值的数据转化过程。

    In [5]: df
    Out[5]:
          data1     data2 key1 key2
    0 -1.809227  2.231607    a  one
    1  1.663738 -0.113191    a  two
    2  0.411805 -0.842909    b  one
    3  1.244803 -1.147212    b  two
    4  1.014822  0.122223    a  one
    
    In [6]: grouped = df.groupby('key1')
    
    In [7]: grouped['data1'].quantile(0.9)   # 0.9分位数,
    Out[7]:
    key1
    a    1.533955
    b    1.161503
    Name: data1, dtype: float64
    
    # 还有之前的sum,mean等。
    
    # 也能传入自定义函数
    
    In [5]: def peak_to_peak(arr):
       ...:     return arr.max()-arr.min()
    
    In [9]: grouped.agg(peak_to_peak)
    Out[9]:
             data1     data2
    key1                    
    a     3.472965  2.344798
    b     0.832998  0.304303
    
    
    In [10]: grouped.describe()
    Out[10]:
                   data1     data2
    key1                          
    a    count  3.000000  3.000000
         mean   0.289777  0.746880
         std    1.846521  1.291188
         min   -1.809227 -0.113191
         25%   -0.397203  0.004516
         50%    1.014822  0.122223
         75%    1.339280  1.176915
         max    1.663738  2.231607
    b    count  2.000000  2.000000
         mean   0.828304 -0.995061
         std    0.589018  0.215175
         min    0.411805 -1.147212
         25%    0.620055 -1.071137
         50%    0.828304 -0.995061
         75%    1.036554 -0.918985
         max    1.244803 -0.842909
    
    
    
    

    经过groupby优化过的方法

    2.1 面向列的多函数应用

    针对列的聚合运算其实就是agg或者使用mean、sum等方法。如果需要多个聚合函数同时作用:

    # 以列表的形式传入agg,元素为函数名称或函数对象。
    In [9]: df.groupby('key1').agg(['mean', peak_to_peak])
    Out[9]:
             data1                  data2
              mean peak_to_peak      mean peak_to_peak
    key1
    a     0.143435     2.047937  0.934671     1.279921
    b     0.037396     1.884494 -1.087873     1.018864
    
    # 还可以定义列名 (列名,函数名)
    In [10]: df.groupby('key1').agg([('Mean','mean'),('bar',np.std)])
    Out[10]:
             data1               data2
              Mean       bar      Mean       bar
    key1
    a     0.143435  1.125854  0.934671  0.673188
    b     0.037396  1.332538 -1.087873  0.720446
    
    # 对不同的列,应用不同的函数,使用字典
    
    In [11]: func_dict = {'data1':'mean','data2':np.std}
    
    In [12]: df.groupby('key1').agg(func_dict)
    Out[12]:
             data1     data2
    key1
    a     0.143435  0.673188
    b     0.037396  0.720446
    
    
    
    

    2.2 以无索引的形式返回聚合数据

    在groupby中默认会以唯一的分组键作为索引,如果参数中传入“as_index=False”即可禁止索引。

    In [13]: df.groupby('key1',as_index=False).mean()
    Out[13]:
      key1     data1     data2
    0    a  0.143435  0.934671
    1    b  0.037396 -1.087873
    
    # 相同的结果
    In [17]: df.groupby('key1').mean().reset_index()
    Out[17]:
      key1     data1     data2
    0    a  0.143435  0.934671
    1    b  0.037396 -1.087873
    

    3 分组级运算和转换

    聚合只是分组运算和转换的一个特例,他能够接受的是将一维数组简化为标量的函数。
    本节的transform和apply函数能够执行更多的分组运算

    In [19]: df
    Out[19]:
          data1     data2 key1 key2
    0  0.897195  0.693473    a  one
    1  0.683851  1.695230    a  two
    2 -0.904851 -0.578441    b  one
    3  0.979643 -1.597305    b  two
    4 -1.150742  0.415309    a  one
    
    In [22]: k1_mean = df.groupby('key1').mean().add_prefix('mean_')
    
    In [25]: df.merge(k1_mean,left_on='key1',right_index=True)
    Out[25]:
          data1     data2 key1 key2  mean_data1  mean_data2
    0  0.897195  0.693473    a  one    0.143435    0.934671
    1  0.683851  1.695230    a  two    0.143435    0.934671
    4 -1.150742  0.415309    a  one    0.143435    0.934671
    2 -0.904851 -0.578441    b  one    0.037396   -1.087873
    3  0.979643 -1.597305    b  two    0.037396   -1.087873
    
    # 这样稍显麻烦
    
    

    我们使用 transform 再来试一试。

    In [12]: key = ['one','two','one','two','one']
    
    In [13]: people
    Out[13]:
                   a         b         c         d         e
    Joe    -0.632221  0.074549 -0.964887 -2.144542  1.787703
    Steve  -0.442427  0.925263  0.646603  0.719694 -0.208777
    Wes    -0.019415  0.032652 -0.475912  0.490444 -0.305742
    Jim     0.214406 -0.263113 -1.867597  1.762480  1.024320
    Travis  0.348362 -0.650411 -1.660043 -0.332139 -0.669246
    
    In [14]: people.groupby(key).mean()  # 求用key分类后的平均值
    Out[14]:
                a         b         c         d         e
    one -0.101091 -0.181070 -1.033614 -0.662079  0.270905
    two -0.114010  0.331075 -0.610497  1.241087  0.407772
    
    In [15]: people.groupby(key).transform(np.mean)  # 将通过key计算出来的平均值,赋值回去
    Out[15]:
                   a         b         c         d         e
    Joe    -0.101091 -0.181070 -1.033614 -0.662079  0.270905
    Steve  -0.114010  0.331075 -0.610497  1.241087  0.407772
    Wes    -0.101091 -0.181070 -1.033614 -0.662079  0.270905
    Jim    -0.114010  0.331075 -0.610497  1.241087  0.407772
    Travis -0.101091 -0.181070 -1.033614 -0.662079  0.270905
    
    In [16]: def demean(arr):
        ...:     return arr-arr.mean()
        ...:
    
    # transform 会将传入的函数应用到各个分组中,并将结果放到适当的位置上去。
    #
    In [17]: demeaned = people.groupby(key).transform(demean)
    
    In [18]: demeaned
    Out[18]:
                   a         b         c         d         e
    Joe    -0.531130  0.255619  0.068727 -1.482463  1.516798
    Steve  -0.328417  0.594188  1.257100 -0.521393 -0.616549
    Wes     0.081676  0.213722  0.557702  1.152523 -0.576647
    Jim     0.328417 -0.594188 -1.257100  0.521393  0.616549
    Travis  0.449453 -0.469341 -0.626429  0.329940 -0.940151
    
    In [19]: demeaned.groupby(key).mean()
    Out[19]:
                    a             b             c             d    e
    one  1.850372e-17  1.850372e-17 -1.110223e-16 -5.551115e-17  0.0
    two  0.000000e+00  0.000000e+00 -1.110223e-16  0.000000e+00  0.0
    

    3.1 apply:一般性的“拆分-应用-合并”

    transform与agg是一样的,要么产生一个可以广播的标量,要么产生一个相同大小的数组。
    apply会将对象拆分成多个片段,然后对各个片段传入函数,最后尝试组合。
    即“拆分-应用-合并”。

    In [14]: tips = pd.read_csv('tips.csv')
    
    In [15]: tips['tip_pct']=tips['tip']/tips['total_bill']
    
    In [16]: tips
    Out[16]:
         total_bill   tip     sex smoker   day    time  size   tip_pct
    0         16.99  1.01  Female     No   Sun  Dinner     2  0.059447
    1         10.34  1.66    Male     No   Sun  Dinner     3  0.160542
    2         21.01  3.50    Male     No   Sun  Dinner     3  0.166587
    
    In [17]: def top(df, n=5, column = 'tip_pct'):
        ...:     return df.sort_index(by=column)[-n:]
        ...:
    
    In [18]: top(tips, n=6)
    C:UsersyangflAnaconda3Scriptsipython-script.py:2: FutureWarning: by argumen
    t to sort_index is deprecated, pls use .sort_values(by=...)
      import sys
    Out[18]:
         total_bill   tip     sex smoker  day    time  size   tip_pct
    109       14.31  4.00  Female    Yes  Sat  Dinner     2  0.279525
    183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
    232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
    67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
    178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
    172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345
    
    In [17]: def top(df, n=5, column = 'tip_pct'):
        ...:     return df.sort_index(by=column)[-n:]
        ...:
    
    In [18]: top(tips, n=6)
    C:UsersyangflAnaconda3Scriptsipython-script.py:2: FutureWarning: by argumen
    t to sort_index is deprecated, pls use .sort_values(by=...)
      import sys
    Out[18]:
         total_bill   tip     sex smoker  day    time  size   tip_pct
    109       14.31  4.00  Female    Yes  Sat  Dinner     2  0.279525
    183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
    232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
    67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
    178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
    172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345
    
    In [19]: tips.groupby('smoker').apply(top)
    C:UsersyangflAnaconda3Scriptsipython-script.py:2: FutureWarning: by argumen
    t to sort_index is deprecated, pls use .sort_values(by=...)
      import sys
    Out[19]:
                total_bill   tip     sex smoker   day    time  size   tip_pct
    smoker
    No     88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
           185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
           51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
           149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
           232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
    Yes    109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
           183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
           67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
           178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
           172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345
    
    In [20]: result = tips.groupby('smoker')['tip_pct'].describe()
    
    In [21]: result
    Out[21]:
    smoker
    No      count    151.000000
           mean       0.159328
           std        0.039910
           min        0.056797
           25%        0.136906
           50%        0.155625
           75%        0.185014
           max        0.291990
    Yes     count     93.000000
           mean       0.163196
           std        0.085119
           min        0.035638
           25%        0.106771
           50%        0.153846
           75%        0.195059
           max        0.710345
    Name: tip_pct, dtype: float64
    
    

    禁止分组键

    In [22]: tips.groupby('smoker', group_keys=False).apply(top)
    C:UsersyangflAnaconda3Scriptsipython-script.py:2: FutureWarning: by argumen
    t to sort_index is deprecated, pls use .sort_values(by=...)
      import sys
    Out[22]:
         total_bill   tip     sex smoker   day    time  size   tip_pct
    88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
    185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
    51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
    149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
    232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
    109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
    183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
    67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
    178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
    172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345
    
    

    3.2 分位数和桶分析

    分位数

    In [23]: frame = DataFrame({'data1':np.random.randn(1000),'data2':np.random.ran
        ...: dn(1000)})
    
    In [24]: factor = pd.cut(frame.data1, 4)  #分成四份
    
    In [25]: factor[:5]
    Out[25]:
    0    (-0.122, 1.486]
    1    (-1.73, -0.122]
    2    (1.486, 3.0932]
    3    (-1.73, -0.122]
    4    (-0.122, 1.486]
    Name: data1, dtype: category
    Categories (4, object): [(-3.344, -1.73] < (-1.73, -0.122] < (-0.122, 1.486] < (
    1.486, 3.0932]]
    
    In [26]: def get_stats(group):
        ...:     return {"min":group.min(),
        ...:     'max':group.max(),
        ...:     'count':group.count(),
        ...:     'mean':group.mean()}
        ...:
    
    In [27]: grouped = frame.data2.groupby(factor)
    
    In [28]: grouped.apply(get_stats).unstack()
    Out[28]:
                     count       max      mean       min
    data1
    (-3.344, -1.73]   38.0  2.219138  0.068483 -2.613319
    (-1.73, -0.122]  421.0  3.024953 -0.043523 -2.733123
    (-0.122, 1.486]  472.0  3.882440 -0.023289 -4.073087
    (1.486, 3.0932]   69.0  2.520630  0.067271 -2.593762
    

    桶分析

    In [29]: grouping = pd.qcut(frame.data1, 10, labels=False)
    
    In [30]: grouped= frame.data2.groupby(grouping)
    
    In [31]: grouped.apply(get_stats).unstack()
    Out[31]:
           count       max      mean       min
    data1
    0      100.0  2.219138 -0.140476 -2.613319
    1      100.0  2.877679  0.047305 -2.483696
    2      100.0  2.716929 -0.043014 -2.446612
    3      100.0  2.387355 -0.080337 -2.733123
    4      100.0  3.024953  0.031980 -2.157749
    5      100.0  2.617329  0.011424 -2.048438
    6      100.0  2.362811 -0.074931 -2.942770
    7      100.0  3.882440  0.024096 -3.038111
    8      100.0  2.587888 -0.014792 -4.073087
    9      100.0  2.520630  0.018031 -2.613203
    

    3.3 示例:用特定于分组的值填充缺失值

    In [32]: s = Series(np.random.randn(6))
    
    In [33]: s[::2] = np.nan
    
    In [34]: s
    Out[34]:
    0         NaN
    1    0.818362
    2         NaN
    3    0.624027
    4         NaN
    5   -1.260203
    dtype: float64
    
    In [35]: s.fillna(s.mean())  # 使用平均值填充空值
    Out[35]:
    0    0.060729
    1    0.818362
    2    0.060729
    3    0.624027
    4    0.060729
    5   -1.260203
    dtype: float64
    
    

    3.4 示例:随机采样和排列

    一副扑克牌:

    In [36]: suits = ['H','S','C','D']  # 四个花色
    
    In [42]: card_val = (list(range(1,11))+[10]*3) * 4  # 21点
    
    # A ~ K
    In [44]: base_names = ['A'] + list(range(2,11)) + ['J','Q','K']
    
    In [45]: cards =[]
    
    In [46]: for suit in suits:
        ...:     cards.extend(str(num)+suit for num in base_names)
        ...:
    # 做一个Series
    In [47]: deck = Series(card_val, index=cards)
    
    In [48]: deck
    Out[48]:
    AH      1
    2H      2
    3H      3
    4H      4
    5H      5
    6H      6
    ...
    

    随机抽出5张牌

    In [49]: def draw(deck,n=5):
        ...:     return deck.take(np.random.permutation(len(deck))[:n])
        ...:
    # np.random.permutation 随机排序
    In [50]: draw(deck)  # 随机抽出5张牌
    Out[50]:
    7C      7
    2D      2
    10D    10
    2H      2
    AD      1
    dtype: int64
    

    随机每种花色抽取两张牌

    In [51]: get_suit = lambda card:card[-1]
    
    In [52]: deck.groupby(get_suit).apply(draw, 2)
    Out[52]:
    C  2C      2
       9C      9
    D  8D      8
       6D      6
    H  AH      1
       10H    10
    S  3S      3
       6S      6
    dtype: int64
    

    3.5 示例:分组加权平均数和相关系数

    In [54]: df = DataFrame({'category':list('aaaabbbb'),
        ...: 'data':np.random.randn(8),
        ...: 'weights':np.random.rand(8)})
    
    In [55]: df
    Out[55]:
      category      data   weights
    0        a -0.239087  0.172904
    1        a -0.615509  0.511186
    2        a  1.015113  0.778079
    3        a -0.726262  0.284090
    4        b  0.443518  0.189707
    5        b  1.025131  0.764808
    6        b -1.725867  0.108514
    7        b  0.613947  0.696840
    
    In [56]: grouped = df.groupby('category')  # 分组
    
    # 加权平均值
    In [57]: get_wavg = lambda g: np.average(g['data'],weights=g['weights'])
    
    In [58]: grouped.apply(get_wavg)
    Out[58]:
    category
    a    0.130299
    b    0.629996
    dtype: float64
    
  • 相关阅读:
    【React Native】使用react-native-wechat 进行微信好友、微信朋友圈进行分享
    【React Native】在网页中打开Android应用程序
    【React Native】错误提示: 在React开发中,我们可能经常会遇到这个一个警告: Can't perform a React state update on an unmounted component.
    【Swift】上传图片到腾讯云(生成token,上传)
    【React Native】集成声网Agora语音通讯
    【React Native】在React Native项目中使用rollbar-react-native记录
    【iOS】上传图片到七牛过程(生成Token,上传)
    【React Native】同步获取本地缓存键值对
    React Native之持久化存储(AsyncStorage、react-native-storage)的使用
    【React Native】手写Alert弹窗(单按钮,两个按钮)
  • 原文地址:https://www.cnblogs.com/felo/p/6407580.html
Copyright © 2011-2022 走看看