zoukankan      html  css  js  c++  java
  • 数据聚合与分组运算学习笔记1

    本章内容:

    • 根据一个或多个键拆分pandas对象;
    • 计算分数摘要统计;
    • 对DataFrame的列应用各种函数;
    • 应用组内转换或运算;
    • 计算透视表或交叉表;
    • 执行分位数分析等;

    1 GroupBy技术

    分组聚合演示:

    In [4]: df = DataFrame({'key1':list('aabba'),'key2':['one','two','one','two','o
       ...: ne'],'data1':np.random.randn(5),'data2':np.random.randn(5)})
    
    In [5]: df
    Out[5]:
          data1     data2 key1 key2
    0 -0.187814 -0.224238    a  one
    1 -1.363033  0.087923    a  two
    2  0.624801 -0.132735    b  one
    3  0.643536  0.335998    b  two
    4 -1.318154  0.208845    a  one
    
    In [6]: grouped=df['data1'].groupby(df['key1'])  # 将data1数据按照key1进行分组
    
    In [7]: grouped
    Out[7]: <pandas.core.groupby.SeriesGroupBy object at 0x0000000007D7FEF0>
    
    In [8]: grouped.mean()  # 求各分组的平均值
    Out[8]:
    key1
    a   -0.956334
    b    0.634169
    Name: data1, dtype: float64
    
    In [9]: means = df['data1'].groupby([df['key1'],df['key2']]).mean()
    
    In [10]: means   #分层索引 进行分组
    Out[10]:
    key1  key2
    a     one    -0.752984
          two    -1.363033
    b     one     0.624801
          two     0.643536
    Name: data1, dtype: float64
    
    In [11]: means.unstack()  # 解stack
    Out[11]:
    key2       one       two
    key1
    a    -0.752984 -1.363033
    b     0.624801  0.643536
    
    # 数据和分组是Series,其实也可以是其他任意的数组,arange
    
    In [12]: df.groupby('key1').mean()
    Out[12]:  # key2不是数值列,因此属于“麻烦列”,不能用于计算mean,所以去除掉了。
             data1     data2
    key1
    a    -0.956334  0.024177
    b     0.634169  0.101632
    
    In [13]: df.groupby(['key1','key2']).mean()
    Out[13]:
                  data1     data2
    key1 key2
    a    one  -0.752984 -0.007696
         two  -1.363033  0.087923
    b    one   0.624801 -0.132735
         two   0.643536  0.335998
    
    # 返回分组后,每组的数量
    In [17]: df.groupby(['key1','key2']).size()
    Out[17]:
    key1  key2
    a     one     2
         two     1
    b     one     1
         two     1
    dtype: int64
    
    

    1.1 对分组进行迭代

    对groupby对象,可以进行迭代。

    # name: set(key1)中的值,进行循环组名
    # group:每个name对应的df的子集
    In [18]: for name,group in df.groupby('key1'):
        ...:     print(name)
        ...:     print(group)
        ...:
    a
          data1     data2 key1 key2
    0 -0.187814 -0.224238    a  one
    1 -1.363033  0.087923    a  two
    4 -1.318154  0.208845    a  one
    b
          data1     data2 key1 key2
    2  0.624801 -0.132735    b  one
    3  0.643536  0.335998    b  two
    
    # 使用(k1,k2)元组来循环分组的组名
    In [20]: for (k1,k2),group in df.groupby(['key1','key2']):
        ...:     print(k1,k2)
        ...:     print(group)
        ...:
    ('a', 'one')
          data1     data2 key1 key2
    0 -0.187814 -0.224238    a  one
    4 -1.318154  0.208845    a  one
    ('a', 'two')
          data1     data2 key1 key2
    1 -1.363033  0.087923    a  two
    ('b', 'one')
          data1     data2 key1 key2
    2  0.624801 -0.132735    b  one
    ('b', 'two')
          data1     data2 key1 key2
    3  0.643536  0.335998    b  two
    
    # 可以进行字典、列表操作
    In [21]: dict(list(df.groupby('key1')))['b']
    Out[21]:
          data1     data2 key1 key2
    2  0.624801 -0.132735    b  one
    3  0.643536  0.335998    b  two
    
    In [24]: df.dtypes
    Out[24]:
    data1    float64
    data2    float64
    key1      object
    key2      object
    dtype: object
    
    # 默认使用axis=0进行分类,也可以按照列进行分类
    In [25]: df.groupby(df.dtypes,axis=1)
    Out[25]: <pandas.core.groupby.DataFrameGroupBy object at 0x0000000007F024A8>
    
    In [26]: grouped = df.groupby(df.dtypes,axis=1)
    
    In [27]: dict(list(grouped))
    Out[27]:
    {dtype('float64'):       data1     data2
     0 -0.187814 -0.224238
     1 -1.363033  0.087923
     2  0.624801 -0.132735
     3  0.643536  0.335998
     4 -1.318154  0.208845, dtype('O'):   key1 key2
     0    a  one
     1    a  two
     2    b  one
     3    b  two
     4    a  one}
    
    

    1.2 选取一个或一组列

    In [28]: df.groupby('key1')['data1'] # 等价于[31]
    Out[28]: <pandas.core.groupby.SeriesGroupBy object at 0x0000000007F02550>
    
    In [29]: df.groupby('key1')[['data1','data2']] # 等价于[33]
    Out[29]: <pandas.core.groupby.DataFrameGroupBy object at 0x0000000007E448D0
    
    In [31]: df['data1'].groupby(df['key1'])
    Out[31]: <pandas.core.groupby.SeriesGroupBy object at 0x0000000007F02198>
    
    In [33]: df[['data1','data2']].groupby(df['key1'])
    Out[33]: <pandas.core.groupby.DataFrameGroupBy object at 0x0000000007FE8EF0>
    

    1.3 通过字典或Series进行分组

    In [35]: people = DataFrame(np.random.randn(5,5),columns=list('abcde'),index=['
        ...: Joe','Steve','Wes','Jim','Travis'])
    
    In [36]: people.ix[2:3,'b':'c'] = np.nan
    
    In [37]: people
    Out[37]:
                   a         b         c         d         e
    Joe     0.231921 -0.185109 -1.103585  0.919247  0.827190
    Steve   0.336541 -0.358981 -0.897516  0.772835 -1.774432
    Wes    -0.069938       NaN       NaN  0.268681 -1.199919
    Jim     1.324175 -0.968378 -2.376905 -0.222219 -2.366535
    Travis  0.815229 -1.139785 -2.045500 -1.576867 -0.578046
    
    # 用字典进行分组
    In [38]: mapping = {'a':'red','b':'green','c':'red','d':'red','e':'green'}
    
    In [39]: people.groupby(mapping,axis=1)
    Out[39]: <pandas.core.groupby.DataFrameGroupBy object at 0x000000000803A048>
    
    In [40]: people.groupby(mapping,axis=1).sum()  # 将列进行重新映射
    Out[40]:
               green       red
    Joe     0.642080  0.047583
    Steve  -2.133413  0.211860
    Wes    -1.199919  0.198743
    Jim    -3.334913 -1.274948
    Travis -1.717832 -2.807137
    
    # 同样的,也可以使用Series进行分组
    In [44]: mapping = Series(mapping)
    
    In [45]: people.groupby(mapping,axis=1).count()
    Out[45]:
            green  red
    Joe         2    3
    Steve       2    3
    Wes         1    2
    Jim         2    3
    Travis      2    3
    

    1.4 通过函数进行分组

    groupby也可以传入函数,分别在索引值上执行一便,并将返回值作为新的分组名称。

    In [46]: people
    Out[46]:
                   a         b         c         d         e
    Joe     0.231921 -0.185109 -1.103585  0.919247  0.827190
    Steve   0.336541 -0.358981 -0.897516  0.772835 -1.774432
    Wes    -0.069938       NaN       NaN  0.268681 -1.199919
    Jim     1.324175 -0.968378 -2.376905 -0.222219 -2.366535
    Travis  0.815229 -1.139785 -2.045500 -1.576867 -0.578046
    
    # 按照人名长度进行分类,并计算对应列的和
    In [47]: people.groupby(len).sum()
    Out[47]:
              a         b         c         d         e
    3  1.486159 -1.153487 -3.480490  0.965709 -2.739264
    5  0.336541 -0.358981 -0.897516  0.772835 -1.774432
    6  0.815229 -1.139785 -2.045500 -1.576867 -0.578046
    
    In [48]: key_list = ['one','one','one','two','two']
    # 按照人名、keylist进行分组,并计算组中最小值
    In [49]: people.groupby([len,key_list]).min()
    Out[49]:
                  a         b         c         d         e
    3 one -0.069938 -0.185109 -1.103585  0.268681 -1.199919
      two  1.324175 -0.968378 -2.376905 -0.222219 -2.366535
    5 one  0.336541 -0.358981 -0.897516  0.772835 -1.774432
    6 two  0.815229 -1.139785 -2.045500 -1.576867 -0.578046
    

    1.5 按照索引级别进行分组

    In [51]: columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5
        ...: ,1,3]],names=['cty','tenor'])
    
    In [52]: hier_df = DataFrame(np.random.randn(4,5), columns=columns)
    
    In [53]: hier_df
    Out[53]:
    cty          US                            JP
    tenor         1         3         5         1         3
    0     -0.295106  1.021162 -0.233712 -0.708512  0.617306
    1      0.404333  1.224998 -0.169200  0.939965 -1.030823
    2      0.095957  0.617305  0.867929  1.774633  0.800739
    3      1.433123  0.355095 -0.377080 -0.360470  0.690263
    
    In [54]: hier_df.groupby(level='cty',axis=1).count()  #按照级别cty进行分组
    Out[54]:
    cty  JP  US
    0     2   3
    1     2   3
    2     2   3
    3     2   3
    

    待续。。。

  • 相关阅读:
    BestCoder Round #86 1001
    Codeforces Round #365 (Div. 2) B
    Codeforces Round #365 (Div. 2) A
    Codeforces Round #129 (Div. 2) C
    Codeforces Round #129 (Div. 2) B
    Android.mk 文件语法详解
    RDS和ROS使用小结
    电力企业计量生产需求系统解决方案
    android 修改framework下资源文件后如何编译
    USB port 如何识别不同的Charger类型
  • 原文地址:https://www.cnblogs.com/felo/p/6393653.html
Copyright © 2011-2022 走看看