zoukankan      html  css  js  c++  java
  • Pandas | GroupBy 分组

    任何分组(groupby)操作都涉及原始对象的以下操作之一:

    • 分割对象
    • 应用一个函数
    • 结合的结果

    在许多情况下,我们将数据分成多个集合,并在每个子集上应用一些函数。在应用函数中,可以执行以下操作:

    • 聚合 - 计算汇总统计;
    • 转换 - 执行一些特定于组的操作;
    • 过滤 - 在某些情况下丢弃数据;

    下面来看看创建一个DataFrame对象并对其执行所有操作 -

    import pandas as pd
    
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
             'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
             'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
             'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    df = pd.DataFrame(ipl_data)
    print (df)

    输出结果:

        Points  Rank    Team  Year
    0      876     1  Riders  2014
    1      789     2  Riders  2015
    2      863     2  Devils  2014
    3      673     3  Devils  2015
    4      741     3   Kings  2014
    5      812     4   kings  2015
    6      756     1   Kings  2016
    7      788     1   Kings  2017
    8      694     2  Riders  2016
    9      701     4  Royals  2014
    10     804     1  Royals  2015
    11     690     2  Riders  2017
    
     

    一、将数据拆分成组

    Pandas对象可以分成任何对象。有多种方式来拆分对象,如 -

    • obj.groupby(‘key’)
    • obj.groupby([‘key1’,’key2’])
    • obj.groupby(key,axis=1)

    现在来看看如何将分组对象应用于DataFrame对象

    示例

    import pandas as pd
    
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
             'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
             'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
             'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    
    df = pd.DataFrame(ipl_data)
    print (df.groupby('Team'))

    输出结果:

    <pandas.core.groupby.DataFrameGroupBy object at 0x00000245D60AD518>
    
     

    二、查看分组

    import pandas as pd
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
             'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
             'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],           
            'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    
    df = pd.DataFrame(ipl_data)
    print (df.groupby('Team').groups)
    输出结果:
    {
    'Devils': Int64Index([2, 3], dtype='int64'), 
    'Kings': Int64Index([4, 6, 7], dtype='int64'), 
    'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 
    'Royals': Int64Index([9, 10], dtype='int64'), 
    'kings': Int64Index([5], dtype='int64')
    }
    
     

    按多列分组

    import pandas as pd
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
             'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
             'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
             'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    
    df = pd.DataFrame(ipl_data)
    print (df.groupby(['Team','Year']).groups)

    输出结果:

    {
    ('Devils', 2014): Int64Index([2], dtype='int64'), 
    ('Devils', 2015): Int64Index([3], dtype='int64'), 
    ('Kings', 2014): Int64Index([4], dtype='int64'),
    ('Kings', 2016): Int64Index([6], dtype='int64'),
    ('Kings', 2017): Int64Index([7], dtype='int64'), 
    ('Riders', 2014): Int64Index([0], dtype='int64'), 
    ('Riders', 2015): Int64Index([1], dtype='int64'), 
    ('Riders', 2016): Int64Index([8], dtype='int64'), 
    ('Riders', 2017): Int64Index([11], dtype='int64'),
    ('Royals', 2014): Int64Index([9], dtype='int64'), 
    ('Royals', 2015): Int64Index([10], dtype='int64'), 
    ('kings', 2015): Int64Index([5], dtype='int64')
    }
    
     

    三、迭代遍历分组

    使用groupby对象,可以遍历类似itertools.obj的对象。

    import pandas as pd
    
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
             'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
             'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
             'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    
    df = pd.DataFrame(ipl_data)
    grouped = df.groupby('Year')
    
    for name,group in grouped:
        print (name)
        print (group)
        print ('
    ')

    输出结果:

    2014
       Points  Rank    Team  Year
    0     876     1  Riders  2014
    2     863     2  Devils  2014
    4     741     3   Kings  2014
    9     701     4  Royals  2014
    
    2015 Points Rank Team Year 1 789 2 Riders 2015 3 673 3 Devils 2015 5 812 4 kings 2015 10 804 1 Royals 2015
    2016 Points Rank Team Year 6 756 1 Kings 2016 8 694 2 Riders 2016
    2017 Points Rank Team Year 7 788 1 Kings 2017 11 690 2 Riders 2017
     

    默认情况下,groupby对象具有与分组名相同的标签名称。

    四、选择一个分组

    使用get_group()方法,可以选择一个组。参考以下示例代码 -

    import pandas as pd
    
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
             'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
             'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
             'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    
    df = pd.DataFrame(ipl_data)
    grouped = df.groupby('Year')
    print (grouped.get_group(2014))

    输出结果:

       Points  Rank    Team  Year
    0     876     1  Riders  2014
    2     863     2  Devils  2014
    4     741     3   Kings  2014
    9     701     4  Royals  2014
    
     

    五、聚合

    聚合函数为每个组返回单个聚合值。当创建了分组(group by)对象,就可以对分组数据执行多个聚合操作。

    应用单个聚合函数

    import pandas as pd
    import numpy as np
    
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
             'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
             'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
             'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    
    df = pd.DataFrame(ipl_data)
    grouped = df.groupby('Year')
    print (grouped['Points'].agg(np.mean))

    输出结果:

    Year
    2014    795.25
    2015    769.50
    2016    725.00
    2017    739.00
    Name: Points, dtype: float64
    
     

    另一种查看每个分组的大小的方法是应用size()函数 -

    import pandas as pd
    import numpy as np
    
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
             'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
             'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
             'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    
    df = pd.DataFrame(ipl_data)
    grouped = df.groupby('Team')
    print (grouped.agg(np.size))

    输出结果:

    Team                      
    Devils       2     2     2
    Kings        3     3     3
    Riders       4     4     4
    Royals       2     2     2
    kings        1     1     1
    
     

    一次应用多个聚合函数

    通过分组系列,还可以传递函数的列表或字典来进行聚合,并生成DataFrame作为输出 

    import pandas as pd
    import numpy as np
    
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
             'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
             'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
             'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    
    df = pd.DataFrame(ipl_data)
    
    print(df)
    print('
    ')
    
    grouped = df.groupby('Team')
    agg = grouped['Points'].agg([np.sum, np.mean, np.std])
    print (agg)

    输出结果:

          Team  Rank  Year  Points
    0 Riders 1 2014 876
    1 Riders 2 2015 789
    2 Devils 2 2014 863
    3 Devils 3 2015 673
    4 Kings 3 2014 741
    5 kings 4 2015 812
    6 Kings 1 2016 756
    7 Kings 1 2017 788
    8 Riders 2 2016 694
    9 Royals 4 2014 701
    10 Royals 1 2015 804
    11 Riders 2 2017 690


    sum mean std
    Team
    Devils 1536 768.000000 134.350288
    Kings 2285 761.666667 24.006943
    Riders 3049 762.250000 88.567771
    Royals 1505 752.500000 72.831998
    kings 812 812.000000 NaN
    
    
     

    六、转换

    分组或列上的转换返回索引大小与被分组的索引相同的对象。因此,转换应该返回与组块大小相同的结果。

    import pandas as pd
    import numpy as np
    
    ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
             'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
             'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
             'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
             'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
    
    df = pd.DataFrame(ipl_data)
    grouped = df.groupby('Team')
    score = lambda x: (x - x.mean()) / x.std()*10
    print (grouped.transform(score))

    输出结果:

           Points       Rank       Year
    0   12.843272 -15.000000 -11.618950
    1    3.020286   5.000000  -3.872983
    2    7.071068  -7.071068  -7.071068
    3   -7.071068   7.071068   7.071068
    4   -8.608621  11.547005 -10.910895
    5         NaN        NaN        NaN
    6   -2.360428  -5.773503   2.182179
    7   10.969049  -5.773503   8.728716
    8   -7.705963   5.000000   3.872983
    9   -7.071068   7.071068  -7.071068
    10   7.071068  -7.071068   7.071068
    11  -8.157595   5.000000  11.618950
    
     

    七、过滤

    过滤根据定义的标准过滤数据并返回数据的子集。filter()函数用于过滤数据。

    import pandas as pd
    import numpy as np
    ipl_data
    = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'], 'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2], 'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,701,804,690]} df = pd.DataFrame(ipl_data) filter = df.groupby('Team').filter(lambda x: len(x) >= 3) print (filter)

    输出结果:

        Points  Rank    Team  Year
    0      876     1  Riders  2014
    1      789     2  Riders  2015
    4      741     3   Kings  2014
    6      756     1   Kings  2016
    7      788     1   Kings  2017
    8      694     2  Riders  2016
    11     690     2  Riders  2017
    
     

    在上述过滤条件下,要求返回三次以上参加IPL的队伍。




  • 相关阅读:
    SQL语句容易出现错误的地方-连载
    3D坦克大战游戏iOS源代码
    node.js setup wizard ended prematurely Win7安装nodejs失败解决方法
    Linux环境高级编程--出错处理(CLStatus)
    HT系统部署操作文档
    VC中常见的108个问题
    菜鸟nginx源代码剖析数据结构篇(十一) 共享内存ngx_shm_t
    Smalidea无源代码调试 android 应用
    MPAndroidChart开源图表库(三)之柱形图
    intent 显式意图和隐式意图之Activity的激活方式
  • 原文地址:https://www.cnblogs.com/Summer-skr--blog/p/11705892.html
Copyright © 2011-2022 走看看