  • pandas groupby aggregation functions

    import numpy as np 
    import pandas as pd
    

    Aggregation functions

    Aggregations refer to any data transformation that produces scalar values from arrays (array in, scalar out). The preceding examples have used several of them, including mean, count, min, and sum. You may wonder what is going on when you invoke mean() on a GroupBy object. Many common aggregations, such as those found in Table 10-1 and listed below, have optimized implementations; the sketch after the list shows them called directly on a GroupBy object. However, you are not limited to only this set of methods.

    • count
    • sum
    • mean
    • median
    • std, var
    • min, max
    • prod
    • first, last
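
    These optimized methods can be called directly on the GroupBy object, and agg('name') is an equivalent spelling. A minimal sketch (the toy frame here is hypothetical, not part of the original examples):

    import pandas as pd

    toy = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 3.0, 2.0]})
    g = toy.groupby('key')['val']

    g.mean()       # optimized (Cython) implementation
    g.agg('mean')  # same result, via agg with the function name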

    You can use aggregations of your own devising and additionally call any method that is also defined on the grouped object.
    For example, you might recall that quantile computes sample quantiles of a Series or a DataFrame.

    While quantile is not explicitly implemented for GroupBy, it is a Series method and thus available for use. Internally, GroupBy efficiently slices up the Series, calls piece.quantile(0.9) for each piece, and then assembles those results together into the result object:


    df = pd.DataFrame({
        'key1': 'a a b b a'.split(),
        'key2': ['one', 'two', 'one', 'two', 'one'],
        'data1': np.random.randn(5),
        'data2': np.random.randn(5)
    })
    
    df
    
      key1 key2     data1     data2
    0    a  one  1.296733 -0.756093
    1    a  two -1.389859 -1.027718
    2    b  one -0.846801 -0.802681
    3    b  two  1.200620 -1.328187
    4    a  one  0.002991 -1.223807

    grouped = df.groupby('key1')
    
    grouped['data1'].quantile(0.9)  # the 0.9 quantile within each group
    
    key1
    a    1.037985
    b    0.995878
    Name: data1, dtype: float64
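
    Conceptually, that dispatch is roughly equivalent to slicing the groups by hand and reassembling the pieces (a sketch of the behavior described above, not the actual implementation):

    pd.Series({name: piece.quantile(0.9)
               for name, piece in grouped['data1']})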
    

    To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:

    def peak_to_peak(arr):
        """Compute the range (max minus min) of an array."""
        return arr.max() - arr.min()
    
    grouped.agg(peak_to_peak)  # per-group range of each column, in the spirit of apply
    
             data1     data2
    key1
    a     2.686592  0.467714
    b     2.047421  0.525506

    You may notice that some methods like describe also work, even though they are not aggregations, strictly speaking (duck typing?):

    grouped.describe()  # per-group summary statistics; numeric columns only
    
         data1
         count      mean       std       min       25%       50%       75%       max
    key1
    a      3.0 -0.030045  1.343601 -1.389859 -0.693434  0.002991  0.649862  1.296733
    b      2.0  0.176910  1.447745 -0.846801 -0.334946  0.176910  0.688765  1.200620

         data2
         count      mean       std       min       25%       50%       75%       max
    key1
    a      3.0 -1.002539  0.234871 -1.223807 -1.125762 -1.027718 -0.891905 -0.756093
    b      2.0 -1.065434  0.371589 -1.328187 -1.196811 -1.065434 -0.934057 -0.802681

    I will explain in more detail what has happened here in Section 10.3, Apply: General split-apply-combine, on page 302.

    Column-wise and multiple function application

    Let's return to the tipping dataset from earlier examples. After loading it with read_csv, we add a tipping percentage column, tip_pct:

    Add a new column with tip as a percentage of the total bill.

    tips = pd.read_csv('../examples/tips.csv')
    
    tips.info()
    
    tips.head()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 244 entries, 0 to 243
    Data columns (total 6 columns):
    total_bill    244 non-null float64
    tip           244 non-null float64
    smoker        244 non-null object
    day           244 non-null object
    time          244 non-null object
    size          244 non-null int64
    dtypes: float64(2), int64(1), object(3)
    memory usage: 11.5+ KB
    
       total_bill   tip smoker  day    time  size
    0       16.99  1.01     No  Sun  Dinner     2
    1       10.34  1.66     No  Sun  Dinner     3
    2       21.01  3.50     No  Sun  Dinner     3
    3       23.68  3.31     No  Sun  Dinner     2
    4       24.59  3.61     No  Sun  Dinner     4

    "新增一列 tip_pct"
    
    tips['tip_pct'] = tips['tip'] / tips['total_bill']
    
    tips[:6]
    
    'Add a tip_pct column'
    
       total_bill   tip smoker  day    time  size   tip_pct
    0       16.99  1.01     No  Sun  Dinner     2  0.059447
    1       10.34  1.66     No  Sun  Dinner     3  0.160542
    2       21.01  3.50     No  Sun  Dinner     3  0.166587
    3       23.68  3.31     No  Sun  Dinner     2  0.139780
    4       24.59  3.61     No  Sun  Dinner     4  0.146808
    5       25.29  4.71     No  Sun  Dinner     4  0.186240

    As you've already seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column, or multiple functions at once. Fortunately, this is possible to do, which I'll illustrate through a number of examples. First, I'll group the tips by day and smoker:


    grouped = tips.groupby(['day', 'smoker'])
    

    Note that for descriptive statistics like those in Table 10-1, you can pass the name of the function as a string:

    grouped_pct = grouped['tip_pct']
    
    grouped_pct.agg('mean')
    
    day   smoker
    Fri   No        0.151650
          Yes       0.174783
    Sat   No        0.158048
          Yes       0.147906
    Sun   No        0.160113
          Yes       0.187250
    Thur  No        0.160298
          Yes       0.163863
    Name: tip_pct, dtype: float64
    
    # count the observations in each group
    grouped_pct.agg('count')
    
    day   smoker
    Fri   No         4
          Yes       15
    Sat   No        45
          Yes       42
    Sun   No        57
          Yes       19
    Thur  No        45
          Yes       17
    Name: tip_pct, dtype: int64
    

    If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:

    "对1or多个列, 进行1or多个聚合, 并排展开, 厉害了"
    
    grouped_pct.agg(['mean', 'std', peak_to_peak])
    
    'Apply one or more aggregations to one or more columns, laid out side by side'
    
                     mean       std  peak_to_peak
    day  smoker
    Fri  No      0.151650  0.028123      0.067349
         Yes     0.174783  0.051293      0.159925
    Sat  No      0.158048  0.039767      0.235193
         Yes     0.147906  0.061375      0.290095
    Sun  No      0.160113  0.042347      0.193226
         Yes     0.187250  0.154134      0.644685
    Thur No      0.160298  0.038774      0.193350
         Yes     0.163863  0.039389      0.151240

    Here we passed a list of aggregation functions to agg to evaluate independently on the data groups.

    You don't need to accept the names that GroupBy gives to the columns; notably, lambda functions have the name <lambda>, which makes them hard to identify (you can see for yourself by looking at a function's __name__ attribute). Thus, if you pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column names
    (you can think of a list of 2-tuples as an ordered mapping).
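
    For example (reusing grouped_pct from above), an unnamed lambda shows up under an unreadable name, while a (name, function) tuple fixes that; note that on recent pandas the lambda column may be labeled <lambda_0> rather than <lambda>:

    grouped_pct.agg([lambda x: x.max() - x.min()])             # column named '<lambda>'
    grouped_pct.agg([('range', lambda x: x.max() - x.min())])  # column named 'range'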

    "给分组字段取别名"
    grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
    
    'Give the aggregated columns custom names'
    
                      foo       bar
    day  smoker
    Fri  No      0.151650  0.028123
         Yes     0.174783  0.051293
    Sat  No      0.158048  0.039767
         Yes     0.147906  0.061375
    Sun  No      0.160113  0.042347
         Yes     0.187250  0.154134
    Thur No      0.160298  0.038774
         Yes     0.163863  0.039389

    With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:

    functions = ['count', 'mean', 'max']
    
    "实现对任意字段的任意操作, 分别"
    result = grouped['tip_pct', 'total_bill'].agg(functions)
    
    result
    
    'Apply any set of functions to any set of columns, each independently'
    
                tip_pct                      total_bill
                  count      mean       max      count       mean    max
    day  smoker
    Fri  No           4  0.151650  0.187735          4  18.420000  22.75
         Yes         15  0.174783  0.263480         15  16.813333  40.17
    Sat  No          45  0.158048  0.291990         45  19.661778  48.33
         Yes         42  0.147906  0.325733         42  21.276667  50.81
    Sun  No          57  0.160113  0.252672         57  20.506667  48.17
         Yes         19  0.187250  0.710345         19  24.120000  45.35
    Thur No          45  0.160298  0.266312         45  17.113111  41.19
         Yes         17  0.163863  0.241255         17  19.190588  43.11

    As you can see, the resulting DataFrame has hierarchical columns, the same as you would get by aggregating each column separately and using concat to glue the results together, with the column names as the keys argument.
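
    Spelled out, that equivalence looks roughly like this (a sketch reusing grouped and the functions list from above):

    pieces = [grouped[col].agg(functions) for col in ['tip_pct', 'total_bill']]
    pd.concat(pieces, axis=1, keys=['tip_pct', 'total_bill'])  # same as result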

    result['tip_pct']  # selecting from the hierarchical columns
    
                  count      mean       max
    day  smoker
    Fri  No           4  0.151650  0.187735
         Yes         15  0.174783  0.263480
    Sat  No          45  0.158048  0.291990
         Yes         42  0.147906  0.325733
    Sun  No          57  0.160113  0.252672
         Yes         19  0.187250  0.710345
    Thur No          45  0.160298  0.266312
         Yes         17  0.163863  0.241255

    As before, a list of tuples with custom names can be passed:

    ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
    
    grouped[['tip_pct', 'total_bill']].agg(ftuples)
    
                      tip_pct                 total_bill
                 Durchschnitt Abweichung    Durchschnitt  Abweichung
    day  smoker
    Fri  No          0.151650   0.000791       18.420000   25.596333
         Yes         0.174783   0.002631       16.813333   82.562438
    Sat  No          0.158048   0.001581       19.661778   79.908965
         Yes         0.147906   0.003767       21.276667  101.387535
    Sun  No          0.160113   0.001793       20.506667   66.099980
         Yes         0.187250   0.023757       24.120000  109.046044
    Thur No          0.160298   0.001503       17.113111   59.625081
         Yes         0.163863   0.001551       19.190588   69.808518

    Now suppose you wanted to apply potentially different functions to one or more of the columns. To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:

    grouped.agg({'tip': np.max, 'size': 'sum'})
    
                   tip  size
    day  smoker
    Fri  No       3.50     9
         Yes      4.73    31
    Sat  No       9.00   115
         Yes     10.00   104
    Sun  No       6.00   167
         Yes      6.50    49
    Thur No       6.70   112
         Yes      5.00    40

    grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std', 'sum'],
                 'size': 'sum'})
    
                  tip_pct                                            size
                      min       max      mean       std       sum    sum
    day  smoker
    Fri  No      0.120385  0.187735  0.151650  0.028123  0.606602      9
         Yes     0.103555  0.263480  0.174783  0.051293  2.621746     31
    Sat  No      0.056797  0.291990  0.158048  0.039767  7.112145    115
         Yes     0.035638  0.325733  0.147906  0.061375  6.212055    104
    Sun  No      0.059447  0.252672  0.160113  0.042347  9.126438    167
         Yes     0.065660  0.710345  0.187250  0.154134  3.557756     49
    Thur No      0.072961  0.266312  0.160298  0.038774  7.213414    112
         Yes     0.090014  0.241255  0.163863  0.039389  2.785676     40

    A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.
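
    A quick way to see the difference (reusing grouped from above):

    grouped.agg({'tip': 'max'})           # one function per column -> flat columns
    grouped.agg({'tip': ['max', 'min']})  # multiple functions on 'tip' -> hierarchical columns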

    Returning aggregated data without row indexes

    • as_index=False

    In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this isn't always desirable, you can disable this behavior in most cases by passing as_index=False to groupby:

    tips.groupby(['day', 'smoker'], as_index=False).mean()
    
        day smoker  total_bill       tip      size   tip_pct
    0   Fri     No   18.420000  2.812500  2.250000  0.151650
    1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
    2   Sat     No   19.661778  3.102889  2.555556  0.158048
    3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
    4   Sun     No   20.506667  3.167895  2.929825  0.160113
    5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
    6  Thur     No   17.113111  2.673778  2.488889  0.160298
    7  Thur    Yes   19.190588  3.030000  2.352941  0.163863

    Of course, it's always possible to obtain the result in this format by calling reset_index on the result, but using as_index=False avoids some unnecessary computations.
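
    The reset_index spelling of the same result looks like this (on recent pandas you may need mean(numeric_only=True), since the frame has non-numeric columns):

    tips.groupby(['day', 'smoker']).mean().reset_index()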
