  • Data Visualization Basics Series (34): Pandas Basics (14), Grouping (2): Aggregation/apply


    Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. These operations are similar to the aggregating API, window API, and resample API.

    An obvious one is aggregation via the aggregate() or equivalently agg() method:

    In [67]: grouped = df.groupby("A")
    In [68]: grouped.aggregate(np.sum)
                C         D
    bar  0.392940  1.732707
    foo -1.796421  2.824590
    In [69]: grouped = df.groupby(["A", "B"])
    In [70]: grouped.aggregate(np.sum)
                      C         D
    A   B                        
    bar one    0.254161  1.511763
        three  0.215897 -0.990582
        two   -0.077118  1.211526
    foo one   -0.983776  1.614581
        three -0.862495  0.024580
        two    0.049851  1.185429

    As you can see, the result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option:

    In [71]: grouped = df.groupby(["A", "B"], as_index=False)
    In [72]: grouped.aggregate(np.sum)
         A      B         C         D
    0  bar    one  0.254161  1.511763
    1  bar  three  0.215897 -0.990582
    2  bar    two -0.077118  1.211526
    3  foo    one -0.983776  1.614581
    4  foo  three -0.862495  0.024580
    5  foo    two  0.049851  1.185429
    In [73]: df.groupby("A", as_index=False).sum()
         A         C         D
    0  bar  0.392940  1.732707
    1  foo -1.796421  2.824590

    Note that you could use the reset_index DataFrame function to achieve the same result as the column names are stored in the resulting MultiIndex:

    In [74]: df.groupby(["A", "B"]).sum().reset_index()
         A      B         C         D
    0  bar    one  0.254161  1.511763
    1  bar  three  0.215897 -0.990582
    2  bar    two -0.077118  1.211526
    3  foo    one -0.983776  1.614581
    4  foo  three -0.862495  0.024580
    5  foo    two  0.049851  1.185429
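The equivalence between as_index=False and reset_index() can be checked directly. The following is a minimal sketch on a small made-up frame (the values are assumptions for illustration, not the df used in the transcript above):

```python
import pandas as pd

# Small illustrative frame; the values are made up for this sketch.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)

# Group keys are kept as ordinary columns in the result.
flat = df.groupby(["A", "B"], as_index=False)[["C"]].sum()

# Group keys are first stored in a MultiIndex, then moved back into columns.
via_reset = df.groupby(["A", "B"])[["C"]].sum().reset_index()

print(flat)
print(flat.equals(via_reset))
```

Both routes should give a frame with columns A, B and C and a default integer index, so which one to use is mostly a matter of taste.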

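    Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index consists of the group names and whose values are the sizes of each group. Because grouped was created above with as_index=False, the result shown here is instead a DataFrame with the sizes in a size column. The describe method, shown after it, generates descriptive statistics for each group.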

    In [75]: grouped.size()
         A      B  size
    0  bar    one     1
    1  bar  three     1
    2  bar    two     1
    3  foo    one     2
    4  foo  three     1
    5  foo    two     2
    In [76]: grouped.describe()
          C                                                    ...         D                                                  
      count      mean       std       min       25%       50%  ...       std       min       25%       50%       75%       max
    0   1.0  0.254161       NaN  0.254161  0.254161  0.254161  ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
    1   1.0  0.215897       NaN  0.215897  0.215897  0.215897  ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
    2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118  ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
    3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888  ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
    4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495  ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
    5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925  ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081
    [6 rows x 16 columns]

    Another aggregation example is to compute the number of unique values in each group. This is similar to the value_counts function, except that it only counts how many unique values there are.

    In [77]: ll = [['foo', 1], ['foo', 2], ['foo', 2], ['bar', 1], ['bar', 1]]
    In [78]: df4 = pd.DataFrame(ll, columns=["A", "B"])
    In [79]: df4
         A  B
    0  foo  1
    1  foo  2
    2  foo  2
    3  bar  1
    4  bar  1
    In [80]: df4.groupby("A")["B"].nunique()
    bar    1
    foo    2
    Name: B, dtype: int64
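To make the difference from value_counts concrete, here is a small sketch that runs both on the df4 data from the transcript above. The variable names distinct and frequencies are chosen only for this illustration:

```python
import pandas as pd

df4 = pd.DataFrame(
    [["foo", 1], ["foo", 2], ["foo", 2], ["bar", 1], ["bar", 1]],
    columns=["A", "B"],
)

# One number per group: how many distinct values of B occur.
distinct = df4.groupby("A")["B"].nunique()

# One number per (group, value) pair: how often each value of B occurs.
frequencies = df4.groupby("A")["B"].value_counts()

print(distinct)
print(frequencies)
```

nunique reduces each group to a single scalar, while value_counts returns one row per distinct value within each group.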

    Aggregating functions are the ones that reduce the dimension of the returned objects. Some common aggregating functions are tabulated below:

    Function      Description
    mean()        Compute mean of groups
    sum()         Compute sum of group values
    size()        Compute group sizes
    count()       Compute count of group
    std()         Standard deviation of groups
    var()         Compute variance of groups
    sem()         Standard error of the mean of groups
    describe()    Generates descriptive statistics
    first()       Compute first of group values
    last()        Compute last of group values
    nth()         Take nth value, or a subset if n is a list
    min()         Compute min of group values
    max()         Compute max of group values

    The aggregating functions above will exclude NA values. Any function that reduces a Series to a scalar value is an aggregation function and will work; a trivial example is df.groupby('A').agg(lambda ser: 1). Note that nth() can act as either a reducer or a filter.
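The sketch below illustrates NA handling and a custom reducer on a made-up frame (the data and the summary layout are assumptions for illustration). Note that size() counts rows, so it still includes the row holding NaN, whereas count() and mean() skip it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "bar", "bar"],
        "C": [1.0, np.nan, 3.0, 5.0],
    }
)
grouped = df.groupby("A")["C"]

summary = pd.DataFrame(
    {
        "size": grouped.size(),    # counts rows, including the NaN row
        "count": grouped.count(),  # counts non-NA values only
        "mean": grouped.mean(),    # NaN is skipped, not treated as 0
        # Any callable mapping a Series to a scalar works as an aggregation.
        "range": grouped.agg(lambda ser: ser.max() - ser.min()),
    }
)
print(summary)
```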

    1 Applying multiple functions at once

    With a grouped Series you can also pass a list or dict of functions to aggregate with, which outputs a DataFrame:

    In [81]: grouped = df.groupby("A")
    In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
              sum      mean       std
    bar  0.392940  0.130980  0.181231
    foo -1.796421 -0.359284  0.912265

    On a grouped DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:

    In [83]: grouped.agg([np.sum, np.mean, np.std])
                C                             D                    
              sum      mean       std       sum      mean       std
    bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
    foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

    The resulting aggregations are named for the functions themselves. If you need to rename, then you can add in a chained operation for a Series like this:

    In [84]: (
       ....:     grouped["C"]
       ....:     .agg([np.sum, np.mean, np.std])
       ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
       ....: )
              foo       bar       baz
    bar  0.392940  0.130980  0.181231
    foo -1.796421 -0.359284  0.912265

    For a grouped DataFrame, you can rename in a similar manner:

    In [85]: (
       ....:     grouped.agg([np.sum, np.mean, np.std]).rename(
       ....:         columns={"sum": "foo", "mean": "bar", "std": "baz"}
       ....:     )
       ....: )
                C                             D                    
              foo       bar       baz       foo       bar       baz
    bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
    foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785

    In general, the output column names should be unique. You can’t apply the same function (or two functions with the same name) to the same column.

    In [86]: grouped["C"].agg(["sum", "sum"])
              sum       sum
    bar  0.392940  0.392940
    foo -1.796421 -1.796421

    pandas does allow you to provide multiple lambdas. In this case, pandas will mangle the name of the (nameless) lambda functions, appending _<i> to each subsequent lambda.

    In [87]: grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])
         <lambda_0>  <lambda_1>
    bar    0.331279    0.084917
    foo    2.337259   -0.215962
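If the mangled lambda names are not wanted, one option is to pass ordinary named functions, since the output columns are labelled with each function's name. This is a minimal sketch; the data and the helper names value_range and median_minus_mean are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "foo", "bar", "bar"], "C": [1.0, 4.0, 10.0, 12.0]}
)

def value_range(ser):
    # Spread between the largest and smallest value in the group.
    return ser.max() - ser.min()

def median_minus_mean(ser):
    # Difference between the group's median and its mean.
    return ser.median() - ser.mean()

result = df.groupby("A")["C"].agg([value_range, median_minus_mean])
print(result.columns.tolist())
print(result)
```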

    2 Named aggregation

    To support column-specific aggregation with control over the output column names, pandas accepts a special syntax in GroupBy.agg(), known as “named aggregation”, where

    • The keywords are the output column names

    • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.

    In [88]: animals = pd.DataFrame(
       ....:     {
       ....:         "kind": ["cat", "dog", "cat", "dog"],
       ....:         "height": [9.1, 6.0, 9.5, 34.0],
       ....:         "weight": [7.9, 7.5, 9.9, 198.0],
       ....:     }
       ....: )
    In [89]: animals
      kind  height  weight
    0  cat     9.1     7.9
    1  dog     6.0     7.5
    2  cat     9.5     9.9
    3  dog    34.0   198.0
    In [90]: animals.groupby("kind").agg(
       ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
       ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
       ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
       ....: )
          min_height  max_height  average_weight
    cat          9.1         9.5            8.90
    dog          6.0        34.0          102.75

    pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.

    In [91]: animals.groupby("kind").agg(
       ....:     min_height=("height", "min"),
       ....:     max_height=("height", "max"),
       ....:     average_weight=("weight", np.mean),
       ....: )
          min_height  max_height  average_weight
    cat          9.1         9.5            8.90
    dog          6.0        34.0          102.75

    If your desired output column names are not valid Python identifiers (for example, because they contain a space), construct a dictionary and unpack the keyword arguments:

    In [92]: animals.groupby("kind").agg(
       ....:     **{
       ....:         "total weight": pd.NamedAgg(column="weight", aggfunc=sum)
       ....:     }
       ....: )
          total weight
    cat           17.8
    dog          205.5

    Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation function requires additional arguments, partially apply them with functools.partial().
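A sketch of the functools.partial() approach follows. The helper share_above and its threshold parameter are hypothetical names introduced for this illustration, and the frame reuses the heights from the animals example:

```python
from functools import partial

import pandas as pd

animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
    }
)

def share_above(values, threshold):
    """Fraction of values strictly greater than threshold (hypothetical helper)."""
    return (values > threshold).mean()

# The extra argument is bound up front, so agg() only sees a one-argument callable.
result = animals.groupby("kind").agg(
    tall_share=("height", partial(share_above, threshold=9.0)),
)
print(result)
```

Binding the argument beforehand keeps the keyword arguments of agg() reserved for output column names.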


    For Python 3.5 and earlier, the order of **kwargs in a function was not preserved. This means that the output column ordering would not be consistent. To ensure consistent ordering, the keys (and so the output columns) will always be sorted for Python 3.5.

    Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.

    In [93]: animals.groupby("kind").height.agg(
       ....:     min_height="min",
       ....:     max_height="max",
       ....: )
          min_height  max_height
    cat          9.1         9.5
    dog          6.0        34.0

    3 Applying different functions to DataFrame columns

    By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:

    In [94]: grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)})
                C         D
    bar  0.392940  1.366330
    foo -1.796421  0.884785

    The function names can also be strings. In order for a string to be valid it must be either implemented on GroupBy or available via dispatching:

    In [95]: grouped.agg({"C": "sum", "D": "std"})
                C         D
    bar  0.392940  1.366330
    foo -1.796421  0.884785

    4 Cython-optimized aggregation functions

    Some common aggregations, currently only sum, mean, std, and sem, have optimized Cython implementations:

    In [96]: df.groupby("A").sum()
                C         D
    bar  0.392940  1.732707
    foo -1.796421  2.824590
    In [97]: df.groupby(["A", "B"]).mean()
                      C         D
    A   B                        
    bar one    0.254161  1.511763
        three  0.215897 -0.990582
        two   -0.077118  1.211526
    foo one   -0.491888  0.807291
        three -0.862495  0.024580
        two    0.024925  0.592714

    Of course sum and mean are implemented on pandas objects, so the above code would work even without the special versions via dispatching (see below).
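As a rough illustration, the built-in method and an equivalent Python-level callable should agree on the result. The sketch below uses randomly generated data (an assumption for illustration) and only checks equality; it does not measure speed, although the callable version is generally expected to be slower on large inputs because it runs once per group in Python:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "A": rng.choice(["foo", "bar"], size=1_000),
        "C": rng.standard_normal(1_000),
    }
)
grouped = df.groupby("A")["C"]

# Built-in aggregation, dispatched to the library's own implementation.
builtin_mean = grouped.mean()

# Same computation through a Python callable, invoked once per group.
callable_mean = grouped.agg(lambda ser: ser.mean())

print(np.allclose(builtin_mean, callable_mean))
```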

    Flexible apply

    Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for both aggregate and transform in many standard use cases. However, apply can handle some exceptional use cases, for example:

    In [156]: df
         A      B         C         D
    0  foo    one -0.575247  1.346061
    1  bar    one  0.254161  1.511763
    2  foo    two -1.143704  1.627081
    3  bar  three  0.215897 -0.990582
    4  foo    two  1.193555 -0.441652
    5  bar    two -0.077118  1.211526
    6  foo    one -0.408530  0.268520
    7  foo  three -0.862495  0.024580
    In [157]: grouped = df.groupby("A")
    # could also just call .describe()
    In [158]: grouped["C"].apply(lambda x: x.describe())
    bar  count    3.000000
         mean     0.130980
         std      0.181231
         min     -0.077118
         25%      0.069390
                    ...   
    foo  min     -1.143704
         25%     -0.862495
         50%     -0.575247
         75%     -0.408530
         max      1.193555
    Name: C, Length: 16, dtype: float64

    The dimension of the returned result can also change:

    In [159]: grouped = df.groupby('A')['C']
    In [160]: def f(group):
       .....:     return pd.DataFrame({'original': group,
       .....:                          'demeaned': group - group.mean()})
    In [161]: grouped.apply(f)
       original  demeaned
    0 -0.575247 -0.215962
    1  0.254161  0.123181
    2 -1.143704 -0.784420
    3  0.215897  0.084917
    4  1.193555  1.552839
    5 -0.077118 -0.208098
    6 -0.408530 -0.049245
    7 -0.862495 -0.503211
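A self-contained version of this pattern is sketched below on made-up data. Depending on the pandas version, the result index of such an apply may or may not be prefixed with the group keys; passing group_keys=False (an assumption about the desired layout, matching the output above) asks for the original row labels only:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 10.0, 3.0, 20.0]}
)

def with_demeaned(group):
    # group is the Series of C values belonging to one value of A.
    return pd.DataFrame(
        {"original": group, "demeaned": group - group.mean()}
    )

# group_keys=False keeps the original row labels in the result index.
result = df.groupby("A", group_keys=False)["C"].apply(with_demeaned)
print(result.sort_index())
```

Each group contributes a two-column frame, so the combined result has one more dimension than the grouped Series.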

    apply on a Series can also handle an applied function that itself returns a Series, in which case the result may be upcast to a DataFrame:

    In [162]: def f(x):
       .....:     return pd.Series([x, x ** 2], index=["x", "x^2"])
    In [163]: s = pd.Series(np.random.rand(5))
    In [164]: s
    0    0.321438
    1    0.493496
    2    0.139505
    3    0.910103
    4    0.194158
    dtype: float64
    In [165]: s.apply(f)
              x       x^2
    0  0.321438  0.103323
    1  0.493496  0.243538
    2  0.139505  0.019462
    3  0.910103  0.828287
    4  0.194158  0.037697
  • Original article: https://www.cnblogs.com/qiu-hua/p/14876670.html