Aggregation
Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. These operations are similar to the aggregating API, window API, and resample API.
An obvious one is aggregation via the aggregate() or equivalently agg() method:
In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429
As you can see, the result of the aggregation will have the group names as the new index along the grouped axis. In the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index option:
In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False).sum()
Out[73]: 
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590
Note that you could use the reset_index DataFrame function to achieve the same result, as the column names are stored in the resulting MultiIndex:
In [74]: df.groupby(["A", "B"]).sum().reset_index()
Out[74]: 
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429
Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index consists of the group names and whose values are the sizes of each group.
In [75]: grouped.size()
Out[75]: 
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2
In [76]: grouped.describe()
Out[76]: 
       C                                                     ...         D                                                  
   count      mean       std       min       25%       50%   ...       std       min       25%       50%       75%       max
0    1.0  0.254161       NaN  0.254161  0.254161  0.254161   ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1    1.0  0.215897       NaN  0.215897  0.215897  0.215897   ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2    1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118   ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3    2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888   ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4    1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495   ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5    2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925   ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081

[6 rows x 16 columns]
Another aggregation example is to compute the number of unique values of each group. This is similar to the value_counts function, except that it only counts unique values.
In [77]: ll = [['foo', 1], ['foo', 2], ['foo', 2], ['bar', 1], ['bar', 1]]

In [78]: df4 = pd.DataFrame(ll, columns=["A", "B"])

In [79]: df4
Out[79]: 
     A  B
0  foo  1
1  foo  2
2  foo  2
3  bar  1
4  bar  1

In [80]: df4.groupby("A")["B"].nunique()
Out[80]: 
A
bar    1
foo    2
Name: B, dtype: int64
Aggregating functions are the ones that reduce the dimension of the returned objects. Some common aggregating functions are tabulated below:
| Function | Description |
|---|---|
| mean() | Compute mean of groups |
| sum() | Compute sum of group values |
| size() | Compute group sizes |
| count() | Compute count of group |
| std() | Standard deviation of groups |
| var() | Compute variance of groups |
| sem() | Standard error of the mean of groups |
| describe() | Generates descriptive statistics |
| first() | Compute first of group values |
| last() | Compute last of group values |
| nth() | Take nth value, or a subset if n is a list |
| min() | Compute min of group values |
| max() | Compute max of group values |
The aggregating functions above will exclude NA values. Any function which reduces a Series to a scalar value is an aggregation function and will work; a trivial example is df.groupby('A').agg(lambda ser: 1). Note that nth() can act as a reducer or a filter, see here.
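As a minimal sketch (not taken from the docs), assuming the same df with group keys in column "A" and a numeric column "C", any callable that reduces each group's values to a scalar can be passed to agg(), for instance a peak-to-peak range per group:

# Sketch: a user-defined reducer passed to agg(); each group's Series
# is collapsed to a single scalar (max minus min).
peak_to_peak = df.groupby("A")["C"].agg(lambda ser: ser.max() - ser.min())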
Applying multiple functions at once
With grouped Series you can also pass a list or dict of functions to do aggregation with, outputting a DataFrame:
In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]: 
          sum      mean       std
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
On a grouped DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:
In [83]: grouped.agg([np.sum, np.mean, np.std])
Out[83]: 
            C                             D                    
          sum      mean       std       sum      mean       std
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785
The resulting aggregations are named for the functions themselves. If you need to rename, then you can add in a chained operation for a Series like this:
In [84]: (
   ....:     grouped["C"]
   ....:     .agg([np.sum, np.mean, np.std])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....: 
Out[84]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
For a grouped DataFrame, you can rename in a similar manner:
In [85]: (
   ....:     grouped.agg([np.sum, np.mean, np.std]).rename(
   ....:         columns={"sum": "foo", "mean": "bar", "std": "baz"}
   ....:     )
   ....: )
   ....: 
Out[85]: 
            C                             D                    
          foo       bar       baz       foo       bar       baz
A                                                              
bar  0.392940  0.130980  0.181231  1.732707  0.577569  1.366330
foo -1.796421 -0.359284  0.912265  2.824590  0.564918  0.884785
In general, the output column names should be unique. If you apply the same function (or two functions with the same name) to the same column, the resulting output columns will share a name:
In [86]: grouped["C"].agg(["sum", "sum"])
Out[86]:
sum sum
A
bar 0.392940 0.392940
foo -1.796421 -1.796421
pandas does allow you to provide multiple lambdas. In this case, pandas will mangle the name of the (nameless) lambda functions, appending _<i> to each subsequent lambda.
In [87]: grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])
Out[87]:
<lambda_0> <lambda_1>
A
bar 0.331279 0.084917
foo 2.337259 -0.215962
Named aggregation
To support column-specific aggregation with control over the output column names, pandas accepts the special syntax in GroupBy.agg(), known as “named aggregation”, where

- The keywords are the output column names
- The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column. pandas provides the pandas.NamedAgg namedtuple with the fields ['column', 'aggfunc'] to make it clearer what the arguments are. As usual, the aggregation can be a callable or a string alias.
In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
   ....: 

In [89]: animals
Out[89]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
   ....: 
Out[90]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
pandas.NamedAgg is just a namedtuple. Plain tuples are allowed as well.
In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
   ....: 
Out[91]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
If your desired output column names are not valid Python keywords, construct a dictionary and unpack the keyword arguments:
In [92]: animals.groupby("kind").agg(
   ....:     **{
   ....:         "total weight": pd.NamedAgg(column="weight", aggfunc=sum)
   ....:     }
   ....: )
   ....: 
Out[92]: 
      total weight
kind              
cat           17.8
dog          205.5
Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column, aggfunc) should be passed as **kwargs. If your aggregation function requires additional arguments, partially apply them with functools.partial().
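For instance, here is a sketch (assuming the animals frame from above, with a hypothetical output name height_q75) of fixing an extra argument up front so that only a (column, aggfunc) pair is passed:

import functools

# Sketch: bind the extra argument q=0.75 with functools.partial so the
# resulting callable takes only the group Series, as agg expects.
q75 = functools.partial(pd.Series.quantile, q=0.75)

animals.groupby("kind").agg(height_q75=("height", q75))  # illustrative column name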
Note
For Python 3.5 and earlier, the order of **kwargs in a function was not preserved. This means that the output column ordering would not be consistent. To ensure consistent ordering, the keys (and so the output columns) will always be sorted for Python 3.5.
Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.
In [93]: animals.groupby("kind").height.agg(
   ....:     min_height="min",
   ....:     max_height="max",
   ....: )
   ....: 
Out[93]: 
      min_height  max_height
kind                        
cat          9.1         9.5
dog          6.0        34.0
Applying different functions to DataFrame columns
By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:
In [94]: grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)})
Out[94]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785
The function names can also be strings. In order for a string to be valid it must be either implemented on GroupBy or available via dispatching:
In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]: 
            C         D
A                      
bar  0.392940  1.366330
foo -1.796421  0.884785
Cython-optimized aggregation functions
Some common aggregations, currently only sum, mean, std, and sem, have optimized Cython implementations:
In [96]: df.groupby("A").sum()
Out[96]: 
            C         D
A                      
bar  0.392940  1.732707
foo -1.796421  2.824590

In [97]: df.groupby(["A", "B"]).mean()
Out[97]: 
                  C         D
A   B                        
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.491888  0.807291
    three -0.862495  0.024580
    two    0.024925  0.592714
Of course sum and mean are implemented on pandas objects, so the above code would work even without the special versions via dispatching (see below).
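As a quick illustrative check (a sketch, not from the docs), the optimized path and the generic agg path should produce the same numbers on column "C":

# Sketch: compare the Cython-optimized .sum() with the generic
# callable path that dispatches to Series.sum per group.
fast = df.groupby("A")["C"].sum()
generic = df.groupby("A")["C"].agg(lambda g: g.sum())
assert np.allclose(fast, generic)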
Flexible apply
Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for both aggregate and transform in many standard use cases. However, apply can handle some exceptional use cases, for example:
In [156]: df
Out[156]: 
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]: 
A         
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...    
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64
The dimension of the returned result can also change:
In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....: 

In [161]: grouped.apply(f)
Out[161]: 
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
apply on a Series can operate on a returned value from the applied function that is itself a Series, and possibly upcast the result to a DataFrame:
In [162]: def f(x):
   .....:     return pd.Series([x, x ** 2], index=["x", "x^2"])
   .....: 

In [163]: s = pd.Series(np.random.rand(5))

In [164]: s
Out[164]: 
0    0.321438
1    0.493496
2    0.139505
3    0.910103
4    0.194158
dtype: float64

In [165]: s.apply(f)
Out[165]: 
          x       x^2
0  0.321438  0.103323
1  0.493496  0.243538
2  0.139505  0.019462
3  0.910103  0.828287
4  0.194158  0.037697