1 Introduction: Group by: split-apply-combine
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria.
- Applying a function to each group independently.
- Combining the results into a data structure.
Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following:
- Aggregation: compute a summary statistic (or statistics) for each group. Some examples:
  - Compute group sums or means.
  - Compute group sizes / counts.
- Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
  - Standardize data (zscore) within a group.
  - Fill NAs within groups with a value derived from each group.
- Filtration: discard some groups, according to a group-wise computation that evaluates to True or False. Some examples:
  - Discard data that belongs to groups with only a few members.
  - Filter out data based on the group sum or mean.
- Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into any of the above three categories.
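The three kinds of apply step can be sketched on a small, made-up frame. The column names `key` and `val` and the group-size threshold are illustrative assumptions, not part of this guide's running examples.

```python
import pandas as pd

# Hypothetical toy data: group "a" has two rows, group "b" has three.
df = pd.DataFrame(
    {"key": ["a", "a", "b", "b", "b"], "val": [1.0, 2.0, 3.0, 4.0, 5.0]}
)
grouped = df.groupby("key")

# Aggregation: one summary value per group.
sums = grouped["val"].sum()

# Transformation: a like-indexed result (here, demeaning within each group).
demeaned = grouped["val"].transform(lambda x: x - x.mean())

# Filtration: keep only the rows of groups with more than two members.
large_groups = grouped.filter(lambda g: len(g) > 2)
```

Note that the aggregation result is indexed by group key, while the transformation result keeps the index of the original frame.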
Since the set of object instance methods on pandas data structures is generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. The name GroupBy should be quite familiar to those who have used a SQL-based tool (or itertools), in which you can write code like:
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
We aim to make operations like this natural and easy to express using pandas. We’ll address each area of GroupBy functionality then provide some non-trivial examples / use cases.
See the cookbook for some advanced strategies.
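As a rough pandas counterpart to the SQL query above, the following sketch groups by two columns and applies a different aggregation to each of the remaining ones. The frame `some_table` and its values are invented stand-ins for `SomeTable`.

```python
import pandas as pd

# Hypothetical stand-in for SomeTable in the SQL query.
some_table = pd.DataFrame(
    {
        "Column1": ["x", "x", "y", "y"],
        "Column2": ["p", "p", "q", "r"],
        "Column3": [1.0, 3.0, 5.0, 7.0],
        "Column4": [10, 20, 30, 40],
    }
)

# GROUP BY Column1, Column2 with mean(Column3) and sum(Column4).
result = (
    some_table.groupby(["Column1", "Column2"])
    .agg({"Column3": "mean", "Column4": "sum"})
    .reset_index()  # turn the group keys back into ordinary columns
)
```

The `reset_index()` call is only there to mimic the flat shape of a SQL result; without it the group keys form a MultiIndex on the rows.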
2 Splitting an object into groups
pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you may do the following:
In [1]: df = pd.DataFrame(
   ...:     [
   ...:         ("bird", "Falconiformes", 389.0),
   ...:         ("bird", "Psittaciformes", 24.0),
   ...:         ("mammal", "Carnivora", 80.2),
   ...:         ("mammal", "Primates", np.nan),
   ...:         ("mammal", "Carnivora", 58),
   ...:     ],
   ...:     index=["falcon", "parrot", "lion", "monkey", "leopard"],
   ...:     columns=("class", "order", "max_speed"),
   ...: )

In [2]: df
Out[2]:
          class           order  max_speed
falcon     bird   Falconiformes      389.0
parrot     bird  Psittaciformes       24.0
lion     mammal       Carnivora       80.2
monkey   mammal        Primates        NaN
leopard  mammal       Carnivora       58.0

# default is axis=0
In [3]: grouped = df.groupby("class")

In [4]: grouped = df.groupby("order", axis="columns")

In [5]: grouped = df.groupby(["class", "order"])
The mapping can be specified many different ways:
- A Python function, to be called on each of the axis labels.
- A list or NumPy array of the same length as the selected axis.
- A dict or Series, providing a label -> group name mapping.
- For DataFrame objects, a string indicating either a column name or an index level name to be used to group.
- df.groupby('A') is just syntactic sugar for df.groupby(df['A']).
- A list of any of the above things.
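The kinds of key listed above can be compared side by side. This is a minimal sketch on an invented three-row frame; the index labels and group names (`g1`, `left`, and so on) are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo"], "C": [1, 2, 3]},
    index=["x1", "y1", "x2"],
)

# A function, called on each index label (here: its first character).
by_func = df.groupby(lambda label: label[0])["C"].sum()

# A NumPy array with one entry per row.
by_array = df.groupby(np.array(["g1", "g2", "g1"]))["C"].sum()

# A dict mapping index labels to group names.
by_dict = df.groupby({"x1": "left", "y1": "right", "x2": "left"})["C"].sum()

# A column name, and the equivalent explicit Series.
by_name = df.groupby("A")["C"].sum()
by_series = df.groupby(df["A"])["C"].sum()
```

All five calls split the same rows into two groups; only the way the label-to-group mapping is expressed differs.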
Collectively we refer to the grouping objects as the keys. For example, consider the following DataFrame:
In [6]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )

In [7]: df
Out[7]:
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860
On a DataFrame, we obtain a GroupBy object by calling groupby(). We could naturally group by either the A or B columns, or both:
In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])
New in version 0.24.
If we also have a MultiIndex on columns A and B, we can group by all but the specified columns:
In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]:
            C         D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
These will split the DataFrame on its index (rows). We could also split by the columns:
In [13]: def get_letter_type(letter):
   ....:     if letter.lower() in 'aeiou':
   ....:         return 'vowel'
   ....:     else:
   ....:         return 'consonant'
   ....:

In [14]: grouped = df.groupby(get_letter_type, axis=1)
pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group and thus the output of aggregation functions will only contain unique index values:
In [15]: lst = [1, 2, 3, 1, 2, 3]

In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)

In [17]: grouped = s.groupby(level=0)

In [18]: grouped.first()
Out[18]:
1    1
2    2
3    3
dtype: int64

In [19]: grouped.last()
Out[19]:
1    10
2    20
3    30
dtype: int64

In [20]: grouped.sum()
Out[20]:
1    11
2    22
3    33
dtype: int64
Note that no splitting occurs until it’s needed. Creating the GroupBy object only verifies that you’ve passed a valid mapping.
3 GroupBy sorting
By default the group keys are sorted during the groupby operation. You may however pass sort=False for potential speedups:
In [21]: df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})

In [22]: df2.groupby(["X"]).sum()
Out[22]:
   Y
X
A  7
B  3

In [23]: df2.groupby(["X"], sort=False).sum()
Out[23]:
   Y
X
B  3
A  7
Note that groupby will preserve the order in which observations are sorted within each group. For example, the groups created by groupby() below are in the order they appeared in the original DataFrame:
In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby(["X"]).get_group("A")
Out[25]:
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby(["X"]).get_group("B")
Out[26]:
   X  Y
1  B  4
3  B  2
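Both ordering behaviours described in this section can be checked in a few lines. The sketch below uses an invented frame; collecting each group's values with `apply(list)` is just one convenient way to inspect the within-group order.

```python
import pandas as pd

df = pd.DataFrame({"X": ["B", "A", "B", "A"], "Y": [1, 2, 3, 4]})

# Default: group keys come out sorted.
sorted_keys = df.groupby("X")["Y"].sum()

# sort=False: group keys come out in order of first appearance.
unsorted_keys = df.groupby("X", sort=False)["Y"].sum()

# Within each group, rows keep their original relative order.
within_group = df.groupby("X")["Y"].apply(list)
```

Here `sort` only affects the order of the group keys in the result, not the order of rows inside each group.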
4 GroupBy dropna
By default NA values are excluded from group keys during the groupby operation. To include NA values in group keys, pass dropna=False.
In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]:
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]:
     a  c
b
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]:
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4
The default value of the dropna argument is True, which means NA values are not included in group keys.
5 GroupBy object attributes
The groups attribute is a dict whose keys are the computed unique groups and whose values are the axis labels belonging to each group. In the above example we have:
In [32]: df.groupby("A").groups
Out[32]: {'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}

In [33]: df.groupby(get_letter_type, axis=1).groups
Out[33]: {'consonant': ['B', 'C', 'D'], 'vowel': ['A']}
Calling the standard Python len function on the GroupBy object just returns the length of the groups dict, so it is largely just a convenience:
In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6
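The relationship between `groups` and `len` can be verified on a small invented frame. The values below are assumptions chosen so that one group has more than one member.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "B": ["one", "one", "two", "one"]}
)
grouped = df.groupby(["A", "B"])

# With several keys, each dict key is a tuple of group values;
# each dict value holds the row labels that fall in that group.
labels = {key: list(rows) for key, rows in grouped.groups.items()}

# len() counts the groups, not the rows.
n_groups = len(grouped)
```

The `ngroups` attribute, which appears in the tab-completion listing further down, reports the same count.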
GroupBy will tab complete column names (and other attributes):
In [37]: df
Out[37]:
               height      weight  gender
2000-01-01  42.849980  157.500553    male
2000-01-02  49.607315  177.340407    male
2000-01-03  56.293531  171.524640    male
2000-01-04  48.421077  144.251986  female
2000-01-05  46.556882  152.526206    male
2000-01-06  68.448851  168.272968  female
2000-01-07  70.757698  136.431469    male
2000-01-08  58.909500  176.499753  female
2000-01-09  76.435631  174.094104  female
2000-01-10  45.306120  177.540920    male

In [38]: gb = df.groupby("gender")

In [39]: gb.<TAB>  # noqa: E225, E999
gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight
6 GroupBy with MultiIndex
With hierarchically-indexed data, it’s quite natural to group by one of the levels of the hierarchy.
Let’s create a Series with a two-level MultiIndex.
In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]:
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64
We can then group by one of the levels in s.
In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]:
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64
If the MultiIndex has names specified, these can be passed instead of the level number:
In [46]: s.groupby(level="second").sum()
Out[46]:
second
one    0.980950
two    1.991575
dtype: float64
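In older pandas versions, aggregation functions such as sum also take the level parameter directly, and the resulting index is named according to the chosen level. This keyword was deprecated in pandas 1.3 and removed in 2.0; on current versions use s.groupby(level="second").sum() as shown above.
In [47]: s.sum(level="second")
Out[47]:
second
one    0.980950
two    1.991575
dtype: float64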
Grouping with multiple levels is supported.
In [48]: s
Out[48]:
first  second  third
bar    doo     one     -1.131345
               two     -0.089329
baz    bee     one      0.337863
               two     -0.945867
foo    bop     one     -0.932132
               two      1.956030
qux    bop     one      0.017587
               two     -0.016692
dtype: float64

In [49]: s.groupby(level=["first", "second"]).sum()
Out[49]:
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64
Index level names may be supplied as keys.
In [50]: s.groupby(["first", "second"]).sum()
Out[50]:
first  second
bar    doo      -1.220674
baz    bee      -0.608004
foo    bop       1.023898
qux    bop       0.000895
dtype: float64
More on the sum function and aggregation later.
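The three ways of naming an index level described in this section should produce the same grouping. The sketch below checks this on a small Series with fixed values, which are invented here so that the sums are easy to verify by hand.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]],
    names=["first", "second"],
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

by_position = s.groupby(level=1).sum()          # level number
by_level_name = s.groupby(level="second").sum()  # level name via level=
by_key = s.groupby("second").sum()               # level name as a key
```

Passing the level name is usually the more readable choice, since it does not depend on the order of the levels.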
7 Grouping DataFrame with Index levels and columns
A DataFrame may be grouped by a combination of columns and index levels by specifying the column names as strings and the index levels as pd.Grouper objects.
In [51]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]

In [52]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [53]: df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)

In [54]: df
Out[54]:
              A  B
first second
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7
The following example groups df by the second index level and the A column.
In [55]: df.groupby([pd.Grouper(level=1), "A"]).sum()
Out[55]:
          B
second A
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7
Index levels may also be specified by name.
In [56]: df.groupby([pd.Grouper(level="second"), "A"]).sum()
Out[56]:
          B
second A
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7
Index level names may be specified as keys directly to groupby.
In [57]: df.groupby(["second", "A"]).sum()
Out[57]:
          B
second A
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7
8 DataFrame column selection in GroupBy
Once you have created the GroupBy object from a DataFrame, you might want to do something different for each of the columns. Thus, using [] similar to getting a column from a DataFrame, you can do:
In [58]: grouped = df.groupby(["A"])

In [59]: grouped_C = grouped["C"]

In [60]: grouped_D = grouped["D"]
This is mainly syntactic sugar for the alternative and much more verbose:
In [61]: df["C"].groupby(df["A"])
Out[61]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fc1ed01ca90>
Additionally this method avoids recomputing the internal grouping information derived from the passed key.
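To make the equivalence concrete, the sketch below compares the short and verbose forms on an invented frame. The last line, which selects several columns with a list, is not covered in this section and is included as an assumption about how the same `[]` syntax extends.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo"], "C": [1.0, 2.0, 3.0], "D": [10.0, 20.0, 30.0]}
)
grouped = df.groupby("A")

# Column selection on the GroupBy object ...
short_form = grouped["C"].sum()

# ... versus grouping the column by an explicit key Series.
verbose_form = df["C"].groupby(df["A"]).sum()

# A list of names selects several columns and yields a DataFrame result.
both = grouped[["C", "D"]].sum()
```

Reusing `grouped` for each selection is what lets pandas avoid recomputing the grouping for every column.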