  • Data Visualization Fundamentals (32): Pandas Basics (12), Grouping (1): Splitting an object into groups

    1 Introduction: Group by (split-apply-combine)

    By “group by” we are referring to a process involving one or more of the following steps:

    • Splitting the data into groups based on some criteria.

    • Applying a function to each group independently.

    • Combining the results into a data structure.

    Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following:

    • Aggregation: compute a summary statistic (or statistics) for each group. Some examples:

      • Compute group sums or means.

      • Compute group sizes / counts.

    • Transformation: perform some group-specific computations and return a like-indexed object. Some examples:

      • Standardize data (z-score) within a group.

      • Fill NAs within groups with a value derived from each group.

    • Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

      • Discard data that belongs to groups with only a few members.

      • Filter out data based on the group sum or mean.

    • Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into any of the categories above.
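
    The three kinds of apply step can be sketched as follows. The DataFrame, its column names (key, val) and the size threshold are made up for illustration and are not part of the original examples.

```python
import pandas as pd

# Hypothetical toy data: two groups identified by the "key" column.
df = pd.DataFrame(
    {"key": ["a", "a", "b", "b", "b"], "val": [1.0, 3.0, 2.0, 4.0, 6.0]}
)
grouped = df.groupby("key")["val"]

# Aggregation: one summary value per group.
group_means = grouped.mean()

# Transformation: a like-indexed result, here the deviation from the group mean.
centered = grouped.transform(lambda x: x - x.mean())

# Filtration: keep only the rows of groups that have more than two members.
big_groups = df.groupby("key").filter(lambda g: len(g) > 2)

print(group_means)
print(centered)
print(big_groups)
```

    Aggregation shrinks the result to one row per group, transformation keeps the original index, and filtration returns a subset of the original rows.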

    Since the set of object instance methods on pandas data structures is generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. The name GroupBy should be quite familiar to those who have used a SQL-based tool (or itertools), in which you can write code like:

    SELECT Column1, Column2, mean(Column3), sum(Column4)
    FROM SomeTable
    GROUP BY Column1, Column2

    We aim to make operations like this natural and easy to express using pandas. We’ll address each area of GroupBy functionality, then provide some non-trivial examples and use cases.
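
    As a rough pandas counterpart to the SQL query above, one possible spelling is shown below. The contents of some_table are invented stand-in data; only the column names come from the query.

```python
import pandas as pd

# Hypothetical stand-in for SomeTable; the values are made up.
some_table = pd.DataFrame(
    {
        "Column1": ["x", "x", "y", "y"],
        "Column2": ["p", "p", "p", "q"],
        "Column3": [1.0, 3.0, 5.0, 7.0],
        "Column4": [10, 20, 30, 40],
    }
)

# GROUP BY Column1, Column2 with mean(Column3) and sum(Column4).
# as_index=False keeps the group keys as ordinary columns, like SQL output.
result = some_table.groupby(["Column1", "Column2"], as_index=False).agg(
    {"Column3": "mean", "Column4": "sum"}
)
print(result)
```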

    See the cookbook for some advanced strategies.

    2 Splitting an object into groups

    pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you may do the following:

    In [1]: df = pd.DataFrame(
       ...:     [
       ...:         ("bird", "Falconiformes", 389.0),
       ...:         ("bird", "Psittaciformes", 24.0),
       ...:         ("mammal", "Carnivora", 80.2),
       ...:         ("mammal", "Primates", np.nan),
       ...:         ("mammal", "Carnivora", 58),
       ...:     ],
       ...:     index=["falcon", "parrot", "lion", "monkey", "leopard"],
       ...:     columns=("class", "order", "max_speed"),
       ...: )
       ...: 
    
    In [2]: df
    Out[2]: 
              class           order  max_speed
    falcon     bird   Falconiformes      389.0
    parrot     bird  Psittaciformes       24.0
    lion     mammal       Carnivora       80.2
    monkey   mammal        Primates        NaN
    leopard  mammal       Carnivora       58.0
    
    # default is axis=0
    In [3]: grouped = df.groupby("class")
    
    In [4]: grouped = df.groupby("order", axis="columns")
    
    In [5]: grouped = df.groupby(["class", "order"])

    The mapping can be specified in many different ways:

    • A Python function, to be called on each of the axis labels.

    • A list or NumPy array of the same length as the selected axis.

    • A dict or Series, providing a label -> group name mapping.

    • For DataFrame objects, a string indicating either a column name or an index level name to be used to group. Note that df.groupby('A') is just syntactic sugar for df.groupby(df['A']).

    • A list of any of the above things.
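
    A minimal sketch of several of these key types, using a small made-up DataFrame (the index labels and group names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

# A function, called on each index label.
by_func = df.groupby(lambda label: label in ("w", "x"))["C"].sum()

# A NumPy array with the same length as the grouped axis.
by_array = df.groupby(np.array(["g1", "g2", "g1", "g2"]))["C"].sum()

# A dict mapping index labels to group names.
by_dict = df.groupby({"w": "left", "x": "left", "y": "right", "z": "right"})["C"].sum()

# A column name, and the equivalent Series key.
by_name = df.groupby("A")["C"].sum()
by_series = df.groupby(df["A"])["C"].sum()

print(by_func, by_array, by_dict, by_name, sep="\n\n")
```

    An array is used rather than a plain list of strings here because a list of strings can also be read as a list of column names.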

    Collectively we refer to the grouping objects as the keys. For example, consider the following DataFrame:

    In [6]: df = pd.DataFrame(
       ...:     {
       ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
       ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
       ...:         "C": np.random.randn(8),
       ...:         "D": np.random.randn(8),
       ...:     }
       ...: )
       ...: 
    
    In [7]: df
    Out[7]: 
         A      B         C         D
    0  foo    one  0.469112 -0.861849
    1  bar    one -0.282863 -2.104569
    2  foo    two -1.509059 -0.494929
    3  bar  three -1.135632  1.071804
    4  foo    two  1.212112  0.721555
    5  bar    two -0.173215 -0.706771
    6  foo    one  0.119209 -1.039575
    7  foo  three -1.044236  0.271860

    On a DataFrame, we obtain a GroupBy object by calling groupby(). We could naturally group by either the A or B columns, or both:

    In [8]: grouped = df.groupby("A")
    
    In [9]: grouped = df.groupby(["A", "B"])

    New in version 0.24.

    If we also have a MultiIndex on columns A and B, we can group by all but the specified columns:

    In [10]: df2 = df.set_index(["A", "B"])
    
    In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))
    
    In [12]: grouped.sum()
    Out[12]: 
                C         D
    A                      
    bar -1.591710 -1.739537
    foo -0.752861 -1.402938

    These will split the DataFrame on its index (rows). We could also split by the columns:

    In [13]: def get_letter_type(letter):
       ....:     if letter.lower() in 'aeiou':
       ....:         return 'vowel'
       ....:     else:
       ....:         return 'consonant'
       ....: 
    
    In [14]: grouped = df.groupby(get_letter_type, axis=1)

    pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby operation, all values for the same index value will be considered to be in one group and thus the output of aggregation functions will only contain unique index values:

    In [15]: lst = [1, 2, 3, 1, 2, 3]
    
    In [16]: s = pd.Series([1, 2, 3, 10, 20, 30], lst)
    
    In [17]: grouped = s.groupby(level=0)
    
    In [18]: grouped.first()
    Out[18]: 
    1    1
    2    2
    3    3
    dtype: int64
    
    In [19]: grouped.last()
    Out[19]: 
    1    10
    2    20
    3    30
    dtype: int64
    
    In [20]: grouped.sum()
    Out[20]: 
    1    11
    2    22
    3    33
    dtype: int64

    Note that no splitting occurs until it’s needed. Creating the GroupBy object only verifies that you’ve passed a valid mapping.
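
    The validation step can be observed directly: a key that does not name a column or index level is rejected when the GroupBy object is created, before any aggregation runs. The DataFrame and the misspelled key below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar"], "C": [1, 2]})

# A valid key: the GroupBy object is created and can be used later.
grouped = df.groupby("A")

# An invalid key fails immediately, at creation time.
try:
    df.groupby("no_such_column")
    key_error_raised = False
except KeyError:
    key_error_raised = True

print(key_error_raised)
print(grouped.sum())
```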

    3 GroupBy sorting

    By default the group keys are sorted during the groupby operation. You may however pass sort=False for potential speedups:

    In [21]: df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})
    
    In [22]: df2.groupby(["X"]).sum()
    Out[22]: 
       Y
    X   
    A  7
    B  3
    
    In [23]: df2.groupby(["X"], sort=False).sum()
    Out[23]: 
       Y
    X   
    B  3
    A  7

    Note that groupby will preserve the order in which observations are sorted within each group. For example, the rows of each group created by groupby() below appear in the same order as in the original DataFrame:

    In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})
    
    In [25]: df3.groupby(["X"]).get_group("A")
    Out[25]: 
       X  Y
    0  A  1
    2  A  3
    
    In [26]: df3.groupby(["X"]).get_group("B")
    Out[26]: 
       X  Y
    1  B  4
    3  B  2

    4 GroupBy dropna

    By default, NA values are excluded from group keys during the groupby operation. To include NA values in group keys, pass dropna=False.

    In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
    
    In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
    
    In [29]: df_dropna
    Out[29]: 
       a    b  c
    0  1  2.0  3
    1  1  NaN  4
    2  2  1.0  3
    3  1  2.0  2
    # Default ``dropna`` is set to True, which will exclude NaNs in keys
    In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
    Out[30]: 
         a  c
    b        
    1.0  2  3
    2.0  2  5
    
    # In order to allow NaN in keys, set ``dropna`` to False
    In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
    Out[31]: 
         a  c
    b        
    1.0  2  3
    2.0  2  5
    NaN  1  4

    The default setting of the dropna argument is True, which means NA values are not included in group keys.

    5 GroupBy object attributes

    The groups attribute is a dict whose keys are the computed unique groups and whose values are the axis labels belonging to each group. In the above example we have:

    In [32]: df.groupby("A").groups
    Out[32]: {'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}
    
    In [33]: df.groupby(get_letter_type, axis=1).groups
    Out[33]: {'consonant': ['B', 'C', 'D'], 'vowel': ['A']}

    Calling the standard Python len function on the GroupBy object just returns the length of the groups dict, so it is largely just a convenience:

    In [34]: grouped = df.groupby(["A", "B"])
    
    In [35]: grouped.groups
    Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
    
    In [36]: len(grouped)
    Out[36]: 6
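
    A short sketch relating len to the ngroups attribute and the size method, on a made-up DataFrame smaller than the one above:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "B": ["one", "one", "two", "one"]}
)
grouped = df.groupby(["A", "B"])

# len() counts the groups, the same number reported by ngroups.
n_groups = len(grouped)

# size() reports how many rows fall into each group.
sizes = grouped.size()

print(n_groups, grouped.ngroups)
print(sizes)
```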

    GroupBy will tab-complete column names (and other attributes). In this example df is a different DataFrame, with height, weight and gender columns:

    In [37]: df
    Out[37]: 
                   height      weight  gender
    2000-01-01  42.849980  157.500553    male
    2000-01-02  49.607315  177.340407    male
    2000-01-03  56.293531  171.524640    male
    2000-01-04  48.421077  144.251986  female
    2000-01-05  46.556882  152.526206    male
    2000-01-06  68.448851  168.272968  female
    2000-01-07  70.757698  136.431469    male
    2000-01-08  58.909500  176.499753  female
    2000-01-09  76.435631  174.094104  female
    2000-01-10  45.306120  177.540920    male
    
    In [38]: gb = df.groupby("gender")
    In [39]: gb.<TAB>
    gb.agg        gb.boxplot    gb.cummin     gb.describe   gb.filter     gb.get_group  gb.height     gb.last       gb.median     gb.ngroups    gb.plot       gb.rank       gb.std        gb.transform
    gb.aggregate  gb.count      gb.cumprod    gb.dtype      gb.first      gb.groups     gb.hist       gb.max        gb.min        gb.nth        gb.prod       gb.resample   gb.sum        gb.var
    gb.apply      gb.cummax     gb.cumsum     gb.fillna     gb.gender     gb.head       gb.indices    gb.mean       gb.name       gb.ohlc       gb.quantile   gb.size       gb.tail       gb.weight

    6 GroupBy with MultiIndex

    With hierarchically-indexed data, it’s quite natural to group by one of the levels of the hierarchy.

    Let’s create a Series with a two-level MultiIndex.

    In [40]: arrays = [
       ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
       ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
       ....: ]
       ....: 
    
    In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
    
    In [42]: s = pd.Series(np.random.randn(8), index=index)
    
    In [43]: s
    Out[43]: 
    first  second
    bar    one      -0.919854
           two      -0.042379
    baz    one       1.247642
           two      -0.009920
    foo    one       0.290213
           two       0.495767
    qux    one       0.362949
           two       1.548106
    dtype: float64

    We can then group by one of the levels in s.

    In [44]: grouped = s.groupby(level=0)
    
    In [45]: grouped.sum()
    Out[45]: 
    first
    bar   -0.962232
    baz    1.237723
    foo    0.785980
    qux    1.911055
    dtype: float64

    If the MultiIndex has names specified, these can be passed instead of the level number:

    In [46]: s.groupby(level="second").sum()
    Out[46]: 
    second
    one    0.980950
    two    1.991575
    dtype: float64

    In older pandas versions, aggregation functions such as sum accept the level parameter directly (this parameter was deprecated in pandas 1.3 and removed in 2.0). The resulting index is named according to the chosen level:

    In [47]: s.sum(level="second")
    Out[47]: 
    second
    one    0.980950
    two    1.991575
    dtype: float64
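
    On pandas versions where sum no longer accepts level, the same result can be obtained by grouping on the level first. A minimal sketch with made-up values in place of the random data:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]],
    names=["first", "second"],
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Group on the named index level, then aggregate.
by_second = s.groupby(level="second").sum()
print(by_second)
```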

    Grouping with multiple levels is supported. Here s is a Series with a three-level MultiIndex:

    In [48]: s
    Out[48]: 
    first  second  third
    bar    doo     one     -1.131345
                   two     -0.089329
    baz    bee     one      0.337863
                   two     -0.945867
    foo    bop     one     -0.932132
                   two      1.956030
    qux    bop     one      0.017587
                   two     -0.016692
    dtype: float64
    
    In [49]: s.groupby(level=["first", "second"]).sum()
    Out[49]: 
    first  second
    bar    doo      -1.220674
    baz    bee      -0.608004
    foo    bop       1.023898
    qux    bop       0.000895
    dtype: float64

    Index level names may be supplied as keys.

    In [50]: s.groupby(["first", "second"]).sum()
    Out[50]: 
    first  second
    bar    doo      -1.220674
    baz    bee      -0.608004
    foo    bop       1.023898
    qux    bop       0.000895
    dtype: float64

    More on the sum function and aggregation later.

    7 Grouping DataFrame with Index levels and columns

    A DataFrame may be grouped by a combination of columns and index levels by specifying the column names as strings and the index levels as pd.Grouper objects.

    In [51]: arrays = [
       ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
       ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
       ....: ]
       ....: 
    
    In [52]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
    
    In [53]: df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3], "B": np.arange(8)}, index=index)
    
    In [54]: df
    Out[54]: 
                  A  B
    first second      
    bar   one     1  0
          two     1  1
    baz   one     1  2
          two     1  3
    foo   one     2  4
          two     2  5
    qux   one     3  6
          two     3  7

    The following example groups df by the second index level and the A column.

    In [55]: df.groupby([pd.Grouper(level=1), "A"]).sum()
    Out[55]: 
              B
    second A   
    one    1  2
           2  4
           3  6
    two    1  4
           2  5
           3  7

    Index levels may also be specified by name.

    In [56]: df.groupby([pd.Grouper(level="second"), "A"]).sum()
    Out[56]: 
              B
    second A   
    one    1  2
           2  4
           3  6
    two    1  4
           2  5
           3  7

    Index level names may be specified as keys directly to groupby.

    In [57]: df.groupby(["second", "A"]).sum()
    Out[57]: 
              B
    second A   
    one    1  2
           2  4
           3  6
    two    1  4
           2  5
           3  7

    8 DataFrame column selection in GroupBy

    Once you have created the GroupBy object from a DataFrame, you might want to do something different for each of the columns. To do so, index the GroupBy object with [], just as you would to get a column from a DataFrame (df below is the earlier DataFrame with columns A, B, C and D):

    In [58]: grouped = df.groupby(["A"])
    
    In [59]: grouped_C = grouped["C"]
    
    In [60]: grouped_D = grouped["D"]

    This is mainly syntactic sugar for the much more verbose alternative:

    In [61]: df["C"].groupby(df["A"])
    Out[61]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fc1ed01ca90>

    Additionally, this method avoids recomputing the internal grouping information derived from the passed key.
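
    A minimal sketch of column selection in use, with made-up integer data in place of the random columns:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4], "D": [10, 20, 30, 40]}
)
grouped = df.groupby("A")

# Different aggregations on different columns, reusing one GroupBy object.
c_sums = grouped["C"].sum()
d_means = grouped["D"].mean()

# The verbose alternative gives the same result for column C.
verbose_c_sums = df["C"].groupby(df["A"]).sum()

print(c_sums)
print(d_means)
```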

  • Original article: https://www.cnblogs.com/qiu-hua/p/14873431.html