
    Data aggregation and group operations in pandas

    After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way.

    import pandas as pd
    
    import numpy as np
    
    import matplotlib.pyplot as plt
    

    Groupby mechanics

    split-apply-combine:
    In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).
    Once the first step is done, a function is applied to each group, producing a new value.
    Finally, the results of all those function applications are combined into a result object.
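
    As a rough sketch of these three stages (using a small throwaway DataFrame here, since the df used in the rest of this post is only defined further below), the manual version and the groupby one-liner give the same result:

    demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1., 2., 3., 4.]})

    # 1. split: one chunk per unique key
    pieces = {k: chunk for k, chunk in demo.groupby('key')}
    # 2. apply: reduce each chunk to a single value
    applied = {k: chunk['val'].mean() for k, chunk in pieces.items()}
    # 3. combine: glue the per-group results back into one Series
    combined = pd.Series(applied)

    combined.equals(demo.groupby('key')['val'].mean())  # True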

    help(pd.Series.groupby)
    
    Help on function groupby in module pandas.core.generic:
    
    groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
        Group series using mapper (dict or key function, apply given function
        to group, return result as series) or by a series of columns.
        
        Parameters
        ----------
        by : mapping function / list of functions, dict, Series, or tuple /
            list of column names.
            Called on each element of the object index to determine the groups.
            If a dict or Series is passed, the Series or dict VALUES will be
            used to determine the groups
        axis : int, default 0
        level : int, level name, or sequence of such, default None
            If the axis is a MultiIndex (hierarchical), group by a particular
            level or levels
        as_index : boolean, default True
            For aggregated output, return object with group labels as the
            index. Only relevant for DataFrame input. as_index=False is
            effectively "SQL-style" grouped output
        sort : boolean, default True
            Sort group keys. Get better performance by turning this off.
            Note this does not influence the order of observations within each
            group.  groupby preserves the order of rows within each group.
        group_keys : boolean, default True
            When calling apply, add group keys to index to identify pieces
        squeeze : boolean, default False
            reduce the dimensionality of the return type if possible,
            otherwise return a consistent type
        
        Examples
        --------
        DataFrame results
        
        >>> data.groupby(func, axis=0).mean()
        >>> data.groupby(['col1', 'col2'])['col3'].mean()
        
        DataFrame with hierarchical index
        
        >>> data.groupby(['col1', 'col2']).mean()
        
        Returns
        -------
        GroupBy object
    
    df=pd.DataFrame({'key1':['a','a','b','b','a'],
                    'key2':['one','two','one','two','one'],
                    'data1':np.random.randn(5),
                    'data2':np.random.randn(5)});df
    
          data1     data2 key1 key2
    0 -0.288087 -1.355619    a  one
    1  0.804061  0.654039    a  two
    2 -1.672009  1.780028    b  one
    3  0.444115  0.988676    b  two
    4 -1.110942  0.581883    a  one
    • The first method:
      Suppose you wanted to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1:
    grouped=df['data1'].groupby(df['key1'])
    
    help(grouped)
    
    Help on SeriesGroupBy in module pandas.core.groupby object:
    
    class SeriesGroupBy(GroupBy)
     |  Class for grouping and aggregating relational data. See aggregate,
     |  transform, and apply functions on this object.
     |  
     |  It's easiest to use obj.groupby(...) to use GroupBy, but you can also do:
     |  
     |  ::
     |  
     |      grouped = groupby(obj, ...)
     |  
     |  Parameters
     |  ----------
     |  obj : pandas object
     |  axis : int, default 0
     |  level : int, default None
     |      Level of MultiIndex
     |  groupings : list of Grouping objects
     |      Most users should ignore this
     |  exclusions : array-like, optional
     |      List of columns to exclude
     |  name : string
     |      Most users should ignore this
     |  
     |  Notes
     |  -----
     |  After grouping, see aggregate, apply, and transform functions. Here are
     |  some other brief notes about usage. When grouping by multiple groups, the
     |  result index will be a MultiIndex (hierarchical) by default.
     |  
     |  Iteration produces (key, group) tuples, i.e. chunking the data by group. So
     |  you can write code like:
     |  
     |  ::
     |  
     |      grouped = obj.groupby(keys, axis=axis)
     |      for key, group in grouped:
     |          # do something with the data
     |  
     |  Function calls on GroupBy, if not specially implemented, "dispatch" to the
     |  grouped data. So if you group a DataFrame and wish to invoke the std()
     |  method on each group, you can simply do:
     |  
     |  ::
     |  
     |      df.groupby(mapper).std()
     |  
     |  rather than
     |  
     |  ::
     |  
     |      df.groupby(mapper).aggregate(np.std)
     |  
     |  You can pass arguments to these "wrapped" functions, too.
     |  
     |  See the online documentation for full exposition on these topics and much
     |  more
     |  
     |  Returns
     |  -------
     |  **Attributes**
     |  groups : dict
     |      {group name -> group labels}
     |  len(grouped) : int
     |      Number of groups
     |  
     |  Method resolution order:
     |      SeriesGroupBy
     |      GroupBy
     |      _GroupBy
     |      pandas.core.base.PandasObject
     |      pandas.core.base.StringMixin
     |      pandas.core.base.SelectionMixin
     |      builtins.object
     |  
     |  Methods defined here:
     |  
     |  agg = aggregate(self, func_or_funcs, *args, **kwargs)
     |  
     |  aggregate(self, func_or_funcs, *args, **kwargs)
     |      Apply aggregation function or functions to groups, yielding most likely
     |      Series but in some cases DataFrame depending on the output of the
     |      aggregation function
     |      
     |      Parameters
     |      ----------
     |      func_or_funcs : function or list / dict of functions
     |          List/dict of functions will produce DataFrame with column names
     |          determined by the function names themselves (list) or the keys in
     |          the dict
     |      
     |      Notes
     |      -----
     |      agg is an alias for aggregate. Use it.
     |      
     |      Examples
     |      --------
     |      >>> series
     |      bar    1.0
     |      baz    2.0
     |      qot    3.0
     |      qux    4.0
     |      
     |      >>> mapper = lambda x: x[0] # first letter
     |      >>> grouped = series.groupby(mapper)
     |      
     |      >>> grouped.aggregate(np.sum)
     |      b    3.0
     |      q    7.0
     |      
     |      >>> grouped.aggregate([np.sum, np.mean, np.std])
     |         mean  std  sum
     |      b  1.5   0.5  3
     |      q  3.5   0.5  7
     |      
     |      >>> grouped.agg({'result' : lambda x: x.mean() / x.std(),
     |      ...              'total' : np.sum})
     |         result  total
     |      b  2.121   3
     |      q  4.95    7
     |      
     |      See also
     |      --------
     |      apply, transform
     |      
     |      Returns
     |      -------
     |      Series or DataFrame
     |  
     |  count(self)
     |      Compute count of group, excluding missing values
     |  
     |  filter(self, func, dropna=True, *args, **kwargs)
     |      Return a copy of a Series excluding elements from groups that
     |      do not satisfy the boolean criterion specified by func.
     |      
     |      Parameters
     |      ----------
     |      func : function
     |          To apply to each group. Should return True or False.
     |      dropna : Drop groups that do not pass the filter. True by default;
     |          if False, groups that evaluate False are filled with NaNs.
     |      
     |      Examples
     |      --------
     |      >>> grouped.filter(lambda x: x.mean() > 0)
     |      
     |      Returns
     |      -------
     |      filtered : Series
     |  
     |  nlargest(self, n=5, keep='first')
     |      Return the largest `n` elements.
     |      
     |      Parameters
     |      ----------
     |      n : int
     |          Return this many descending sorted values
     |      keep : {'first', 'last', False}, default 'first'
     |          Where there are duplicate values:
     |          - ``first`` : take the first occurrence.
     |          - ``last`` : take the last occurrence.
     |      take_last : deprecated
     |      
     |      Returns
     |      -------
     |      top_n : Series
     |          The n largest values in the Series, in sorted order
     |      
     |      Notes
     |      -----
     |      Faster than ``.sort_values(ascending=False).head(n)`` for small `n`
     |      relative to the size of the ``Series`` object.
     |      
     |      See Also
     |      --------
     |      Series.nsmallest
     |      
     |      Examples
     |      --------
     |      >>> import pandas as pd
     |      >>> import numpy as np
     |      >>> s = pd.Series(np.random.randn(1e6))
     |      >>> s.nlargest(10)  # only sorts up to the N requested
     |  
     |  nsmallest(self, n=5, keep='first')
     |      Return the smallest `n` elements.
     |      
     |      Parameters
     |      ----------
     |      n : int
     |          Return this many ascending sorted values
     |      keep : {'first', 'last', False}, default 'first'
     |          Where there are duplicate values:
     |          - ``first`` : take the first occurrence.
     |          - ``last`` : take the last occurrence.
     |      take_last : deprecated
     |      
     |      Returns
     |      -------
     |      bottom_n : Series
     |          The n smallest values in the Series, in sorted order
     |      
     |      Notes
     |      -----
     |      Faster than ``.sort_values().head(n)`` for small `n` relative to
     |      the size of the ``Series`` object.
     |      
     |      See Also
     |      --------
     |      Series.nlargest
     |      
     |      Examples
     |      --------
     |      >>> import pandas as pd
     |      >>> import numpy as np
     |      >>> s = pd.Series(np.random.randn(1e6))
     |      >>> s.nsmallest(10)  # only sorts up to the N requested
     |  
     |  nunique(self, dropna=True)
     |      Returns number of unique elements in the group
     |  
     |  transform(self, func, *args, **kwargs)
     |      Call function producing a like-indexed Series on each group and return
     |      a Series with the transformed values
     |      
     |      Parameters
     |      ----------
     |      func : function
     |          To apply to each group. Should return a Series with the same index
     |      
     |      Examples
     |      --------
     |      >>> grouped.transform(lambda x: (x - x.mean()) / x.std())
     |      
     |      Returns
     |      -------
     |      transformed : Series
     |  
     |  value_counts(self, normalize=False, sort=True, ascending=False, bins=None, dropna=True)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  all
     |      
     |      Return whether all elements are True over requested axis
     |      
     |      Parameters
     |      ----------
     |      axis : {index (0)}
     |      skipna : boolean, default True
     |          Exclude NA/null values. If an entire row/column is NA, the result
     |          will be NA
     |      level : int or level name, default None
     |          If the axis is a MultiIndex (hierarchical), count along a
     |          particular level, collapsing into a scalar
     |      bool_only : boolean, default None
     |          Include only boolean columns. If None, will attempt to use everything,
     |          then use only boolean data. Not implemented for Series.
     |      
     |      Returns
     |      -------
     |      all : scalar or Series (if level specified)
     |  
     |  any
     |      
     |      Return whether any element is True over requested axis
     |      
     |      Parameters
     |      ----------
     |      axis : {index (0)}
     |      skipna : boolean, default True
     |          Exclude NA/null values. If an entire row/column is NA, the result
     |          will be NA
     |      level : int or level name, default None
     |          If the axis is a MultiIndex (hierarchical), count along a
     |          particular level, collapsing into a scalar
     |      bool_only : boolean, default None
     |          Include only boolean columns. If None, will attempt to use everything,
     |          then use only boolean data. Not implemented for Series.
     |      
     |      Returns
     |      -------
     |      any : scalar or Series (if level specified)
     |  
     |  corr
     |      Compute correlation with `other` Series, excluding missing values
     |      
     |      Parameters
     |      ----------
     |      other : Series
     |      method : {'pearson', 'kendall', 'spearman'}
     |          * pearson : standard correlation coefficient
     |          * kendall : Kendall Tau correlation coefficient
     |          * spearman : Spearman rank correlation
     |      min_periods : int, optional
     |          Minimum number of observations needed to have a valid result
     |      
     |      
     |      Returns
     |      -------
     |      correlation : float
     |  
     |  cov
     |      Compute covariance with Series, excluding missing values
     |      
     |      Parameters
     |      ----------
     |      other : Series
     |      min_periods : int, optional
     |          Minimum number of observations needed to have a valid result
     |      
     |      Returns
     |      -------
     |      covariance : float
     |      
     |      Normalized by N-1 (unbiased estimator).
     |  
     |  cummax
     |          Return cumulative max over requested axis.
     |      
     |      Parameters
     |      ----------
     |      axis : {index (0)}
     |      skipna : boolean, default True
     |          Exclude NA/null values. If an entire row/column is NA, the result
     |          will be NA
     |      
     |      Returns
     |      -------
     |      cummax : scalar
     |  
     |  cummin
     |          Return cumulative minimum over requested axis.
     |      
     |      Parameters
     |      ----------
     |      axis : {index (0)}
     |      skipna : boolean, default True
     |          Exclude NA/null values. If an entire row/column is NA, the result
     |          will be NA
     |      
     |      Returns
     |      -------
     |      cummin : scalar
     |  
     |  describe
     |      Generate various summary statistics, excluding NaN values.
     |      
     |      Parameters
     |      ----------
     |      percentiles : array-like, optional
     |          The percentiles to include in the output. Should all
     |          be in the interval [0, 1]. By default `percentiles` is
     |          [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
     |      include, exclude : list-like, 'all', or None (default)
     |          Specify the form of the returned result. Either:
     |      
     |          - None to both (default). The result will include only
     |            numeric-typed columns or, if none are, only categorical columns.
     |          - A list of dtypes or strings to be included/excluded.
     |            To select all numeric types use numpy numpy.number. To select
     |            categorical objects use type object. See also the select_dtypes
     |            documentation. eg. df.describe(include=['O'])
     |          - If include is the string 'all', the output column-set will
     |            match the input one.
     |      
     |      Returns
     |      -------
     |      summary: NDFrame of summary statistics
     |      
     |      Notes
     |      -----
     |      The output DataFrame index depends on the requested dtypes:
     |      
     |      For numeric dtypes, it will include: count, mean, std, min,
     |      max, and lower, 50, and upper percentiles.
     |      
     |      For object dtypes (e.g. timestamps or strings), the index
     |      will include the count, unique, most common, and frequency of the
     |      most common. Timestamps also include the first and last items.
     |      
     |      For mixed dtypes, the index will be the union of the corresponding
     |      output types. Non-applicable entries will be filled with NaN.
     |      Note that mixed-dtype outputs can only be returned from mixed-dtype
     |      inputs and appropriate use of the include/exclude arguments.
     |      
     |      If multiple values have the highest count, then the
     |      `count` and `most common` pair will be arbitrarily chosen from
     |      among those with the highest count.
     |      
     |      The include, exclude arguments are ignored for Series.
     |      
     |      See Also
     |      --------
     |      DataFrame.select_dtypes
     |  
     |  diff
     |      1st discrete difference of object
     |      
     |      Parameters
     |      ----------
     |      periods : int, default 1
     |          Periods to shift for forming difference
     |      
     |      Returns
     |      -------
     |      diffed : Series
     |  
     |  dtype
     |      return the dtype object of the underlying data
     |  
     |  fillna
     |      Fill NA/NaN values using the specified method
     |      
     |      Parameters
     |      ----------
     |      value : scalar, dict, Series, or DataFrame
     |          Value to use to fill holes (e.g. 0), alternately a
     |          dict/Series/DataFrame of values specifying which value to use for
     |          each index (for a Series) or column (for a DataFrame). (values not
     |          in the dict/Series/DataFrame will not be filled). This value cannot
     |          be a list.
     |      method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
     |          Method to use for filling holes in reindexed Series
     |          pad / ffill: propagate last valid observation forward to next valid
     |          backfill / bfill: use NEXT valid observation to fill gap
     |      axis : {0, 'index'}
     |      inplace : boolean, default False
     |          If True, fill in place. Note: this will modify any
     |          other views on this object, (e.g. a no-copy slice for a column in a
     |          DataFrame).
     |      limit : int, default None
     |          If method is specified, this is the maximum number of consecutive
     |          NaN values to forward/backward fill. In other words, if there is
     |          a gap with more than this number of consecutive NaNs, it will only
     |          be partially filled. If method is not specified, this is the
     |          maximum number of entries along the entire axis where NaNs will be
     |          filled.
     |      downcast : dict, default is None
     |          a dict of item->dtype of what to downcast if possible,
     |          or the string 'infer' which will try to downcast to an appropriate
     |          equal type (e.g. float64 to int64 if possible)
     |      
     |      See Also
     |      --------
     |      reindex, asfreq
     |      
     |      Returns
     |      -------
     |      filled : Series
     |  
     |  hist
     |      Draw histogram of the input series using matplotlib
     |      
     |      Parameters
     |      ----------
     |      by : object, optional
     |          If passed, then used to form histograms for separate groups
     |      ax : matplotlib axis object
     |          If not passed, uses gca()
     |      grid : boolean, default True
     |          Whether to show axis grid lines
     |      xlabelsize : int, default None
     |          If specified changes the x-axis label size
     |      xrot : float, default None
     |          rotation of x axis labels
     |      ylabelsize : int, default None
     |          If specified changes the y-axis label size
     |      yrot : float, default None
     |          rotation of y axis labels
     |      figsize : tuple, default None
     |          figure size in inches by default
     |      bins: integer, default 10
     |          Number of histogram bins to be used
     |      kwds : keywords
     |          To be passed to the actual plotting function
     |      
     |      Notes
     |      -----
     |      See matplotlib documentation online for more on this
     |  
     |  idxmax
     |      Index of first occurrence of maximum of values.
     |      
     |      Parameters
     |      ----------
     |      skipna : boolean, default True
     |          Exclude NA/null values
     |      
     |      Returns
     |      -------
     |      idxmax : Index of maximum of values
     |      
     |      Notes
     |      -----
     |      This method is the Series version of ``ndarray.argmax``.
     |      
     |      See Also
     |      --------
     |      DataFrame.idxmax
     |      numpy.ndarray.argmax
     |  
     |  idxmin
     |      Index of first occurrence of minimum of values.
     |      
     |      Parameters
     |      ----------
     |      skipna : boolean, default True
     |          Exclude NA/null values
     |      
     |      Returns
     |      -------
     |      idxmin : Index of minimum of values
     |      
     |      Notes
     |      -----
     |      This method is the Series version of ``ndarray.argmin``.
     |      
     |      See Also
     |      --------
     |      DataFrame.idxmin
     |      numpy.ndarray.argmin
     |  
     |  mad
     |      
     |      Return the mean absolute deviation of the values for the requested axis
     |      
     |      Parameters
     |      ----------
     |      axis : {index (0)}
     |      skipna : boolean, default True
     |          Exclude NA/null values. If an entire row/column is NA, the result
     |          will be NA
     |      level : int or level name, default None
     |          If the axis is a MultiIndex (hierarchical), count along a
     |          particular level, collapsing into a scalar
     |      numeric_only : boolean, default None
     |          Include only float, int, boolean columns. If None, will attempt to use
     |          everything, then use only numeric data. Not implemented for Series.
     |      
     |      Returns
     |      -------
     |      mad : scalar or Series (if level specified)
     |  
     |  name
     |      since we are a series, we by definition only have
     |      a single name, but may be the result of a selection or
     |      the name of our object
     |  
     |  pct_change
     |      Percent change over given number of periods.
     |      
     |      Parameters
     |      ----------
     |      periods : int, default 1
     |          Periods to shift for forming percent change
     |      fill_method : str, default 'pad'
     |          How to handle NAs before computing percent changes
     |      limit : int, default None
     |          The number of consecutive NAs to fill before stopping
     |      freq : DateOffset, timedelta, or offset alias string, optional
     |          Increment to use from time series API (e.g. 'M' or BDay())
     |      
     |      Returns
     |      -------
     |      chg : NDFrame
     |      
     |      Notes
     |      -----
     |      
     |      By default, the percentage change is calculated along the stat
     |      axis: 0, or ``Index``, for ``DataFrame`` and 1, or ``minor`` for
     |      ``Panel``. You can change this with the ``axis`` keyword argument.
     |  
     |  quantile
     |      Return value at the given quantile, a la numpy.percentile.
     |      
     |      Parameters
     |      ----------
     |      q : float or array-like, default 0.5 (50% quantile)
     |          0 <= q <= 1, the quantile(s) to compute
     |      interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
     |          .. versionadded:: 0.18.0
     |      
     |          This optional parameter specifies the interpolation method to use,
     |          when the desired quantile lies between two data points `i` and `j`:
     |      
     |              * linear: `i + (j - i) * fraction`, where `fraction` is the
     |                fractional part of the index surrounded by `i` and `j`.
     |              * lower: `i`.
     |              * higher: `j`.
     |              * nearest: `i` or `j` whichever is nearest.
     |              * midpoint: (`i` + `j`) / 2.
     |      
     |      Returns
     |      -------
     |      quantile : float or Series
     |          if ``q`` is an array, a Series will be returned where the
     |          index is ``q`` and the values are the quantiles.
     |      
     |      Examples
     |      --------
     |      >>> s = Series([1, 2, 3, 4])
     |      >>> s.quantile(.5)
     |      2.5
     |      >>> s.quantile([.25, .5, .75])
     |      0.25    1.75
     |      0.50    2.50
     |      0.75    3.25
     |      dtype: float64
     |  
     |  rank
     |      Compute numerical data ranks (1 through n) along axis. Equal values are
     |      assigned a rank that is the average of the ranks of those values
     |      
     |      Parameters
     |      ----------
     |      axis: {0 or 'index', 1 or 'columns'}, default 0
     |          index to direct ranking
     |      method : {'average', 'min', 'max', 'first', 'dense'}
     |          * average: average rank of group
     |          * min: lowest rank in group
     |          * max: highest rank in group
     |          * first: ranks assigned in order they appear in the array
     |          * dense: like 'min', but rank always increases by 1 between groups
     |      numeric_only : boolean, default None
     |          Include only float, int, boolean data. Valid only for DataFrame or
     |          Panel objects
     |      na_option : {'keep', 'top', 'bottom'}
     |          * keep: leave NA values where they are
     |          * top: smallest rank if ascending
     |          * bottom: smallest rank if descending
     |      ascending : boolean, default True
     |          False for ranks by high (1) to low (N)
     |      pct : boolean, default False
     |          Computes percentage rank of data
     |      
     |      Returns
     |      -------
     |      ranks : same type as caller
     |  
     |  skew
     |      
     |      Return unbiased skew over requested axis
     |      Normalized by N-1
     |      
     |      Parameters
     |      ----------
     |      axis : {index (0)}
     |      skipna : boolean, default True
     |          Exclude NA/null values. If an entire row/column is NA, the result
     |          will be NA
     |      level : int or level name, default None
     |          If the axis is a MultiIndex (hierarchical), count along a
     |          particular level, collapsing into a scalar
     |      numeric_only : boolean, default None
     |          Include only float, int, boolean columns. If None, will attempt to use
     |          everything, then use only numeric data. Not implemented for Series.
     |      
     |      Returns
     |      -------
     |      skew : scalar or Series (if level specified)
     |  
     |  take
     |      return Series corresponding to requested indices
     |      
     |      Parameters
     |      ----------
     |      indices : list / array of ints
     |      convert : translate negative to positive indices (default)
     |      
     |      Returns
     |      -------
     |      taken : Series
     |      
     |      See also
     |      --------
     |      numpy.ndarray.take
     |  
     |  tshift
     |      Shift the time index, using the index's frequency if available.
     |      
     |      Parameters
     |      ----------
     |      periods : int
     |          Number of periods to move, can be positive or negative
     |      freq : DateOffset, timedelta, or time rule string, default None
     |          Increment to use from the tseries module or time rule (e.g. 'EOM')
     |      axis : int or basestring
     |          Corresponds to the axis that contains the Index
     |      
     |      Notes
     |      -----
     |      If freq is not specified then tries to use the freq or inferred_freq
     |      attributes of the index. If neither of those attributes exist, a
     |      ValueError is thrown
     |      
     |      Returns
     |      -------
     |      shifted : NDFrame
     |  
     |  unique
     |      Return np.ndarray of unique values in the object.
     |      Significantly faster than numpy.unique. Includes NA values.
     |      The order of the original is preserved.
     |      
     |      Returns
     |      -------
     |      uniques : np.ndarray
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from GroupBy:
     |  
     |  backfill(self, limit=None)
     |      Backward fill the values
     |      
     |      Parameters
     |      ----------
     |      limit : integer, optional
     |          limit of how many values to fill
     |      
     |      See Also
     |      --------
     |      Series.fillna
     |      DataFrame.fillna
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  bfill = backfill(self, limit=None)
     |  
     |  cumcount(self, ascending=True)
     |      Number each item in each group from 0 to the length of that group - 1.
     |      
     |      Essentially this is equivalent to
     |      
     |      >>> self.apply(lambda x: Series(np.arange(len(x)), x.index))
     |      
     |      Parameters
     |      ----------
     |      ascending : bool, default True
     |          If False, number in reverse, from length of group - 1 to 0.
     |      
     |      Examples
     |      --------
     |      
     |      >>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],
     |      ...                   columns=['A'])
     |      >>> df
     |         A
     |      0  a
     |      1  a
     |      2  a
     |      3  b
     |      4  b
     |      5  a
     |      >>> df.groupby('A').cumcount()
     |      0    0
     |      1    1
     |      2    2
     |      3    0
     |      4    1
     |      5    3
     |      dtype: int64
     |      >>> df.groupby('A').cumcount(ascending=False)
     |      0    3
     |      1    2
     |      2    1
     |      3    1
     |      4    0
     |      5    0
     |      dtype: int64
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  cumprod(self, axis=0, *args, **kwargs)
     |      Cumulative product for each group
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  cumsum(self, axis=0, *args, **kwargs)
     |      Cumulative sum for each group
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  expanding(self, *args, **kwargs)
     |      Return an expanding grouper, providing expanding
     |      functionaility per group
     |      
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  ffill = pad(self, limit=None)
     |  
     |  first = f(self)
     |      Compute first of group values
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  head(self, n=5)
     |      Returns first n rows of each group.
     |      
     |      Essentially equivalent to ``.apply(lambda x: x.head(n))``,
     |      except ignores as_index flag.
     |      
     |      Examples
     |      --------
     |      
     |      >>> df = DataFrame([[1, 2], [1, 4], [5, 6]],
     |                         columns=['A', 'B'])
     |      >>> df.groupby('A', as_index=False).head(1)
     |         A  B
     |      0  1  2
     |      2  5  6
     |      >>> df.groupby('A').head(1)
     |         A  B
     |      0  1  2
     |      2  5  6
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  irow(self, i)
     |      DEPRECATED. Use ``.nth(i)`` instead
     |  
     |  last = f(self)
     |      Compute last of group values
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  max = f(self)
     |      Compute max of group values
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  mean(self, *args, **kwargs)
     |      Compute mean of groups, excluding missing values
     |      
     |      For multiple groupings, the result index will be a MultiIndex
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  median(self)
     |      Compute median of groups, excluding missing values
     |      
     |      For multiple groupings, the result index will be a MultiIndex
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  min = f(self)
     |      Compute min of group values
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  nth(self, n, dropna=None)
     |      Take the nth row from each group if n is an int, or a subset of rows
     |      if n is a list of ints.
     |      
     |      If dropna, will take the nth non-null row, dropna is either
     |      Truthy (if a Series) or 'all', 'any' (if a DataFrame);
     |      this is equivalent to calling dropna(how=dropna) before the
     |      groupby.
     |      
     |      Parameters
     |      ----------
     |      n : int or list of ints
     |          a single nth value for the row or a list of nth values
     |      dropna : None or str, optional
     |          apply the specified dropna operation before counting which row is
     |          the nth row. Needs to be None, 'any' or 'all'
     |      
     |      Examples
     |      --------
     |      
     |      >>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
     |      ...                    'B': [np.nan, 2, 3, 4, 5]}, columns=['A', 'B'])
     |      >>> g = df.groupby('A')
     |      >>> g.nth(0)
     |           B
     |      A
     |      1  NaN
     |      2  3.0
     |      >>> g.nth(1)
     |           B
     |      A
     |      1  2.0
     |      2  5.0
     |      >>> g.nth(-1)
     |           B
     |      A
     |      1  4.0
     |      2  5.0
     |      >>> g.nth([0, 1])
     |           B
     |      A
     |      1  NaN
     |      1  2.0
     |      2  3.0
     |      2  5.0
     |      
     |      Specifying ``dropna`` allows count ignoring NaN
     |      
     |      >>> g.nth(0, dropna='any')
     |           B
     |      A
     |      1  2.0
     |      2  3.0
     |      
     |      NaNs denote group exhausted when using dropna
     |      
     |      >>> g.nth(3, dropna='any')
     |          B
     |      A
     |      1 NaN
     |      2 NaN
     |      
     |      Specifying ``as_index=False`` in ``groupby`` keeps the original index.
     |      
     |      >>> df.groupby('A', as_index=False).nth(1)
     |         A    B
     |      1  1  2.0
     |      4  2  5.0
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  ohlc(self)
     |      Compute sum of values, excluding missing values
     |      For multiple groupings, the result index will be a MultiIndex
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  pad(self, limit=None)
     |      Forward fill the values
     |      
     |      Parameters
     |      ----------
     |      limit : integer, optional
     |          limit of how many values to fill
     |      
     |      See Also
     |      --------
     |      Series.fillna
     |      DataFrame.fillna
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  prod = f(self)
     |      Compute prod of group values
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  resample(self, rule, *args, **kwargs)
     |      Provide resampling when using a TimeGrouper
     |      Return a new grouper with our resampler appended
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  rolling(self, *args, **kwargs)
     |      Return a rolling grouper, providing rolling
     |      functionaility per group
     |      
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  sem(self, ddof=1)
     |      Compute standard error of the mean of groups, excluding missing values
     |      
     |      For multiple groupings, the result index will be a MultiIndex
     |      
     |      Parameters
     |      ----------
     |      ddof : integer, default 1
     |          degrees of freedom
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  shift(self, periods=1, freq=None, axis=0)
     |      Shift each group by periods observations
     |      
     |      Parameters
     |      ----------
     |      periods : integer, default 1
     |          number of periods to shift
     |      freq : frequency string
     |      axis : axis to shift, default 0
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  size(self)
     |      Compute group sizes
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  std(self, ddof=1, *args, **kwargs)
     |      Compute standard deviation of groups, excluding missing values
     |      
     |      For multiple groupings, the result index will be a MultiIndex
     |      
     |      Parameters
     |      ----------
     |      ddof : integer, default 1
     |          degrees of freedom
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  sum = f(self)
     |      Compute sum of group values
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  tail(self, n=5)
     |      Returns last n rows of each group
     |      
     |      Essentially equivalent to ``.apply(lambda x: x.tail(n))``,
     |      except ignores as_index flag.
     |      
     |      Examples
     |      --------
     |      
     |      >>> df = DataFrame([['a', 1], ['a', 2], ['b', 1], ['b', 2]],
     |                         columns=['A', 'B'])
     |      >>> df.groupby('A').tail(1)
     |         A  B
     |      1  a  2
     |      3  b  2
     |      >>> df.groupby('A').head(1)
     |         A  B
     |      0  a  1
     |      2  b  1
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  var(self, ddof=1, *args, **kwargs)
     |      Compute variance of groups, excluding missing values
     |      
     |      For multiple groupings, the result index will be a MultiIndex
     |      
     |      Parameters
     |      ----------
     |      ddof : integer, default 1
     |          degrees of freedom
     |      
     |      
     |      See also
     |      --------
     |      pandas.Series.groupby
     |      pandas.DataFrame.groupby
     |      pandas.Panel.groupby
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from _GroupBy:
     |  
     |  __getattr__(self, attr)
     |  
     |  __init__(self, obj, keys=None, axis=0, level=None, grouper=None, exclusions=None, selection=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  __iter__(self)
     |      Groupby iterator
     |      
     |      Returns
     |      -------
     |      Generator yielding sequence of (name, subsetted object)
     |      for each group
     |  
     |  __len__(self)
     |  
     |  __unicode__(self)
     |      Return a string representation for a particular object.
     |      
     |      Invoked by unicode(obj) in py2 only. Yields a Unicode String in both
     |      py2/py3.
     |  
     |  apply(self, func, *args, **kwargs)
     |      Apply function and combine results together in an intelligent way. The
     |      split-apply-combine combination rules attempt to be as common sense
     |      based as possible. For example:
     |      
     |      case 1:
     |      group DataFrame
     |      apply aggregation function (f(chunk) -> Series)
     |      yield DataFrame, with group axis having group labels
     |      
     |      case 2:
     |      group DataFrame
     |      apply transform function ((f(chunk) -> DataFrame with same indexes)
     |      yield DataFrame with resulting chunks glued together
     |      
     |      case 3:
     |      group Series
     |      apply function with f(chunk) -> DataFrame
     |      yield DataFrame with result of chunks glued together
     |      
     |      Parameters
     |      ----------
     |      func : function
     |      
     |      Notes
     |      -----
     |      See online documentation for full exposition on how to use apply.
     |      
     |      In the current implementation apply calls func twice on the
     |      first group to decide whether it can take a fast or slow code
     |      path. This can lead to unexpected behavior if func has
     |      side-effects, as they will take effect twice for the first
     |      group.
     |      
     |      
     |      See also
     |      --------
     |      aggregate, transform
     |  
     |  get_group(self, name, obj=None)
     |      Constructs NDFrame from group with provided name
     |      
     |      Parameters
     |      ----------
     |      name : object
     |          the name of the group to get as a DataFrame
     |      obj : NDFrame, default None
     |          the NDFrame to take the DataFrame out of.  If
     |          it is None, the object groupby was called on will
     |          be used
     |      
     |      Returns
     |      -------
     |      group : type of obj
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from _GroupBy:
     |  
     |  groups
     |      dict {group name -> group labels}
     |  
     |  indices
     |      dict {group name -> group indices}
     |  
     |  ngroups
     |  
     |  plot
     |      Class implementing the .plot attribute for groupby objects
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from pandas.core.base.PandasObject:
     |  
     |  __dir__(self)
     |      Provide method name lookup and completion
     |      Only provide 'public' methods
     |  
     |  __sizeof__(self)
     |      Generates the total memory usage for a object that returns
     |      either a value or Series of values
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from pandas.core.base.StringMixin:
     |  
     |  __bytes__(self)
     |      Return a string representation for a particular object.
     |      
     |      Invoked by bytes(obj) in py3 only.
     |      Yields a bytestring in both py2/py3.
     |  
     |  __repr__(self)
     |      Return a string representation for a particular object.
     |      
     |      Yields Bytestring in Py2, Unicode String in py3.
     |  
     |  __str__(self)
     |      Return a string representation for a particular Object
     |      
     |      Invoked by str(df) in both py2/py3.
     |      Yields Bytestring in Py2, Unicode String in py3.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from pandas.core.base.StringMixin:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  ----------------------------------------------------------------------
     |  Methods inherited from pandas.core.base.SelectionMixin:
     |  
     |  __getitem__(self, key)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors inherited from pandas.core.base.SelectionMixin:
     |  
     |  ndim
    
    grouped.mean()
    
    key1
    a   -0.198323
    b   -0.613947
    Name: data1, dtype: float64
    
    df['data1'].groupby(['a','a','b','b','a']).mean()
    
    a   -0.198323
    b   -0.613947
    Name: data1, dtype: float64
    
    p=df['data1'];p
    
    0   -0.288087
    1    0.804061
    2   -1.672009
    3    0.444115
    4   -1.110942
    Name: data1, dtype: float64
    
    p.groupby(df['key1']).mean()
    
    key1
    a   -0.198323
    b   -0.613947
    Name: data1, dtype: float64
    

    The important thing above is that the data (a Series) has been aggregated according to the group key, producing a new Series that is now indexed by the unique values in the key1 column.

    The result index has the name 'key1' because the DataFrame column df['key1'] did!

    If instead we had passed multiple arrays as a list, we'd get something different:

    means=df['data1'].groupby([df['key1'],df['key2']]).mean();means
    
    key1  key2
    a     one    -0.699514
          two     0.804061
    b     one    -1.672009
          two     0.444115
    Name: data1, dtype: float64
    
    df['data1'].groupby([df['key2'],df['key1']]).mean()
    
    key2  key1
    one   a      -0.699514
          b      -1.672009
    two   a       0.804061
          b       0.444115
    Name: data1, dtype: float64
    

    Here we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed:

    means.unstack()
    
    key2       one       two
    key1
    a    -0.699514  0.804061
    b    -1.672009  0.444115

    In this example, the group keys are all Series, though they could be any arrays of the right length:

    states=np.array(['Ohio','California','Ohio','Ohio','California'])
    
    years=np.array([2005,2005,2006,2005,2006])
    
    df['data1'].groupby([states,years]).mean()
    
    California  2005    0.804061
                2006   -1.110942
    Ohio        2005    0.078014
                2006   -1.672009
    Name: data1, dtype: float64
    
    • The second method:
      Frequently the grouping information is found in the same DataFrame as the data you want to work on. In this case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:
    df.groupby('key1').mean()
    
             data1     data2
    key1
    a    -0.198323 -0.039899
    b    -0.613947  1.384352
    df.groupby(['key1','key2']).mean()
    
                  data1     data2
    key1 key2
    a    one  -0.699514 -0.386868
         two   0.804061  0.654039
    b    one  -1.672009  1.780028
         two   0.444115  0.988676

    You may have noticed in the first case of df.groupby('key1').mean() that there is no key2 column in the result. Because df['key2'] is not numeric data, it is said to be a nuisance column and is therefore excluded from the result. By default, all of the numeric columns are aggregated, though it is possible to filter down to a subset, as the sketch below shows.
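
    For example (a small sketch; selecting columns from a GroupBy object is covered in more detail further below), you can restrict the aggregation to an explicit subset of columns, which here reproduces the same numbers as the automatic exclusion of key2:

    df.groupby('key1')[['data1', 'data2']].mean()  # aggregate only the listed (numeric) columns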

    help(df.groupby(['key1','key2']).size)
    
    Help on method size in module pandas.core.groupby:
    
    size() method of pandas.core.groupby.DataFrameGroupBy instance
        Compute group sizes
        
        See also
        --------
        pandas.Series.groupby
        pandas.DataFrame.groupby
        pandas.Panel.groupby
    

    A generally useful GroupBy method is size, which returns a Series containing group sizes:

    df.groupby(['key1','key2']).size()
    
    key1  key2
    a     one     2
          two     1
    b     one     1
          two     1
    dtype: int64
    
    df
    
          data1     data2 key1 key2
    0 -0.288087 -1.355619    a  one
    1  0.804061  0.654039    a  two
    2 -1.672009  1.780028    b  one
    3  0.444115  0.988676    b  two
    4 -1.110942  0.581883    a  one

    We can see from the table above that the ('a', 'one') group indeed contains 2 rows.

    Iterating over groups

    The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.

    for name,group in df.groupby('key1'):
        print(name)
        print(group)
    
    a
          data1     data2 key1 key2
    0 -0.288087 -1.355619    a  one
    1  0.804061  0.654039    a  two
    4 -1.110942  0.581883    a  one
    b
          data1     data2 key1 key2
    2 -1.672009  1.780028    b  one
    3  0.444115  0.988676    b  two
    

    In the case of multiple keys, the first element in the tuple will be a tuple of key values.

    for (k1,k2),group in df.groupby(['key1','key2']):
        print((k1,k2))
        print(group)
    
    ('a', 'one')
          data1     data2 key1 key2
    0 -0.288087 -1.355619    a  one
    4 -1.110942  0.581883    a  one
    ('a', 'two')
          data1     data2 key1 key2
    1  0.804061  0.654039    a  two
    ('b', 'one')
          data1     data2 key1 key2
    2 -1.672009  1.780028    b  one
    ('b', 'two')
          data1     data2 key1 key2
    3  0.444115  0.988676    b  two
    
    df
    
          data1     data2 key1 key2
    0 -0.288087 -1.355619    a  one
    1  0.804061  0.654039    a  two
    2 -1.672009  1.780028    b  one
    3  0.444115  0.988676    b  two
    4 -1.110942  0.581883    a  one

    A useful recipe for getting hold of the pieces of data is to compute a dict of them as a one-liner:

    pieces=dict(list(df.groupby('key1')))
    
    pieces
    
    {'a':       data1     data2 key1 key2
     0 -0.288087 -1.355619    a  one
     1  0.804061  0.654039    a  two
     4 -1.110942  0.581883    a  one, 'b':       data1     data2 key1 key2
     2 -1.672009  1.780028    b  one
     3  0.444115  0.988676    b  two}
    
    pieces['b']
    
          data1     data2 key1 key2
    2 -1.672009  1.780028    b  one
    3  0.444115  0.988676    b  two
    list(df.groupby('key1'))
    
    [('a',       data1     data2 key1 key2
      0 -0.288087 -1.355619    a  one
      1  0.804061  0.654039    a  two
      4 -1.110942  0.581883    a  one), ('b',       data1     data2 key1 key2
      2 -1.672009  1.780028    b  one
      3  0.444115  0.988676    b  two)]
    
    dict([('a',1),('b',2)])
    
    {'a': 1, 'b': 2}
    
    dict([('a',1)])
    
    {'a': 1}
    
    dict([(1,2)])
    
    {1: 2}
    

    Above: examples of the dict built-in, which constructs a dict from a list of (key, value) tuples.

    df
    
          data1     data2 key1 key2
    0 -0.288087 -1.355619    a  one
    1  0.804061  0.654039    a  two
    2 -1.672009  1.780028    b  one
    3  0.444115  0.988676    b  two
    4 -1.110942  0.581883    a  one
    df.dtypes
    
    data1    float64
    data2    float64
    key1      object
    key2      object
    dtype: object
    
    grouped=df.groupby(df.dtypes,axis=1)  # notice: axis=1, so df is split into 2 DataFrames according to column dtype
    
    for dtype,group in grouped:
        print(dtype)
        print(group)
    
    float64
          data1     data2
    0 -0.288087 -1.355619
    1  0.804061  0.654039
    2 -1.672009  1.780028
    3  0.444115  0.988676
    4 -1.110942  0.581883
    object
      key1 key2
    0    a  one
    1    a  two
    2    b  one
    3    b  two
    4    a  one
    
    pp=dict(list(grouped))
    
    pp
    
    {dtype('float64'):       data1     data2
     0 -0.288087 -1.355619
     1  0.804061  0.654039
     2 -1.672009  1.780028
     3  0.444115  0.988676
     4 -1.110942  0.581883, dtype('O'):   key1 key2
     0    a  one
     1    a  two
     2    b  one
     3    b  two
     4    a  one}
    
    i,j=pp.keys()
    
    pp[i]
    
          data1     data2
    0 -0.288087 -1.355619
    1  0.804061  0.654039
    2 -1.672009  1.780028
    3  0.444115  0.988676
    4 -1.110942  0.581883
    pp[j]
    
      key1 key2
    0    a  one
    1    a  two
    2    b  one
    3    b  two
    4    a  one
    type(i)
    
    numpy.dtype
    
    pp[np.dtype('object')]
    
      key1 key2
    0    a  one
    1    a  two
    2    b  one
    3    b  two
    4    a  one

    Selecting a column or subset of columns

    a=list(df.groupby('key1')['data1']);a
    
    [('a', 0   -0.288087
      1    0.804061
      4   -1.110942
      Name: data1, dtype: float64), ('b', 2   -1.672009
      3    0.444115
      Name: data1, dtype: float64)]
    
    list(df['data1'].groupby(df['key1']))
    
    [('a', 0   -0.288087
      1    0.804061
      4   -1.110942
      Name: data1, dtype: float64), ('b', 2   -1.672009
      3    0.444115
      Name: data1, dtype: float64)]
    
    df
    
          data1     data2 key1 key2
    0 -0.288087 -1.355619    a  one
    1  0.804061  0.654039    a  two
    2 -1.672009  1.780028    b  one
    3  0.444115  0.988676    b  two
    4 -1.110942  0.581883    a  one
    df.groupby('key1').size()
    
    key1
    a    3
    b    2
    dtype: int64
    
    a=df.groupby('key1')['data1'];a
    
    <pandas.core.groupby.SeriesGroupBy object at 0x000001DDB5670F60>
    
    b=df.groupby('key1')[['data1']];a  # note: this line displays a again; b itself is a DataFrameGroupBy (compare bb below)
    
    <pandas.core.groupby.SeriesGroupBy object at 0x000001DDB5670F60>
    
    aa=df['data1'].groupby(df['key1']);aa
    
    <pandas.core.groupby.SeriesGroupBy object at 0x000001DDB56707B8>
    
    bb=df[['data1']].groupby(df['key1']);bb
    
    <pandas.core.groupby.DataFrameGroupBy object at 0x000001DDB56703C8>
    

    Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation.
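
    Put differently (a small sketch of the equivalence, using the df above): indexing the GroupBy object is syntactic sugar for grouping the selected column(s) directly, where a single column name yields a SeriesGroupBy and a list of names yields a DataFrameGroupBy.

    df.groupby('key1')['data1'].mean()    # same result as df['data1'].groupby(df['key1']).mean()
    df.groupby('key1')[['data2']].mean()  # same result as df[['data2']].groupby(df['key1']).mean()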

    df
    
          data1     data2 key1 key2
    0 -0.288087 -1.355619    a  one
    1  0.804061  0.654039    a  two
    2 -1.672009  1.780028    b  one
    3  0.444115  0.988676    b  two
    4 -1.110942  0.581883    a  one
    list(df.groupby(['key1','key2'])['data2'])
    
    [(('a', 'one'), 0   -1.355619
      4    0.581883
      Name: data2, dtype: float64), (('a', 'two'), 1    0.654039
      Name: data2, dtype: float64), (('b', 'one'), 2    1.780028
      Name: data2, dtype: float64), (('b', 'two'), 3    0.988676
      Name: data2, dtype: float64)]
    
    df.groupby(['key1','key2'])[['data2']].mean()
    
    data2
    key1 key2
    a one -0.386868
    two 0.654039
    b one 1.780028
    two 0.988676
    df[['data1','data2']].mean()
    
    data1   -0.364572
    data2    0.529801
    dtype: float64
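
    As a quick sanity check of the shapes involved (a sketch using the df defined above, not from the original text): indexing the grouped object with a single column name gives a SeriesGroupBy whose aggregation is a Series, while a one-element list gives a DataFrameGroupBy whose aggregation is a one-column DataFrame.

    s_mean = df.groupby('key1')['data1'].mean()    # Series indexed by key1
    f_mean = df.groupby('key1')[['data1']].mean()  # one-column DataFrame indexed by key1
    type(s_mean), type(f_mean)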
    

    Grouping with Dicts and Series

    Grouping information may exist in a form other than an array.

    people=pd.DataFrame(np.random.randn(5,5),columns=list('abcde'),index=['Joe','Steve','Wes','Jim','Travis']);people
    
    a b c d e
    Joe 0.696631 -0.491001 -0.065796 -1.392325 -0.651280
    Steve 0.965853 -1.694954 2.142343 -0.969325 1.129057
    Wes -0.145624 -0.548971 -0.893291 2.421140 0.305064
    Jim 0.434581 1.189071 0.455694 -0.929888 1.193285
    Travis -0.321174 -1.400983 -0.529033 0.037673 1.221623

    Now suppose we have a group correspondence for the columns and want to sum together the columns by group.

    mapping={'a':'red','b':'red',
            'c':'blue','d':'blue',
            'e':'red','f':'orange'}
    
    by_column=people.groupby(mapping,axis=1)
    
    by_column.sum()
    
    blue red
    Joe -1.458121 -0.445651
    Steve 1.173019 0.399957
    Wes 1.527849 -0.389532
    Jim -0.474194 2.816937
    Travis -0.491360 -0.500533

    The same functionality holds for Series,which can be viewed as a fixed-size mapping.

    
    map_series=pd.Series(mapping);map_series
    
    a       red
    b       red
    c      blue
    d      blue
    e       red
    f    orange
    dtype: object
    
    people.groupby(map_series,axis=1).count()
    
    blue red
    Joe 2 3
    Steve 2 3
    Wes 2 3
    Jim 2 3
    Travis 2 3
    people.groupby(['red','red','blue','blue','red']).mean()
    
    a b c d e
    blue 0.144479 0.320050 -0.218799 0.745626 0.749174
    red 0.447103 -1.195646 0.515838 -0.774659 0.566467
    people.groupby(['red','red','blue','blue','red'],axis=1).mean()
    
    blue red
    Joe -0.729060 -0.148550
    Steve 0.586509 0.133319
    Wes 0.763924 -0.129844
    Jim -0.237097 0.938979
    Travis -0.245680 -0.166844

    Grouping with Functions

    Using Python functions is a more generic way of defining a group mapping compared with a dict or Series. Any function passed as a group key will be called once per index value, with the return values being used as the group names.

    Suppose you want to group by the length of the names; while you could compute an array of string lengths, it's simpler to just pass the len function.

    people.groupby(len).sum()
    
    a b c d e
    3 0.985588 0.149098 -0.503394 0.098927 0.847068
    5 0.965853 -1.694954 2.142343 -0.969325 1.129057
    6 -0.321174 -1.400983 -0.529033 0.037673 1.221623

    Mixing functions with arrays, dicts, or Series is not a problem, as everything gets converted to arrays internally:

    key_list=['one','one','one','two','two']
    
    key_test=[1,1,1,1,1]
    
    people.groupby([len,key_list]).min()
    
    a b c d e
    3 one -0.145624 -0.548971 -0.893291 -1.392325 -0.651280
    two 0.434581 1.189071 0.455694 -0.929888 1.193285
    5 one 0.965853 -1.694954 2.142343 -0.969325 1.129057
    6 two -0.321174 -1.400983 -0.529033 0.037673 1.221623
    people.groupby([len,key_test]).min()
    
    a b c d e
    3 1 -0.145624 -0.548971 -0.893291 -1.392325 -0.651280
    5 1 0.965853 -1.694954 2.142343 -0.969325 1.129057
    6 1 -0.321174 -1.400983 -0.529033 0.037673 1.221623

    Grouping by Index levels

    A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an index.

    columns=pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3,]],names=['City','tenor'])
    
    hier_df=pd.DataFrame(np.random.randn(4,5),columns=columns);hier_df
    
    City US JP
    tenor 1 3 5 1 3
    0 -1.878462 1.360321 -1.444771 0.883927 0.995222
    1 -1.082905 1.061055 -0.938387 -0.391143 1.383284
    2 -0.626484 -0.298127 1.144298 0.941223 1.163550
    3 -0.272500 -3.441699 0.804467 -0.029275 -0.003555

    To group by level, pass the level number or name using the level keyword.

    hier_df.groupby(level='City',axis=1).count()
    
    City JP US
    0 2 3
    1 2 3
    2 2 3
    3 2 3

    Data aggregation

    Aggregations refer to any data transformation that produces scalar values from arrays.

    description=['Number of non-NAN values in the group',
                'Sum of non-NA values',
                'Mean of non-NA values',
                'Arithmetic median of non-NAN values',
                'Unbiased(n-1 denominator)standard deviation and variance',
                'Minimum and maximum of non-NA values',
                'Product of non-NA values',
                'First and last non-NA value']
    

    Tip: custom aggregation functions are generally much slower than the optimized functions in the table above. This is because there is some extra overhead (function calls, data rearrangement) in constructing the intermediate group chunks.

    index=pd.Index(['count','sum','mean','median','std,var','min,max','prod','first,last'],name='Function name')
    
    pd.DataFrame(description,index=index)
    
    0
    count Number of non-NAN values in the group
    sum Sum of non-NA values
    mean Mean of non-NA values
    median Arithmetic median of non-NAN values
    std,var Unbiased(n-1 denominator)standard deviation an...
    min,max Minimum and maximum of non-NA values
    prod Product of non-NA values
    first,last First and last non-NA value
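
    The optimized methods in this table can be called directly on a GroupBy object. A quick sketch using the df from above (g is just an illustrative local name):

    g = df.groupby('key1')
    g['data1'].count()   # non-NA count per group
    g['data1'].sum()     # sum of non-NA values per group
    g['data1'].median()  # median per group
    g['data1'].prod()    # product per group
    g['data1'].first()   # first non-NA value per group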

    To use your own aggregation functions,pass any function that aggregates an array to the aggregate or agg method.

    grouped=df.groupby('key1')
    
    def peak_to_peak(arr):
        return arr.max()-arr.min()
    
    grouped.agg(peak_to_peak)
    
    data1 data2
    key1
    a 1.915003 2.009658
    b 2.116124 0.791351
    df
    
    data1 data2 key1 key2
    0 -0.288087 -1.355619 a one
    1 0.804061 0.654039 a two
    2 -1.672009 1.780028 b one
    3 0.444115 0.988676 b two
    4 -1.110942 0.581883 a one
    def length(arr):
        # root of the sum of squares (Euclidean norm) of the group's values
        total=0
        for i in arr:
            total+=i**2
        return total**0.5
            
    
    grouped.agg(length)
    
    data1 data2
    key1
    a 1.401321 1.613710
    b 1.729986 2.036168
    grouped.describe()
    
    data1 data2
    key1
    a count 3.000000 3.000000
    mean -0.198323 -0.039899
    std 0.960652 1.140018
    min -1.110942 -1.355619
    25% -0.699514 -0.386868
    50% -0.288087 0.581883
    75% 0.257987 0.617961
    max 0.804061 0.654039
    b count 2.000000 2.000000
    mean -0.613947 1.384352
    std 1.496326 0.559570
    min -1.672009 0.988676
    25% -1.142978 1.186514
    50% -0.613947 1.384352
    75% -0.084916 1.582190
    max 0.444115 1.780028

    You may notice that some methods like describe also work, even though they are not aggregations strictly speaking.
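
    Inside groupby, such methods are effectively dispatched to each group. A sketch of the roughly equivalent apply call (layout of the result may differ slightly):

    f = lambda x: x.describe()
    grouped.apply(f)  # should match grouped.describe() up to layout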

    Column-wise and multiple function application

    tips=pd.read_csv(r'./pydata-book-2nd-edition/examples/tips.csv')
    
    tips.head()
    
    total_bill tip smoker day time size
    0 16.99 1.01 No Sun Dinner 2
    1 10.34 1.66 No Sun Dinner 3
    2 21.01 3.50 No Sun Dinner 3
    3 23.68 3.31 No Sun Dinner 2
    4 24.59 3.61 No Sun Dinner 4
    tips['tip_pct']=tips['tip']/tips['total_bill']
    

    As we have seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column, or multiple functions at once.

    grouped=tips.groupby(['day','smoker'])
    
    group_pct=grouped['tip_pct']
    
    grouped.size()
    
    day   smoker
    Fri   No         4
          Yes       15
    Sat   No        45
          Yes       42
    Sun   No        57
          Yes       19
    Thur  No        45
          Yes       17
    dtype: int64
    
    group_pct.agg('mean')
    
    day   smoker
    Fri   No        0.151650
          Yes       0.174783
    Sat   No        0.158048
          Yes       0.147906
    Sun   No        0.160113
          Yes       0.187250
    Thur  No        0.160298
          Yes       0.163863
    Name: tip_pct, dtype: float64
    

    If you pass a list of functions or function names instead,you get back a DataFrame with column names taken from the functions:

    group_pct.agg(['mean','std',peak_to_peak])
    
    mean std peak_to_peak
    day smoker
    Fri No 0.151650 0.028123 0.067349
    Yes 0.174783 0.051293 0.159925
    Sat No 0.158048 0.039767 0.235193
    Yes 0.147906 0.061375 0.290095
    Sun No 0.160113 0.042347 0.193226
    Yes 0.187250 0.154134 0.644685
    Thur No 0.160298 0.038774 0.193350
    Yes 0.163863 0.039389 0.151240
    df=pd.DataFrame(np.arange(12).reshape(3,4),index=['one','two','three'],columns=['o','t','th','f']);df
    
    o t th f
    one 0 1 2 3
    two 4 5 6 7
    three 8 9 10 11
    list(df.groupby([0,0,1])['f'])
    
    [(0, one    3
      two    7
      Name: f, dtype: int32), (1, three    11
      Name: f, dtype: int32)]
    

    If you pass a list of (name,function) tuples,the first element of each tuple will be used as the DataFrame column names.

    group_pct.agg([('foo','mean'),('bar',np.std)])
    
    foo bar
    day smoker
    Fri No 0.151650 0.028123
    Yes 0.174783 0.051293
    Sat No 0.158048 0.039767
    Yes 0.147906 0.061375
    Sun No 0.160113 0.042347
    Yes 0.187250 0.154134
    Thur No 0.160298 0.038774
    Yes 0.163863 0.039389

    With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:

    functions=['count','mean','max']
    
    result=grouped['tip_pct','total_bill'].agg(functions);result
    
    tip_pct total_bill
    count mean max count mean max
    day smoker
    Fri No 4 0.151650 0.187735 4 18.420000 22.75
    Yes 15 0.174783 0.263480 15 16.813333 40.17
    Sat No 45 0.158048 0.291990 45 19.661778 48.33
    Yes 42 0.147906 0.325733 42 21.276667 50.81
    Sun No 57 0.160113 0.252672 57 20.506667 48.17
    Yes 19 0.187250 0.710345 19 24.120000 45.35
    Thur No 45 0.160298 0.266312 45 17.113111 41.19
    Yes 17 0.163863 0.241255 17 19.190588 43.11

    As you can see, the resulting DataFrame has hierarchical columns, the same as you would get by aggregating each column separately and using concat to glue the results together, with the column names as the keys argument.
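
    A sketch of that equivalence (not in the original text), aggregating each column separately and gluing the pieces with concat, where the keys become the top level of the column index:

    pieces = [grouped['tip_pct'].agg(functions),
              grouped['total_bill'].agg(functions)]
    pd.concat(pieces, axis=1, keys=['tip_pct', 'total_bill'])  # should match result above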

    As before,a list of tuples with custom names can be passed:

    ftuples=[('Durchs','mean'),('adkfd',np.var)]
    
    grouped['tip_pct','total_bill'].agg(ftuples)
    
    tip_pct total_bill
    Durchs adkfd Durchs adkfd
    day smoker
    Fri No 0.151650 0.000791 18.420000 25.596333
    Yes 0.174783 0.002631 16.813333 82.562438
    Sat No 0.158048 0.001581 19.661778 79.908965
    Yes 0.147906 0.003767 21.276667 101.387535
    Sun No 0.160113 0.001793 20.506667 66.099980
    Yes 0.187250 0.023757 24.120000 109.046044
    Thur No 0.160298 0.001503 17.113111 59.625081
    Yes 0.163863 0.001551 19.190588 69.808518

    Now, suppose you want to apply potentially different functions to one or more of the columns. To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:

    grouped.agg({'tip':np.max,'size':'sum'})
    
    tip size
    day smoker
    Fri No 3.50 9
    Yes 4.73 31
    Sat No 9.00 115
    Yes 10.00 104
    Sun No 6.00 167
    Yes 6.50 49
    Thur No 6.70 112
    Yes 5.00 40
    grouped.agg(ftuples)
    
    total_bill tip size tip_pct
    Durchs adkfd Durchs adkfd Durchs adkfd Durchs adkfd
    day smoker
    Fri No 18.420000 25.596333 2.812500 0.807292 2.250000 0.250000 0.151650 0.000791
    Yes 16.813333 82.562438 2.714000 1.161369 2.066667 0.352381 0.174783 0.002631
    Sat No 19.661778 79.908965 3.102889 2.696453 2.555556 0.616162 0.158048 0.001581
    Yes 21.276667 101.387535 2.875476 2.658791 2.476190 0.743322 0.147906 0.003767
    Sun No 20.506667 66.099980 3.167895 1.500099 2.929825 1.066416 0.160113 0.001793
    Yes 24.120000 109.046044 3.516842 1.590501 2.578947 0.812865 0.187250 0.023757
    Thur No 17.113111 59.625081 2.673778 1.645997 2.488889 1.391919 0.160298 0.001503
    Yes 19.190588 69.808518 3.030000 1.239863 2.352941 0.492647 0.163863 0.001551
    grouped.agg({'tip_pct':['min','max','mean','std'],'size':'sum'})
    
    tip_pct size
    min max mean std sum
    day smoker
    Fri No 0.120385 0.187735 0.151650 0.028123 9
    Yes 0.103555 0.263480 0.174783 0.051293 31
    Sat No 0.056797 0.291990 0.158048 0.039767 115
    Yes 0.035638 0.325733 0.147906 0.061375 104
    Sun No 0.059447 0.252672 0.160113 0.042347 167
    Yes 0.065660 0.710345 0.187250 0.154134 49
    Thur No 0.072961 0.266312 0.160298 0.038774 112
    Yes 0.090014 0.241255 0.163863 0.039389 40

    A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.

    Returning Aggregated data without row index

    In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this is not always desirable, you can disable this behaviour in most cases by passing as_index=False to groupby.

    tips.groupby(['day','smoker'],as_index=False).mean()
    
    day smoker total_bill tip size tip_pct
    0 Fri No 18.420000 2.812500 2.250000 0.151650
    1 Fri Yes 16.813333 2.714000 2.066667 0.174783
    2 Sat No 19.661778 3.102889 2.555556 0.158048
    3 Sat Yes 21.276667 2.875476 2.476190 0.147906
    4 Sun No 20.506667 3.167895 2.929825 0.160113
    5 Sun Yes 24.120000 3.516842 2.578947 0.187250
    6 Thur No 17.113111 2.673778 2.488889 0.160298
    7 Thur Yes 19.190588 3.030000 2.352941 0.163863
    tips.groupby(['day','smoker']).mean()
    
    total_bill tip size tip_pct
    day smoker
    Fri No 18.420000 2.812500 2.250000 0.151650
    Yes 16.813333 2.714000 2.066667 0.174783
    Sat No 19.661778 3.102889 2.555556 0.158048
    Yes 21.276667 2.875476 2.476190 0.147906
    Sun No 20.506667 3.167895 2.929825 0.160113
    Yes 24.120000 3.516842 2.578947 0.187250
    Thur No 17.113111 2.673778 2.488889 0.160298
    Yes 19.190588 3.030000 2.352941 0.163863
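
    The same flat layout can also be obtained after the fact by resetting the index; a sketch:

    tips.groupby(['day', 'smoker']).mean().reset_index()  # group keys become ordinary columns again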

    Apply: General split-apply-combine

    The most general-purpose GroupBy method is apply,which is the subject of the rest of this section.

    tips.head()
    
    total_bill tip smoker day time size tip_pct
    0 16.99 1.01 No Sun Dinner 2 0.059447
    1 10.34 1.66 No Sun Dinner 3 0.160542
    2 21.01 3.50 No Sun Dinner 3 0.166587
    3 23.68 3.31 No Sun Dinner 2 0.139780
    4 24.59 3.61 No Sun Dinner 4 0.146808

    Returning to the tipping dataset from before,suppose you want to select the top five tip_pct values by group.First,write a function that selects the rows with the largest values in a particular column:

    def top(df,n=5,column='tip_pct'):
        return df.sort_values(by=column)[-n:]
    
    tips.sort_values(by='tip_pct')[-5:]
    
    total_bill tip smoker day time size tip_pct
    183 23.17 6.50 Yes Sun Dinner 4 0.280535
    232 11.61 3.39 No Sat Dinner 2 0.291990
    67 3.07 1.00 Yes Sat Dinner 1 0.325733
    178 9.60 4.00 Yes Sun Dinner 2 0.416667
    172 7.25 5.15 Yes Sun Dinner 2 0.710345
    top(tips,n=6)
    
    total_bill tip smoker day time size tip_pct
    109 14.31 4.00 Yes Sat Dinner 2 0.279525
    183 23.17 6.50 Yes Sun Dinner 4 0.280535
    232 11.61 3.39 No Sat Dinner 2 0.291990
    67 3.07 1.00 Yes Sat Dinner 1 0.325733
    178 9.60 4.00 Yes Sun Dinner 2 0.416667
    172 7.25 5.15 Yes Sun Dinner 2 0.710345

    Now, if we group by smoker and call apply with this function, we get:

    tips.groupby('smoker').apply(top)
    
    total_bill tip smoker day time size tip_pct
    smoker
    No 88 24.71 5.85 No Thur Lunch 2 0.236746
    185 20.69 5.00 No Sun Dinner 5 0.241663
    51 10.29 2.60 No Sun Dinner 2 0.252672
    149 7.51 2.00 No Thur Lunch 2 0.266312
    232 11.61 3.39 No Sat Dinner 2 0.291990
    Yes 109 14.31 4.00 Yes Sat Dinner 2 0.279525
    183 23.17 6.50 Yes Sun Dinner 4 0.280535
    67 3.07 1.00 Yes Sat Dinner 1 0.325733
    178 9.60 4.00 Yes Sun Dinner 2 0.416667
    172 7.25 5.15 Yes Sun Dinner 2 0.710345

    If you pass a function to apply that takes other arguments or keywords,you can pass these after the function:

    tips.groupby(['smoker','day']).apply(top,n=2,column='total_bill')
    
    total_bill tip smoker day time size tip_pct
    smoker day
    No Fri 91 22.49 3.50 No Fri Dinner 2 0.155625
    94 22.75 3.25 No Fri Dinner 2 0.142857
    Sat 59 48.27 6.73 No Sat Dinner 4 0.139424
    212 48.33 9.00 No Sat Dinner 4 0.186220
    Sun 112 38.07 4.00 No Sun Dinner 3 0.105070
    156 48.17 5.00 No Sun Dinner 6 0.103799
    Thur 85 34.83 5.17 No Thur Lunch 4 0.148435
    142 41.19 5.00 No Thur Lunch 5 0.121389
    Yes Fri 90 28.97 3.00 Yes Fri Dinner 2 0.103555
    95 40.17 4.73 Yes Fri Dinner 4 0.117750
    Sat 102 44.30 2.50 Yes Sat Dinner 3 0.056433
    170 50.81 10.00 Yes Sat Dinner 3 0.196812
    Sun 184 40.55 3.00 Yes Sun Dinner 2 0.073983
    182 45.35 3.50 Yes Sun Dinner 3 0.077178
    Thur 83 32.68 5.00 Yes Thur Lunch 2 0.152999
    197 43.11 5.00 Yes Thur Lunch 4 0.115982

    Suppressing the Group keys

    In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object. You can disable this by passing group_keys=False to groupby.

    tips.groupby('smoker',group_keys=False).apply(top)
    
    total_bill tip smoker day time size tip_pct
    88 24.71 5.85 No Thur Lunch 2 0.236746
    185 20.69 5.00 No Sun Dinner 5 0.241663
    51 10.29 2.60 No Sun Dinner 2 0.252672
    149 7.51 2.00 No Thur Lunch 2 0.266312
    232 11.61 3.39 No Sat Dinner 2 0.291990
    109 14.31 4.00 Yes Sat Dinner 2 0.279525
    183 23.17 6.50 Yes Sun Dinner 4 0.280535
    67 3.07 1.00 Yes Sat Dinner 1 0.325733
    178 9.60 4.00 Yes Sun Dinner 2 0.416667
    172 7.25 5.15 Yes Sun Dinner 2 0.710345

    **Notice the difference between agg and apply:** agg reduces each column of the GroupBy object to a single value per group, while apply passes each group piece to the function, so the result need not be a single value per group the way it is with agg.

    tips.groupby('smoker').agg('max')
    
    total_bill tip day time size tip_pct
    smoker
    No 48.33 9.0 Thur Lunch 6 0.291990
    Yes 50.81 10.0 Thur Lunch 5 0.710345
    tips.groupby('smoker').apply(max)
    
    total_bill tip smoker day time size tip_pct
    smoker
    No 48.33 9.0 No Thur Lunch 6 0.291990
    Yes 50.81 10.0 Yes Thur Lunch 5 0.710345

    The two calls above return essentially the same result. To some extent, agg is a specialized case of apply.
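
    A small sketch of the contrast (the nlargest call is just an illustration, not from the original text): agg must reduce each group to one value, while apply may return several rows per group:

    tips.groupby('smoker')['tip_pct'].agg('max')                       # one value per group
    tips.groupby('smoker')['tip_pct'].apply(lambda s: s.nlargest(2))   # two rows per group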

    Quantile and bucket analysis

    Combining qcut or cut with groupby makes it convenient to perform bucket or quantile analysis on a dataset.

    frame=pd.DataFrame({'data1':np.random.randn(1000),
                       'data2':np.random.randn(1000)})
    
    quantiles=pd.cut(frame.data1,4)
    
    quantiles[:10]
    
    0    (0.00174, 1.893]
    1    (0.00174, 1.893]
    2    (0.00174, 1.893]
    3     (-3.788, -1.89]
    4    (0.00174, 1.893]
    5    (0.00174, 1.893]
    6    (0.00174, 1.893]
    7     (-3.788, -1.89]
    8    (0.00174, 1.893]
    9    (0.00174, 1.893]
    Name: data1, dtype: category
    Categories (4, object): [(-3.788, -1.89] < (-1.89, 0.00174] < (0.00174, 1.893] < (1.893, 3.784]]
    
    help(pd.cut)
    
    Help on function cut in module pandas.tools.tile:
    
    cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
        Return indices of half-open bins to which each value of `x` belongs.
        
        Parameters
        ----------
        x : array-like
            Input array to be binned. It has to be 1-dimensional.
        bins : int or sequence of scalars
            If `bins` is an int, it defines the number of equal-width bins in the
            range of `x`. However, in this case, the range of `x` is extended
            by .1% on each side to include the min or max values of `x`. If
            `bins` is a sequence it defines the bin edges allowing for
            non-uniform bin width. No extension of the range of `x` is done in
            this case.
        right : bool, optional
            Indicates whether the bins include the rightmost edge or not. If
            right == True (the default), then the bins [1,2,3,4] indicate
            (1,2], (2,3], (3,4].
        labels : array or boolean, default None
            Used as labels for the resulting bins. Must be of the same length as
            the resulting bins. If False, return only integer indicators of the
            bins.
        retbins : bool, optional
            Whether to return the bins or not. Can be useful if bins is given
            as a scalar.
        precision : int
            The precision at which to store and display the bins labels
        include_lowest : bool
            Whether the first interval should be left-inclusive or not.
        
        Returns
        -------
        out : Categorical or Series or array of integers if labels is False
            The return type (Categorical or Series) depends on the input: a Series
            of type category if input is a Series else Categorical. Bins are
            represented as categories when categorical data is returned.
        bins : ndarray of floats
            Returned only if `retbins` is True.
        
        Notes
        -----
        The `cut` function can be useful for going from a continuous variable to
        a categorical variable. For example, `cut` could convert ages to groups
        of age ranges.
        
        Any NA values will be NA in the result.  Out of bounds values will be NA in
        the resulting Categorical object
        
        
        Examples
        --------
        >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
        ([(0.191, 3.367], (0.191, 3.367], (0.191, 3.367], (3.367, 6.533],
          (6.533, 9.7], (0.191, 3.367]]
        Categories (3, object): [(0.191, 3.367] < (3.367, 6.533] < (6.533, 9.7]],
        array([ 0.1905    ,  3.36666667,  6.53333333,  9.7       ]))
        >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3,
                   labels=["good","medium","bad"])
        [good, good, good, medium, bad, good]
        Categories (3, object): [good < medium < bad]
        >>> pd.cut(np.ones(5), 4, labels=False)
        array([1, 1, 1, 1, 1], dtype=int64)
    
    help(pd.qcut)
    
    Help on function qcut in module pandas.tools.tile:
    
    qcut(x, q, labels=None, retbins=False, precision=3)
        Quantile-based discretization function. Discretize variable into
        equal-sized buckets based on rank or based on sample quantiles. For example
        1000 values for 10 quantiles would produce a Categorical object indicating
        quantile membership for each data point.
        
        Parameters
        ----------
        x : ndarray or Series
        q : integer or array of quantiles
            Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately
            array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles
        labels : array or boolean, default None
            Used as labels for the resulting bins. Must be of the same length as
            the resulting bins. If False, return only integer indicators of the
            bins.
        retbins : bool, optional
            Whether to return the bins or not. Can be useful if bins is given
            as a scalar.
        precision : int
            The precision at which to store and display the bins labels
        
        Returns
        -------
        out : Categorical or Series or array of integers if labels is False
            The return type (Categorical or Series) depends on the input: a Series
            of type category if input is a Series else Categorical. Bins are
            represented as categories when categorical data is returned.
        bins : ndarray of floats
            Returned only if `retbins` is True.
        
        Notes
        -----
        Out of bounds values will be NA in the resulting Categorical object
        
        Examples
        --------
        >>> pd.qcut(range(5), 4)
        [[0, 1], [0, 1], (1, 2], (2, 3], (3, 4]]
        Categories (4, object): [[0, 1] < (1, 2] < (2, 3] < (3, 4]]
        >>> pd.qcut(range(5), 3, labels=["good","medium","bad"])
        [good, good, medium, bad, bad]
        Categories (3, object): [good < medium < bad]
        >>> pd.qcut(range(5), 4, labels=False)
        array([0, 0, 1, 2, 3], dtype=int64)
    

    The Categorical object returned by cut can be passed directly to groupby. So we can compute a set of statistics for the data2 column like so:

    def get_status(group):
        return {'min':group.min(),'max':group.max(),
               'count':group.count(),'mean':group.mean()}
    
    grouped=frame.data2.groupby(quantiles)
    
    all(frame.data2==frame['data2'])
    
    True
    
    df=pd.Series([1,2,3,5])
    
    pd.cut(df,2)
    
    0    (0.996, 3]
    1    (0.996, 3]
    2    (0.996, 3]
    3        (3, 5]
    dtype: category
    Categories (2, object): [(0.996, 3] < (3, 5]]
    
    pd.qcut(df,2)
    
    0    [1, 2.5]
    1    [1, 2.5]
    2    (2.5, 5]
    3    (2.5, 5]
    dtype: category
    Categories (2, object): [[1, 2.5] < (2.5, 5]]
    
    fig=plt.figure()
    
    df.plot()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x1ddb56f2908>
    
    grouped=frame.data2.groupby(quantiles)
    
    grouped.size()
    
    data1
    (-3.788, -1.89]      34
    (-1.89, 0.00174]    443
    (0.00174, 1.893]    492
    (1.893, 3.784]       31
    dtype: int64
    

    Here, pd.cut(frame.data1, 4) returns a length-1000 Categorical that divides data1 into 4 equal-width bins.

    grouped.apply(get_status)
    
    data1                  
    (-3.788, -1.89]   count     34.000000
                      max        1.935178
                      mean      -0.147920
                      min       -1.843208
    (-1.89, 0.00174]  count    443.000000
                      max        2.547017
                      mean      -0.020845
                      min       -2.779875
    (0.00174, 1.893]  count    492.000000
                      max        2.729491
                      mean      -0.066996
                      min       -3.854003
    (1.893, 3.784]    count     31.000000
                      max        1.533001
                      mean      -0.070277
                      min       -1.639600
    Name: data2, dtype: float64
    
    grouped.apply(get_status).unstack()
    
    count max mean min
    data1
    (-3.788, -1.89] 34.0 1.935178 -0.147920 -1.843208
    (-1.89, 0.00174] 443.0 2.547017 -0.020845 -2.779875
    (0.00174, 1.893] 492.0 2.729491 -0.066996 -3.854003
    (1.893, 3.784] 31.0 1.533001 -0.070277 -1.639600
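
    Because get_status only returns standard reductions, a roughly equivalent table can also be produced with agg (column order may differ); a sketch:

    grouped.agg(['min', 'max', 'count', 'mean'])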

    These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use qcut. We will pass labels=False to just get quantile numbers:

    grouping=pd.qcut(frame.data1,10,labels=False)
    
    grouped=frame.data2.groupby(grouping)
    
    grouped.size()
    
    data1
    0    100
    1    100
    2    100
    3    100
    4    100
    5    100
    6    100
    7    100
    8    100
    9    100
    dtype: int64
    
    grouped.apply(get_status).unstack()
    
    count max mean min
    data1
    0 100.0 2.547017 -0.013468 -2.357562
    1 100.0 2.462236 -0.042549 -2.698867
    2 100.0 2.546495 0.084219 -2.423998
    3 100.0 2.225054 0.035150 -2.778011
    4 100.0 1.669557 -0.134327 -2.779875
    5 100.0 2.649044 -0.071834 -1.871068
    6 100.0 2.729491 0.036419 -2.505988
    7 100.0 2.014786 -0.009873 -2.671353
    8 100.0 2.043135 -0.193331 -3.854003
    9 100.0 2.491035 -0.184446 -2.033173

    Example: Filling missing values with group-specific values

    When cleaning up missing data, in some cases you will remove missing observations using dropna, but in others you may want to impute (fill in) the null (NA) values using a fixed value or some value derived from the data; fillna is the right tool to use:

    s=pd.Series(np.random.randn(6))
    
    s[::2]=np.nan
    
    s
    
    0         NaN
    1   -0.875438
    2         NaN
    3    1.106316
    4         NaN
    5    0.020065
    dtype: float64
    
    s.fillna(s.mean())
    
    0    0.083648
    1   -0.875438
    2    0.083648
    3    1.106316
    4    0.083648
    5    0.020065
    dtype: float64
    

    Suppose you need the fill value to vary by group.One way to do this is to group the data and use apply with a function that calls fillna on each data chunk.

    states=['Ohio','New york','Vermont','Florida','Oregon','Nevada','California','Idaho']
    
    group_key=['East']*4+['West']*4
    
    data=pd.Series(np.random.randn(8),index=states)
    
    data[['Vermont','Nevada','Idaho']]=np.nan
    
    data.groupby(group_key).mean()
    
    East    0.199033
    West   -0.704578
    dtype: float64
    

    We can fill the NA values using the group means like so:

    fill_mean=lambda g : g.fillna(g.mean())
    
    data.groupby(group_key).apply(fill_mean)
    
    Ohio         -0.953675
    New york      0.531946
    Vermont       0.199033
    Florida       1.018829
    Oregon       -0.539930
    Nevada       -0.704578
    California   -0.869226
    Idaho        -0.704578
    dtype: float64
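
    An alternative sketch (not from the original text) of the same group-mean fill using transform, which broadcasts each group's mean back onto the original index:

    data.fillna(data.groupby(group_key).transform('mean'))  # matches the apply(fill_mean) result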
    

    In another case,you might have predefined fill values in your code that vary by group.

    fill_values={'East':0.5,'West':-1}
    
    fill_func=lambda g:g.fillna(fill_values[g.name])  # each group piece has a .name attribute set to its group key
    
    data.groupby(group_key).apply(fill_func)
    
    Ohio         -0.953675
    New york      0.531946
    Vermont       0.500000
    Florida       1.018829
    Oregon       -0.539930
    Nevada       -1.000000
    California   -0.869226
    Idaho        -1.000000
    dtype: float64
    

    Example: Random sampling and permutation

    Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the 'draws'; here we use the sample method of Series.

    suits=list('HSCD')
    
    card_val=(list(range(1,11))+[10]*3)*4
    
    base_names=['A']+list(range(2,11))+['J','K','Q'];base_names
    
    ['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
    
    cards=[]
    
    for suit in ['H','S','C','D']:
        cards.extend(str(num)+suit for num in base_names)
    
    deck=pd.Series(card_val,index=cards)
    
    deck
    
    AH      1
    2H      2
    3H      3
    4H      4
    5H      5
    6H      6
    7H      7
    8H      8
    9H      9
    10H    10
    JH     10
    KH     10
    QH     10
    AS      1
    2S      2
    3S      3
    4S      4
    5S      5
    6S      6
    7S      7
    8S      8
    9S      9
    10S    10
    JS     10
    KS     10
    QS     10
    AC      1
    2C      2
    3C      3
    4C      4
    5C      5
    6C      6
    7C      7
    8C      8
    9C      9
    10C    10
    JC     10
    KC     10
    QC     10
    AD      1
    2D      2
    3D      3
    4D      4
    5D      5
    6D      6
    7D      7
    8D      8
    9D      9
    10D    10
    JD     10
    KD     10
    QD     10
    dtype: int64
    
    def draw(deck,n=5):
        return deck.sample(n)
    
    draw(deck)
    
    4H     4
    AD     1
    8S     8
    3C     3
    KH    10
    dtype: int64
    

    Suppose you want two random cards from each suit.Because the suit is the last character of each card name,we can group based on this and use apply.

    get_suit=lambda card:card[-1]
    
    deck.groupby(get_suit).apply(draw,n=2)
    
    C  KC     10
       8C      8
    D  5D      5
       8D      8
    H  QH     10
       5H      5
    S  4S      4
       10S    10
    dtype: int64
    

    Alternatively,we could write:

    deck.groupby(get_suit,group_keys=False).apply(draw,2)
    
    8C      8
    KC     10
    8D      8
    9D      9
    8H      8
    QH     10
    10S    10
    4S      4
    dtype: int64
    

    Example: Group weighted average and correlation

    Under the split-apply-combine paradigm of groupby, operations between columns in a DataFrame or between two Series, such as a group weighted average, are possible.

    df=pd.DataFrame({'category':['a','a','a','a','b','b','b','b'],
                    'data':np.random.randn(8),'weights':np.random.randn(8)});df
    
    category data weights
    0 a 0.902349 1.730949
    1 a 1.287855 0.768677
    2 a -0.751881 -0.571836
    3 a 1.208645 0.229177
    4 b 0.986127 -0.262788
    5 b 1.566069 -0.252579
    6 b 0.285959 0.184662
    7 b 1.059962 1.804769

    The group weighted average by category would then be:

    grouped=df.groupby('category')
    
    get_wavg=lambda g:np.average(g['data'],weights=g['weights'])
    
    list(grouped)
    
    [('a',   category      data   weights
      0        a  0.902349  1.730949
      1        a  1.287855  0.768677
      2        a -0.751881 -0.571836
      3        a  1.208645  0.229177), ('b',   category      data   weights
      4        b  0.986127 -0.262788
      5        b  1.566069 -0.252579
      6        b  0.285959  0.184662
      7        b  1.059962  1.804769)]
    
    grouped.apply(get_wavg)
    
    category
    a    1.510830
    b    0.889441
    dtype: float64
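
    To verify one group by hand (a sketch; np.average with weights computes sum(data * weights) / sum(weights)):

    a = df[df['category'] == 'a']
    (a['data'] * a['weights']).sum() / a['weights'].sum()  # should equal the 'a' entry above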
    
    close_px=pd.read_csv('./pydata-book-2nd-edition/examples/stock_px_2.csv',parse_dates=True,index_col=0)
    
    close_px.head()
    
    AAPL MSFT XOM SPX
    2003-01-02 7.40 21.11 29.22 909.03
    2003-01-03 7.45 21.14 29.24 908.59
    2003-01-06 7.45 21.52 29.96 929.01
    2003-01-07 7.43 21.93 28.95 922.93
    2003-01-08 7.28 21.31 28.83 909.93
    help(pd.read_csv)
    
    Help on function read_csv in module pandas.io.parsers:
    
    read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=False, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=False, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)
        Read CSV (comma-separated) file into DataFrame
        
        Also supports optionally iterating or breaking of the file
        into chunks.
        
        Additional help can be found in the `online docs for IO Tools
        <http://pandas.pydata.org/pandas-docs/stable/io.html>`_.
        
        Parameters
        ----------
        filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any object with a read() method (such as a file handle or StringIO)
            The string could be a URL. Valid URL schemes include http, ftp, s3, and
            file. For file URLs, a host is expected. For instance, a local file could
            be file ://localhost/path/to/table.csv
        sep : str, default ','
            Delimiter to use. If sep is None, will try to automatically determine
            this. Separators longer than 1 character and different from ``'s+'`` will
            be interpreted as regular expressions, will force use of the python parsing
            engine and will ignore quotes in the data. Regex example: ``'
    	'``
        delimiter : str, default ``None``
            Alternative argument name for sep.
        delim_whitespace : boolean, default False
            Specifies whether or not whitespace (e.g. ``' '`` or ``'    '``) will be
            used as the sep. Equivalent to setting ``sep='s+'``. If this option
            is set to True, nothing should be passed in for the ``delimiter``
            parameter.
        
            .. versionadded:: 0.18.1 support for the Python parser.
        
        header : int or list of ints, default 'infer'
            Row number(s) to use as the column names, and the start of the data.
            Default behavior is as if set to 0 if no ``names`` passed, otherwise
            ``None``. Explicitly pass ``header=0`` to be able to replace existing
            names. The header can be a list of integers that specify row locations for
            a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not
            specified will be skipped (e.g. 2 in this example is skipped). Note that
            this parameter ignores commented lines and empty lines if
            ``skip_blank_lines=True``, so header=0 denotes the first line of data
            rather than the first line of the file.
        names : array-like, default None
            List of column names to use. If file contains no header row, then you
            should explicitly pass header=None. Duplicates in this list are not
            allowed unless mangle_dupe_cols=True, which is the default.
        index_col : int or sequence or False, default None
            Column to use as the row labels of the DataFrame. If a sequence is given, a
            MultiIndex is used. If you have a malformed file with delimiters at the end
            of each line, you might consider index_col=False to force pandas to _not_
            use the first column as the index (row names)
        usecols : array-like, default None
            Return a subset of the columns. All elements in this array must either
            be positional (i.e. integer indices into the document columns) or strings
            that correspond to column names provided either by the user in `names` or
            inferred from the document header row(s). For example, a valid `usecols`
            parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Using this parameter
            results in much faster parsing time and lower memory usage.
        as_recarray : boolean, default False
            DEPRECATED: this argument will be removed in a future version. Please call
            `pd.read_csv(...).to_records()` instead.
        
            Return a NumPy recarray instead of a DataFrame after parsing the data.
            If set to True, this option takes precedence over the `squeeze` parameter.
            In addition, as row indices are not available in such a format, the
            `index_col` parameter will be ignored.
        squeeze : boolean, default False
            If the parsed data only contains one column then return a Series
        prefix : str, default None
            Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
        mangle_dupe_cols : boolean, default True
            Duplicate columns will be specified as 'X.0'...'X.N', rather than
            'X'...'X'. Passing in False will cause data to be overwritten if there
            are duplicate names in the columns.
        dtype : Type name or dict of column -> type, default None
            Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
            (Unsupported with engine='python'). Use `str` or `object` to preserve and
            not interpret dtype.
        engine : {'c', 'python'}, optional
            Parser engine to use. The C engine is faster while the python engine is
            currently more feature-complete.
        converters : dict, default None
            Dict of functions for converting values in certain columns. Keys can either
            be integers or column labels
        true_values : list, default None
            Values to consider as True
        false_values : list, default None
            Values to consider as False
        skipinitialspace : boolean, default False
            Skip spaces after delimiter.
        skiprows : list-like or integer, default None
            Line numbers to skip (0-indexed) or number of lines to skip (int)
            at the start of the file
        skipfooter : int, default 0
            Number of lines at bottom of file to skip (Unsupported with engine='c')
        skip_footer : int, default 0
            DEPRECATED: use the `skipfooter` parameter instead, as they are identical
        nrows : int, default None
            Number of rows of file to read. Useful for reading pieces of large files
        na_values : scalar, str, list-like, or dict, default None
            Additional strings to recognize as NA/NaN. If dict passed, specific
            per-column NA values.  By default the following values are interpreted as
            NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
        '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'nan'`.
        keep_default_na : bool, default True
            If na_values are specified and keep_default_na is False the default NaN
            values are overridden, otherwise they're appended to.
        na_filter : boolean, default True
            Detect missing value markers (empty strings and the value of na_values). In
            data without any NAs, passing na_filter=False can improve the performance
            of reading a large file
        verbose : boolean, default False
            Indicate number of NA values placed in non-numeric columns
        skip_blank_lines : boolean, default True
            If True, skip over blank lines rather than interpreting as NaN values
        parse_dates : boolean or list of ints or names or list of lists or dict, default False
        
            * boolean. If True -> try parsing the index.
            * list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
              each as a separate date column.
            * list of lists. e.g.  If [[1, 3]] -> combine columns 1 and 3 and parse as
                a single date column.
            * dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
              'foo'
        
            Note: A fast-path exists for iso8601-formatted dates.
        infer_datetime_format : boolean, default False
            If True and parse_dates is enabled, pandas will attempt to infer the format
            of the datetime strings in the columns, and if it can be inferred, switch
            to a faster method of parsing them. In some cases this can increase the
            parsing speed by ~5-10x.
        keep_date_col : boolean, default False
            If True and parse_dates specifies combining multiple columns then
            keep the original columns.
        date_parser : function, default None
            Function to use for converting a sequence of string columns to an array of
            datetime instances. The default uses ``dateutil.parser.parser`` to do the
            conversion. Pandas will try to call date_parser in three different ways,
            advancing to the next if an exception occurs: 1) Pass one or more arrays
            (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the
            string values from the columns defined by parse_dates into a single array
            and pass that; and 3) call date_parser once for each row using one or more
            strings (corresponding to the columns defined by parse_dates) as arguments.
        dayfirst : boolean, default False
            DD/MM format dates, international and European format
        iterator : boolean, default False
            Return TextFileReader object for iteration or getting chunks with
            ``get_chunk()``.
        chunksize : int, default None
            Return TextFileReader object for iteration. `See IO Tools docs for more
            information
            <http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_ on
            ``iterator`` and ``chunksize``.
        compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
            For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
            bz2, zip or xz if filepath_or_buffer is a string ending in '.gz', '.bz2',
            '.zip', or 'xz', respectively, and no decompression otherwise. If using
            'zip', the ZIP file must contain only one data file to be read in.
            Set to None for no decompression.
        
            .. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.
        
        thousands : str, default None
            Thousands separator
        decimal : str, default '.'
            Character to recognize as decimal point (e.g. use ',' for European data).
        float_precision : string, default None
            Specifies which converter the C engine should use for floating-point
            values. The options are `None` for the ordinary converter,
            `high` for the high-precision converter, and `round_trip` for the
            round-trip converter.
        lineterminator : str (length 1), default None
            Character to break file into lines. Only valid with C parser.
        quotechar : str (length 1), optional
            The character used to denote the start and end of a quoted item. Quoted
            items can include the delimiter and it will be ignored.
        quoting : int or csv.QUOTE_* instance, default 0
            Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
            QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
        doublequote : boolean, default ``True``
           When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
           whether or not to interpret two consecutive quotechar elements INSIDE a
           field as a single ``quotechar`` element.
        escapechar : str (length 1), default None
            One-character string used to escape delimiter when quoting is QUOTE_NONE.
        comment : str, default None
            Indicates remainder of line should not be parsed. If found at the beginning
            of a line, the line will be ignored altogether. This parameter must be a
            single character. Like empty lines (as long as ``skip_blank_lines=True``),
            fully commented lines are ignored by the parameter `header` but not by
            `skiprows`. For example, if comment='#', parsing '#empty
    a,b,c
    1,2,3'
            with `header=0` will result in 'a,b,c' being
            treated as the header.
        encoding : str, default None
            Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
            standard encodings
            <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
        dialect : str or csv.Dialect instance, default None
            If None defaults to Excel dialect. Ignored if sep longer than 1 char
            See csv.Dialect documentation for more details
        tupleize_cols : boolean, default False
            Leave a list of tuples on columns as is (default is to convert to
            a Multi Index on the columns)
        error_bad_lines : boolean, default True
            Lines with too many fields (e.g. a csv line with too many commas) will by
            default cause an exception to be raised, and no DataFrame will be returned.
            If False, then these "bad lines" will dropped from the DataFrame that is
            returned. (Only valid with C parser)
        warn_bad_lines : boolean, default True
            If error_bad_lines is False, and warn_bad_lines is True, a warning for each
            "bad line" will be output. (Only valid with C parser).
        low_memory : boolean, default True
            Internally process the file in chunks, resulting in lower memory use
            while parsing, but possibly mixed type inference.  To ensure no mixed
            types either set False, or specify the type with the `dtype` parameter.
            Note that the entire file is read into a single DataFrame regardless,
            use the `chunksize` or `iterator` parameter to return the data in chunks.
            (Only valid with C parser)
        buffer_lines : int, default None
            DEPRECATED: this argument will be removed in a future version because its
            value is not respected by the parser
        compact_ints : boolean, default False
            DEPRECATED: this argument will be removed in a future version
        
            If compact_ints is True, then for any column that is of integer dtype,
            the parser will attempt to cast it as the smallest integer dtype possible,
            either signed or unsigned depending on the specification from the
            `use_unsigned` parameter.
        use_unsigned : boolean, default False
            DEPRECATED: this argument will be removed in a future version
        
            If integer columns are being compacted (i.e. `compact_ints=True`), specify
            whether the column should be compacted to the smallest signed or unsigned
            integer dtype.
        memory_map : boolean, default False
            If a filepath is provided for `filepath_or_buffer`, map the file object
            directly onto memory and access the data directly from there. Using this
            option can improve performance because there is no longer any I/O overhead.
        
        Returns
        -------
        result : DataFrame or TextParser
    
    close_px.info()
    
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
    Data columns (total 4 columns):
    AAPL    2214 non-null float64
    MSFT    2214 non-null float64
    XOM     2214 non-null float64
    SPX     2214 non-null float64
    dtypes: float64(4)
    memory usage: 86.5 KB
    
    close_px[-4:]
    
    AAPL MSFT XOM SPX
    2011-10-11 400.29 27.00 76.27 1195.54
    2011-10-12 402.19 26.96 77.16 1207.25
    2011-10-13 408.43 27.18 76.37 1203.66
    2011-10-14 422.00 27.27 78.11 1224.58
    help(pd.DataFrame.info)
    
    Help on function info in module pandas.core.frame:
    
    info(self, verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None)
        Concise summary of a DataFrame.
        
        Parameters
        ----------
        verbose : {None, True, False}, optional
            Whether to print the full summary.
            None follows the `display.max_info_columns` setting.
            True or False overrides the `display.max_info_columns` setting.
        buf : writable buffer, defaults to sys.stdout
        max_cols : int, default None
            Determines whether full summary or short summary is printed.
            None follows the `display.max_info_columns` setting.
        memory_usage : boolean/string, default None
            Specifies whether total memory usage of the DataFrame
            elements (including index) should be displayed. None follows
            the `display.memory_usage` setting. True or False overrides
            the `display.memory_usage` setting. A value of 'deep' is equivalent
            of True, with deep introspection. Memory usage is shown in
            human-readable units (base-2 representation).
        null_counts : boolean, default None
            Whether to show the non-null counts
        
            - If None, then only show if the frame is smaller than
              max_info_rows and max_info_columns.
            - If True, always show counts.
            - If False, never show counts.
    
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 8 entries, 0 to 7
    Data columns (total 3 columns):
    category    8 non-null object
    data        8 non-null float64
    weights     8 non-null float64
    dtypes: float64(2), object(1)
    memory usage: 272.0+ bytes
    

    One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns (computed from percent changes) with SPX. As one way to do this, we first create a function that computes the pairwise correlation of each column with the 'SPX' column:

    spx_corr=lambda x:x.corrwith(x['SPX'])
    
    help(pd.DataFrame.pct_change)
    
    Help on function pct_change in module pandas.core.generic:
    
    pct_change(self, periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
        Percent change over given number of periods.
        
        Parameters
        ----------
        periods : int, default 1
            Periods to shift for forming percent change
        fill_method : str, default 'pad'
            How to handle NAs before computing percent changes
        limit : int, default None
            The number of consecutive NAs to fill before stopping
        freq : DateOffset, timedelta, or offset alias string, optional
            Increment to use from time series API (e.g. 'M' or BDay())
        
        Returns
        -------
        chg : NDFrame
        
        Notes
        -----
        
        By default, the percentage change is calculated along the stat
        axis: 0, or ``Index``, for ``DataFrame`` and 1, or ``minor`` for
        ``Panel``. You can change this with the ``axis`` keyword argument.
    
    df=pd.DataFrame(np.arange(12).reshape((3,4)));df
    
    0 1 2 3
    0 0 1 2 3
    1 4 5 6 7
    2 8 9 10 11
    df.pct_change()
    
    0 1 2 3
    0 NaN NaN NaN NaN
    1 inf 4.0 2.000000 1.333333
    2 1.000000 0.8 0.666667 0.571429

    Above, pct_change computes the change of each row relative to the previous row, column by column:
    row 'two': (4-0)/0=inf, (5-1)/1=4, (6-2)/2=2, (7-3)/3=1.33333;
    row 'three': (8-4)/4=1, (9-5)/5=0.8, (10-6)/6=0.66667, (11-7)/7=0.57143.

    rets=close_px.pct_change().dropna()
    
    rets.head()
    
    AAPL MSFT XOM SPX
    2003-01-03 0.006757 0.001421 0.000684 -0.000484
    2003-01-06 0.000000 0.017975 0.024624 0.022474
    2003-01-07 -0.002685 0.019052 -0.033712 -0.006545
    2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
    2003-01-09 0.008242 0.029094 0.021159 0.019386

    Lastly, we group these percent changes by year, which can be extracted from each row label with a one-line function that returns the year attribute of each datetime label:

    get_year=lambda x:x.year
    
    by_year=rets.groupby(get_year) # get_year will be applied to rets.index
    
    list(by_year)
    
    [(2003,                 AAPL      MSFT       XOM       SPX
      2003-01-03  0.006757  0.001421  0.000684 -0.000484
      2003-01-06  0.000000  0.017975  0.024624  0.022474
      2003-01-07 -0.002685  0.019052 -0.033712 -0.006545
      2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
      2003-01-09  0.008242  0.029094  0.021159  0.019386
      2003-01-10  0.002725  0.001824 -0.013927  0.000000
      2003-01-13 -0.005435  0.008648 -0.004134 -0.001412
      2003-01-14 -0.002732  0.010379  0.008993  0.005830
      2003-01-15 -0.010959 -0.012506 -0.013713 -0.014426
      2003-01-16  0.012465 -0.016282  0.004519 -0.003942
      2003-01-17 -0.035568 -0.070345 -0.010381 -0.014017
      2003-01-21 -0.005674 -0.002473 -0.023077 -0.015702
      2003-01-22 -0.009986 -0.006445 -0.012885 -0.010432
      2003-01-23  0.021614  0.024950 -0.002175  0.010224
      2003-01-24 -0.026798 -0.046251 -0.021439 -0.029233
      2003-01-27  0.024638 -0.013783 -0.026736 -0.016160
      2003-01-28  0.031117 -0.007246  0.026326  0.013050
      2003-01-29  0.024691  0.022419  0.036431  0.006779
      2003-01-30 -0.041499 -0.033656 -0.018293 -0.022849
      2003-01-31  0.002793 -0.015831  0.027768  0.013130
      2003-02-03  0.020891  0.023056  0.013864  0.005399
      2003-02-04 -0.004093 -0.025681  0.000000 -0.014088
      2003-02-05 -0.010959 -0.007531 -0.014376 -0.005435
      2003-02-06  0.000000  0.009756 -0.008538 -0.006449
      2003-02-07 -0.020776 -0.017713 -0.007535 -0.010094
      2003-02-10  0.015559  0.017486  0.007592  0.007569
      2003-02-11  0.000000 -0.019871 -0.007176 -0.008098
      2003-02-12  0.002786  0.000000 -0.019877 -0.012687
      2003-02-13  0.009722  0.011507  0.012906 -0.001600
      2003-02-14  0.009629  0.028169  0.009465  0.021435
      ...              ...       ...       ...       ...
      2003-11-18 -0.034091  0.000000 -0.012604 -0.009084
      2003-11-19  0.000980  0.008024  0.008062  0.008016
      2003-11-20 -0.001959 -0.009950 -0.003665 -0.008432
      2003-11-21 -0.004907  0.000503  0.001003  0.001577
      2003-11-24  0.042406  0.024611  0.006014  0.016227
      2003-11-25 -0.021760 -0.012745  0.007307  0.001720
      2003-11-26  0.001934  0.001986  0.012859  0.004327
      2003-11-28  0.008687  0.009911 -0.002279 -0.000236
      2003-12-01  0.038278  0.005397  0.005546  0.011264
      2003-12-02 -0.007373 -0.007321  0.002920 -0.003271
      2003-12-03 -0.023213  0.000492  0.000971 -0.001772
      2003-12-04  0.004753  0.020639  0.013251  0.004687
      2003-12-05 -0.013245 -0.008185 -0.006380 -0.007684
      2003-12-08  0.008629  0.009709  0.016693  0.007348
      2003-12-09 -0.027567  0.005288  0.006315 -0.008529
      2003-12-10 -0.003910  0.008130  0.004079 -0.001066
      2003-12-11  0.040236  0.000949  0.003125  0.011482
      2003-12-12 -0.015094  0.001422  0.004050  0.002735
      2003-12-15 -0.033525  0.003313 -0.006826 -0.005679
      2003-12-16 -0.002973  0.011792  0.017807  0.006638
      2003-12-17 -0.011928 -0.000466  0.010436  0.001256
      2003-12-18  0.008048  0.013060  0.013670  0.011798
      2003-12-19 -0.016966 -0.001381  0.004795 -0.000477
      2003-12-22  0.008122 -0.006455 -0.003281  0.003931
      2003-12-23 -0.003021 -0.001392  0.007181  0.002818
      2003-12-24  0.030303 -0.003717  0.002080 -0.001807
      2003-12-26  0.018627  0.006063  0.005336  0.001691
      2003-12-29  0.017324  0.009272  0.013270  0.012401
      2003-12-30  0.006623  0.002297  0.002619  0.000144
      2003-12-31  0.004699 -0.005500  0.007837  0.002055
      
      [251 rows x 4 columns]),
     (2004,                 AAPL      MSFT       XOM       SPX
      2004-01-02 -0.004677  0.002765 -0.008929 -0.003094
      2004-01-05  0.042293  0.025276  0.023249  0.012395
      2004-01-06 -0.003607  0.003586 -0.006816  0.001292
      2004-01-07  0.022624 -0.001340 -0.007149  0.002367
      2004-01-08  0.033628 -0.001342 -0.002592  0.004963
      2004-01-09 -0.015411 -0.017913 -0.014727 -0.008888
      2004-01-12  0.031304 -0.003192  0.014068  0.004787
      2004-01-13  0.016863 -0.005032 -0.004335 -0.005332
      2004-01-14  0.003317  0.009655  0.000000  0.008295
      2004-01-15 -0.055372 -0.005920 -0.009869  0.001353
      2004-01-16 -0.006124  0.010078  0.005570  0.006872
      2004-01-20  0.000000  0.010431  0.009329 -0.000930
      2004-01-21 -0.004401  0.007181  0.013287  0.007772
      2004-01-22 -0.019452 -0.010250 -0.003421 -0.003207
      2004-01-23  0.017133  0.016659 -0.006007 -0.002089
      2004-01-26  0.020390  0.011072  0.012374  0.012106
      2004-01-27  0.001738 -0.018835 -0.008528 -0.009798
      2004-01-28 -0.023417 -0.019196 -0.009174 -0.013610
      2004-01-29  0.007105  0.007283  0.016204  0.004989
      2004-01-30 -0.005291 -0.009489 -0.016515 -0.002628
      2004-02-02 -0.010638 -0.009124 -0.002316  0.003651
      2004-02-03 -0.002688 -0.003683 -0.006965  0.000678
      2004-02-04 -0.020665 -0.010628 -0.001461 -0.008371
      2004-02-05  0.028440 -0.001868  0.000585  0.001838
      2004-02-06  0.012489  0.004679  0.003217  0.012555
      2004-02-09 -0.000881 -0.006521  0.011662 -0.002581
      2004-02-10  0.013228  0.004219  0.008934  0.005027
      2004-02-11  0.035683  0.004669  0.020280  0.010667
      2004-02-12 -0.003361 -0.006970 -0.000280 -0.004880
      2004-02-13 -0.030354 -0.013570  0.000000 -0.005468
      ...              ...       ...       ...       ...
      2004-11-18  0.008743 -0.003738  0.000922  0.001362
      2004-11-19 -0.003973 -0.007920  0.006680 -0.011161
      2004-11-22  0.112038 -0.007563  0.010526  0.005896
      2004-11-23 -0.000978 -0.004657  0.005661 -0.000255
      2004-11-24  0.045366  0.004254  0.005629  0.004095
      2004-11-26  0.007805 -0.001694  0.003135  0.000753
      2004-11-29  0.060099  0.006364 -0.005357 -0.003450
      2004-11-30 -0.020164  0.001686 -0.002469 -0.004030
      2004-12-01  0.011035  0.016414 -0.001800  0.014951
      2004-12-02 -0.038348 -0.005797 -0.019157 -0.000873
      2004-12-03 -0.038650  0.004998  0.001838  0.000706
      2004-12-06  0.049458  0.003730 -0.002523 -0.000772
      2004-12-07 -0.044086 -0.009496 -0.008738 -0.011073
      2004-12-08  0.006361  0.010838  0.006495  0.004877
      2004-12-09  0.011378 -0.004948  0.005992  0.005436
      2004-12-10  0.018125 -0.005387 -0.003436 -0.001043
      2004-12-13 -0.003683  0.006250  0.010345  0.008990
      2004-12-14  0.005853 -0.000828 -0.000228  0.003921
      2004-12-15 -0.000613 -0.004144 -0.002959  0.001945
      2004-12-16  0.020533  0.001665 -0.007989 -0.002082
      2004-12-17 -0.024324 -0.007478 -0.005983 -0.007488
      2004-12-20 -0.034780 -0.000419  0.025231  0.000377
      2004-12-21  0.015306  0.004606  0.012192  0.009040
      2004-12-22  0.001256 -0.003752 -0.007584  0.003418
      2004-12-23  0.004078  0.001674  0.013261  0.000463
      2004-12-27 -0.013433 -0.006266 -0.021074 -0.004305
      2004-12-28  0.016149  0.003783  0.003399  0.007154
      2004-12-29  0.004051 -0.001675  0.001581 -0.000074
      2004-12-30  0.005587 -0.005034 -0.002029  0.000082
      2004-12-31 -0.006173 -0.001686  0.004519 -0.001343
      
      [252 rows x 4 columns]),
     (2005,                 AAPL      MSFT       XOM       SPX
      2005-01-03 -0.017081  0.000845 -0.022717 -0.008119
      2005-01-04  0.010111  0.003797 -0.006904 -0.011671
      2005-01-05  0.008758 -0.002522 -0.005098 -0.003628
      2005-01-06  0.000930 -0.000843  0.012579  0.003506
      2005-01-07  0.072491 -0.002952 -0.006441 -0.001431
      2005-01-10 -0.004044  0.004653  0.003705  0.003423
      2005-01-11 -0.063805 -0.002526 -0.002537 -0.006100
      2005-01-12  0.013941  0.001688  0.014801  0.003981
      2005-01-13  0.066300 -0.018963 -0.000684 -0.008630
      2005-01-14  0.005731 -0.005584  0.010262  0.006005
      2005-01-18  0.006553  0.007775  0.008352  0.009675
      2005-01-19 -0.011039 -0.013288 -0.008731 -0.009490
      2005-01-20  0.008300 -0.004344 -0.006549 -0.007783
      2005-01-21  0.000284 -0.008290 -0.005456 -0.006415
      2005-01-24  0.003973  0.000880  0.013714 -0.003528
      2005-01-25  0.018372  0.013626  0.003833  0.004004
      2005-01-26  0.002498 -0.000434  0.007188  0.004844
      2005-01-27  0.005537  0.003905  0.001115  0.000409
      2005-01-28  0.018447  0.002593 -0.009356 -0.002716
      2005-01-31  0.039470  0.003879  0.006521  0.008460
      2005-02-01  0.008062  0.004294  0.032395  0.006891
      2005-02-02  0.027090  0.002565  0.012335  0.003178
      2005-02-03 -0.022859 -0.010661  0.010902 -0.002766
      2005-02-04  0.013368  0.005603  0.014168  0.011043
      2005-02-07  0.001268 -0.006429  0.001877 -0.001089
      2005-02-08  0.024829  0.003020  0.011863  0.000483
      2005-02-09 -0.026700 -0.006452 -0.004319 -0.008575
      2005-02-10 -0.004826 -0.000433  0.014460  0.004211
      2005-02-11  0.036243 -0.003032 -0.004072  0.006926
      2005-02-14  0.042118  0.001303  0.012676  0.000697
      ...              ...       ...       ...       ...
      2005-11-17 -0.006620  0.008441  0.003360  0.009414
      2005-11-18  0.000620  0.003587  0.015166  0.004401
      2005-11-21  0.006196  0.003177  0.019208  0.005271
      2005-11-22  0.024015 -0.008709  0.004950  0.005084
      2005-11-23  0.008870  0.000399  0.003599  0.003473
      2005-11-25  0.033229 -0.005988  0.003964  0.002086
      2005-11-28  0.004615 -0.000402 -0.022749 -0.008508
      2005-11-29 -0.022394 -0.002411 -0.006926  0.000016
      2005-11-30 -0.004112  0.000000 -0.005231 -0.006362
      2005-12-01  0.055736  0.007652  0.022785  0.012157
      2005-12-02  0.014385  0.004396 -0.004760  0.000324
      2005-12-05 -0.011152 -0.005969  0.007461 -0.002363
      2005-12-06  0.031050 -0.005604  0.003038  0.001276
      2005-12-07 -0.001350  0.002013 -0.011359 -0.005009
      2005-12-08  0.001758 -0.002009  0.006894 -0.001217
      2005-12-09  0.003375  0.000805 -0.015595  0.002811
      2005-12-12  0.007803 -0.009654  0.006182  0.000842
      2005-12-13  0.000934 -0.011373  0.003456  0.005554
      2005-12-14 -0.039611 -0.001643  0.013586  0.004190
      2005-12-15  0.002361 -0.006173 -0.006230 -0.001414
      2005-12-16 -0.014824 -0.000828 -0.024126 -0.002848
      2005-12-19  0.003797 -0.002487 -0.006035 -0.005839
      2005-12-20  0.010227  0.000831  0.003917 -0.000238
      2005-12-21  0.019276 -0.004566 -0.005657  0.002517
      2005-12-22  0.007075 -0.005421 -0.008633  0.004221
      2005-12-23 -0.009052  0.002096  0.000000  0.000426
      2005-12-27  0.011997 -0.006695 -0.021571 -0.009553
      2005-12-28 -0.008891 -0.002949  0.006675  0.001297
      2005-12-29 -0.028816 -0.004225 -0.000201 -0.002981
      2005-12-30  0.006158 -0.004667 -0.001206 -0.004887
      
      [252 rows x 4 columns]),
     (2006,                 AAPL      MSFT       XOM       SPX
      2006-01-03  0.039783  0.026428  0.041046  0.016430
      2006-01-04  0.002943  0.004568  0.001739  0.003673
      2006-01-05 -0.007870  0.000827 -0.005016  0.000016
      2006-01-06  0.025813 -0.002891  0.019779  0.009399
      2006-01-09 -0.003277 -0.002071 -0.000570  0.003656
      2006-01-10  0.063248  0.005396  0.007801 -0.000357
      2006-01-11  0.037596  0.010735  0.006796  0.003481
      2006-01-12  0.004648 -0.005310 -0.010501 -0.006274
      2006-01-13  0.015423  0.001643  0.022361  0.001205
      2006-01-17 -0.010282 -0.007380  0.009268 -0.003557
      2006-01-18 -0.026207 -0.005783 -0.013958 -0.003975
      2006-01-19 -0.041823  0.007063  0.013597  0.005564
      2006-01-20 -0.037323 -0.022690 -0.015803 -0.018326
      2006-01-23  0.020765 -0.002111  0.011016  0.001847
      2006-01-24 -0.020986 -0.002961 -0.003878  0.002405
      2006-01-25 -0.024198  0.004667 -0.012236 -0.001721
      2006-01-26 -0.025202  0.003801 -0.004317  0.007235
      2006-01-27 -0.004148  0.048801  0.022243  0.007764
      2006-01-30  0.041233  0.007621  0.029688  0.001145
      2006-01-31  0.006800  0.005175 -0.005731 -0.003976
      2006-02-01 -0.001192 -0.003960 -0.012608  0.001859
      2006-02-02 -0.044020 -0.012724  0.000000 -0.009061
      2006-02-03 -0.003467 -0.005236 -0.009121 -0.005359
      2006-02-06 -0.063326 -0.013360  0.009389  0.000783
      2006-02-07  0.004458 -0.008207 -0.022798 -0.008095
      2006-02-08  0.017899 -0.001241  0.002613  0.008663
      2006-02-09 -0.056096 -0.009114 -0.007818 -0.001478
      2006-02-10  0.036336  0.000836 -0.008068  0.002540
      2006-02-13 -0.038627 -0.011278  0.002837 -0.003260
      2006-02-14  0.045279  0.010139 -0.000943  0.010033
      ...              ...       ...       ...       ...
      2006-11-16  0.018560  0.012075 -0.028457  0.002284
      2006-11-17  0.002803 -0.002610  0.005645  0.001029
      2006-11-20  0.007222  0.016822 -0.008799 -0.000500
      2006-11-21  0.024633  0.001103  0.013162  0.001649
      2006-11-22  0.019300  0.000000 -0.006495  0.002338
      2006-11-24  0.014616 -0.005509 -0.007298 -0.003656
      2006-11-27 -0.022809 -0.009232  0.001225 -0.013555
      2006-11-28  0.025352 -0.002982  0.023405  0.003444
      2006-11-29 -0.000109  0.005981  0.025112  0.009202
      2006-11-30 -0.001525 -0.007061  0.010353  0.000822
      2006-12-01 -0.003709 -0.008234  0.005051 -0.002799
      2006-12-04 -0.002190  0.007170  0.007323  0.008885
      2006-12-05  0.001646 -0.006744  0.003849  0.004002
      2006-12-06 -0.015777 -0.004904 -0.022437 -0.001315
      2006-12-07 -0.031059 -0.004928 -0.007844 -0.003971
      2006-12-08  0.014017  0.019048 -0.002782  0.001812
      2006-12-11  0.005552  0.004860 -0.001909  0.002270
      2006-12-12 -0.029408 -0.003720  0.011768 -0.001047
      2006-12-13  0.033782  0.004108  0.014539  0.001169
      2006-12-14 -0.005615  0.017479  0.017770  0.008689
      2006-12-15 -0.009373  0.004020 -0.018164  0.001122
      2006-12-18 -0.025650 -0.009829 -0.023089 -0.003230
      2006-12-19  0.009828  0.003309  0.019524  0.002158
      2006-12-20 -0.017959  0.003298 -0.011807 -0.001417
      2006-12-21 -0.021944 -0.003652 -0.002768 -0.003674
      2006-12-22 -0.008444 -0.011364 -0.005991 -0.005316
      2006-12-26 -0.008394  0.011865  0.008379  0.004352
      2006-12-27  0.000123  0.001099  0.011953  0.007015
      2006-12-28 -0.007974 -0.001464  0.004898 -0.001479
      2006-12-29  0.049091 -0.004032 -0.009031 -0.004513
      
      [251 rows x 4 columns]),
     (2007,                 AAPL      MSFT       XOM       SPX
      2007-01-03 -0.012258  0.000000 -0.032981 -0.001199
      2007-01-04  0.022196 -0.001472 -0.018699  0.001228
      2007-01-05 -0.007121 -0.005898  0.007165 -0.006085
      2007-01-08  0.004938  0.010011 -0.008022  0.002220
      2007-01-09  0.083070  0.000734 -0.007782 -0.000517
      2007-01-10  0.047856 -0.009905 -0.015224  0.001940
      2007-01-11 -0.012371  0.035198 -0.000156  0.006340
      2007-01-12 -0.012317  0.016464  0.023583  0.004853
      2007-01-16  0.026210 -0.001408 -0.014037  0.000818
      2007-01-17 -0.022142 -0.002116  0.011452 -0.000894
      2007-01-18 -0.061927 -0.003180 -0.006885 -0.002971
      2007-01-19 -0.006399  0.003545  0.021876  0.002895
      2007-01-22 -0.019322 -0.012363 -0.008593 -0.005278
      2007-01-23 -0.012559  0.000358  0.021898  0.003542
      2007-01-24  0.011669  0.011441  0.005506  0.008501
      2007-01-25 -0.005190 -0.020502 -0.018499 -0.011270
      2007-01-26 -0.010087  0.005052  0.001206 -0.001208
      2007-01-29  0.006559 -0.002513 -0.005572 -0.001097
      2007-01-30 -0.004538 -0.001440  0.016356  0.005772
      2007-01-31  0.002104  0.012257 -0.004023  0.006593
      2007-02-01 -0.011548 -0.009615  0.013315  0.005354
      2007-02-02  0.000118 -0.012226  0.006053  0.001694
      2007-02-05 -0.009558 -0.018930  0.001761 -0.000967
      2007-02-06  0.002502 -0.003711 -0.002783  0.000698
      2007-02-07  0.023767 -0.004469 -0.004701  0.001395
      2007-02-08  0.000348 -0.003741  0.009004 -0.001179
      2007-02-09 -0.033767 -0.009763 -0.003218 -0.007077
      2007-02-12  0.019335 -0.001138 -0.008218 -0.003261
      2007-02-13 -0.002121  0.005695  0.011394  0.007597
      2007-02-14  0.007084  0.013590  0.002048  0.007644
      ...              ...       ...       ...       ...
      2007-11-16  0.012721  0.009634  0.007228  0.005230
      2007-11-19 -0.014664 -0.003817 -0.011661 -0.017460
      2007-11-20  0.029887  0.018519  0.044081  0.004486
      2007-11-21 -0.002310 -0.010345 -0.008817 -0.015927
      2007-11-23  0.018283 -0.003484  0.014282  0.016891
      2007-11-26  0.005830 -0.033376 -0.029521 -0.023239
      2007-11-27  0.013156  0.002631  0.008146  0.014930
      2007-11-28  0.030948  0.019351  0.017801  0.028560
      2007-11-29  0.022584 -0.003218  0.007690  0.000477
      2007-11-30 -0.011232  0.000323  0.006401  0.007770
      2007-12-03 -0.018439 -0.020329 -0.003425 -0.005887
      2007-12-04  0.005311 -0.004282 -0.008223 -0.006540
      2007-12-05  0.031645  0.042011  0.020418  0.015190
      2007-12-06  0.023989  0.011746  0.016857  0.015037
      2007-12-07  0.022901 -0.000628  0.000716 -0.001778
      2007-12-10 -0.000463  0.006593  0.005720  0.007510
      2007-12-11 -0.029195 -0.019027 -0.018960 -0.025271
      2007-12-12  0.012305  0.010811  0.018118  0.006050
      2007-12-13  0.005082  0.022020  0.008779  0.001224
      2007-12-14 -0.007507  0.002462 -0.016700 -0.013746
      2007-12-17 -0.031462 -0.026098 -0.014113 -0.015021
      2007-12-18 -0.007701  0.010088  0.016984  0.006280
      2007-12-19  0.000765  0.001561  0.000119 -0.001361
      2007-12-20  0.022335  0.020879  0.007634  0.004900
      2007-12-21  0.035789  0.015263  0.014205  0.016670
      2007-12-24  0.025218  0.014432  0.002451  0.008077
      2007-12-26  0.000755  0.000889  0.012225  0.000809
      2007-12-27 -0.001910 -0.017471 -0.011962 -0.014282
      2007-12-28  0.006345  0.004219  0.014203  0.001504
      2007-12-31 -0.008757 -0.014406 -0.013889 -0.006852
      
      [251 rows x 4 columns]),
     (2008,                 AAPL      MSFT       XOM       SPX
      2008-01-02 -0.016357 -0.010658 -0.001862 -0.014438
      2008-01-03  0.000462  0.004001  0.003382  0.000000
      2008-01-04 -0.076335 -0.027897 -0.018596 -0.024552
      2008-01-07 -0.013385  0.006623 -0.009356  0.003223
      2008-01-08 -0.035972 -0.033521 -0.012791 -0.018352
      2008-01-09  0.047591  0.029822  0.016711  0.013624
      2008-01-10 -0.007692 -0.003148  0.001072  0.007948
      2008-01-11 -0.029940 -0.012314 -0.014753 -0.013595
      2008-01-14  0.035266  0.014066  0.005796  0.010871
      2008-01-15 -0.054480 -0.011349 -0.019930 -0.024925
      2008-01-16 -0.055608 -0.022640 -0.027931 -0.005612
      2008-01-17  0.007830 -0.003589 -0.030246 -0.029093
      2008-01-18  0.002921 -0.002947  0.013905 -0.006045
      2008-01-22 -0.035449 -0.031856 -0.030890 -0.011085
      2008-01-23 -0.106464 -0.001018  0.012035  0.021442
      2008-01-24 -0.024951  0.041426  0.030580  0.010063
      2008-01-25 -0.041224 -0.009455 -0.023967 -0.015872
      2008-01-28  0.000000 -0.006583  0.014291  0.017548
      2008-01-29  0.011768 -0.003645 -0.000640  0.006160
      2008-01-30  0.004865 -0.012305  0.002307 -0.004764
      2008-01-31  0.024058  0.012458  0.004987  0.016772
      2008-02-01 -0.011894 -0.065846  0.002927  0.012237
      2008-02-04 -0.015701 -0.008544 -0.005963 -0.010463
      2008-02-05 -0.017395 -0.037343 -0.038928 -0.031995
      2008-02-06 -0.056895 -0.018650 -0.008234 -0.007624
      2008-02-07 -0.006230 -0.014063  0.009909  0.007886
      2008-02-08  0.034972  0.015420 -0.002254 -0.004204
      2008-02-11  0.031639 -0.012149  0.018472  0.005889
      2008-02-12 -0.035458  0.004612  0.013961  0.007266
      2008-02-13  0.036361  0.021806  0.013126  0.013604
      ...              ...       ...       ...       ...
      2008-11-18  0.020082  0.022185  0.040222  0.009838
      2008-11-19 -0.040262 -0.067824 -0.038106 -0.061156
      2008-11-20 -0.067215 -0.041327 -0.066851 -0.067123
      2008-11-21  0.025966  0.122647  0.106446  0.063248
      2008-11-24  0.125575  0.051379  0.039498  0.064723
      2008-11-25 -0.023131 -0.033951 -0.008685  0.006551
      2008-11-26  0.046256  0.025027  0.035455  0.035328
      2008-11-28 -0.024526 -0.013506 -0.009122  0.009643
      2008-12-01 -0.040358 -0.079516 -0.072849 -0.089295
      2008-12-02  0.039807  0.029176  0.044467  0.039941
      2008-12-03  0.037093  0.037799  0.016947  0.025836
      2008-12-04 -0.046820 -0.038565 -0.033735 -0.029308
      2008-12-05  0.028334  0.040111  0.004347  0.036499
      2008-12-08  0.060851  0.057311  0.039229  0.038387
      2008-12-09  0.003410 -0.019757 -0.017598 -0.023118
      2008-12-10 -0.018489  0.000517  0.023930  0.011894
      2008-12-11 -0.032685 -0.056302 -0.000668 -0.028524
      2008-12-12  0.034421 -0.004379  0.005345  0.007028
      2008-12-15 -0.035820 -0.016493 -0.006115 -0.012686
      2008-12-16  0.007177  0.055897  0.039856  0.051360
      2008-12-17 -0.065703 -0.022234 -0.025080 -0.009593
      2008-12-18  0.003028 -0.018408 -0.050000 -0.021163
      2008-12-19  0.006374 -0.009377 -0.025691  0.002937
      2008-12-22 -0.047333  0.003341 -0.001568 -0.018302
      2008-12-23  0.007464  0.004994  0.002570 -0.009717
      2008-12-24 -0.015513 -0.005522  0.009113  0.005781
      2008-12-26  0.009055 -0.002221  0.018485  0.005356
      2008-12-29  0.009323 -0.008904  0.010806 -0.003873
      2008-12-30 -0.003695  0.020213  0.007264  0.024407
      2008-12-31 -0.010893  0.004953  0.015784  0.014158
      
      [253 rows x 4 columns]),
     (2009,                 AAPL      MSFT       XOM       SPX
      2009-01-02  0.063269  0.046002  0.022773  0.031608
      2009-01-05  0.042204  0.009424 -0.000131 -0.004668
      2009-01-06 -0.016494  0.011411 -0.016374  0.007817
      2009-01-07 -0.021608 -0.060000 -0.025436 -0.030010
      2009-01-08  0.018569  0.031097  0.010659  0.003397
      2009-01-09 -0.022869 -0.029630 -0.019200 -0.021303
      2009-01-12 -0.021197 -0.002726 -0.013234 -0.022564
      2009-01-13 -0.010715  0.018043  0.018022  0.001758
      2009-01-14 -0.027135 -0.037057 -0.036229 -0.033460
      2009-01-15 -0.022852  0.007808  0.020789  0.001329
      2009-01-16 -0.012593  0.024903  0.018831  0.007562
      2009-01-20 -0.050164 -0.062635 -0.023275 -0.052816
      2009-01-21  0.059207  0.048963  0.038968  0.043491
      2009-01-22  0.066763 -0.117518 -0.012952 -0.015162
      2009-01-23  0.000000  0.005600 -0.002460  0.005378
      2009-01-26  0.014486  0.024752  0.007125  0.005553
      2009-01-27  0.012160  0.001812  0.004082  0.010926
      2009-01-28  0.038245  0.021700  0.004201  0.033558
      2009-01-29 -0.012739 -0.025369 -0.028336 -0.033120
      2009-01-30 -0.030860 -0.027845 -0.006805 -0.022789
      2009-02-02  0.015311  0.042964  0.002796 -0.000533
      2009-02-03  0.016064  0.037612  0.018544  0.015834
      2009-02-04  0.006130  0.006904 -0.005886 -0.007489
      2009-02-05  0.031106  0.022286  0.027403  0.016366
      2009-02-06  0.033796  0.032420  0.012063  0.026896
      2009-02-09  0.027978 -0.011370 -0.010727  0.001485
      2009-02-10 -0.045654 -0.032859 -0.042035 -0.049121
      2009-02-11 -0.010324  0.022084 -0.020402  0.007955
      2009-02-12  0.025305  0.002216  0.008559  0.001739
      2009-02-13 -0.001108 -0.008845 -0.008487 -0.009998
      ...              ...       ...       ...       ...
      2009-11-18 -0.005024  0.003814  0.003204 -0.000468
      2009-11-19 -0.026461 -0.011054 -0.008331 -0.013426
      2009-11-20 -0.002942 -0.005239 -0.003500 -0.003215
      2009-11-23  0.029812  0.010885  0.017704  0.013616
      2009-11-24 -0.006994 -0.001042  0.003590 -0.000533
      2009-11-25 -0.001223 -0.004172  0.006603  0.004504
      2009-11-27 -0.017631 -0.019204 -0.021047 -0.017233
      2009-11-30 -0.003390  0.006764  0.002792  0.003793
      2009-12-01 -0.014707  0.020156  0.012808  0.012075
      2009-12-02 -0.003757 -0.007626 -0.003162  0.000343
      2009-12-03  0.001274  0.001746 -0.010618 -0.008402
      2009-12-04 -0.016083  0.004881 -0.009895  0.005509
      2009-12-07 -0.022605 -0.006246 -0.006475 -0.002468
      2009-12-08  0.004869 -0.007332 -0.011051 -0.010252
      2009-12-09  0.041765  0.004573 -0.002292  0.003672
      2009-12-10 -0.006926  0.005602 -0.005313  0.005840
      2009-12-11 -0.008960 -0.000696  0.005919  0.003683
      2009-12-14  0.011866  0.008711 -0.043054  0.006959
      2009-12-15 -0.014265 -0.003109 -0.007499 -0.005547
      2009-12-16  0.004429  0.002772 -0.010728  0.001128
      2009-12-17 -0.016254 -0.016586 -0.003055 -0.011811
      2009-12-18  0.018607  0.025650 -0.000153  0.005830
      2009-12-21  0.014327  0.005139  0.004444  0.010504
      2009-12-22  0.010745  0.009884  0.000915  0.003564
      2009-12-23  0.008684  0.003375 -0.004572  0.002299
      2009-12-24  0.034339  0.002355  0.005818  0.005256
      2009-12-28  0.012294  0.005705  0.006241  0.001154
      2009-12-29 -0.011861  0.007007 -0.003480 -0.001401
      2009-12-30  0.012147 -0.013585 -0.001063  0.000195
      2009-12-31 -0.004300 -0.015788 -0.008511 -0.010050
      
      [252 rows x 4 columns]),
     (2010,                 AAPL      MSFT       XOM       SPX
      2010-01-04  0.015565  0.015700  0.014102  0.016043
      2010-01-05  0.001729  0.000336  0.003930  0.003116
      2010-01-06 -0.015906 -0.006382  0.008582  0.000546
      2010-01-07 -0.001849 -0.010142 -0.003135  0.004001
      2010-01-08  0.006648  0.006831 -0.003893  0.002882
      2010-01-11 -0.008822 -0.012890  0.011124  0.001747
      2010-01-12 -0.011375 -0.006529 -0.004906 -0.009381
      2010-01-13  0.014106  0.009339 -0.004034  0.008326
      2010-01-14 -0.005792  0.020219  0.000150  0.002427
      2010-01-15 -0.016712 -0.003359 -0.008250 -0.010823
      2010-01-19  0.044238  0.007752  0.002420  0.012500
      2010-01-20 -0.015392 -0.016388 -0.017954 -0.010598
      2010-01-21 -0.017286 -0.019041 -0.019511 -0.018945
      2010-01-22 -0.049599 -0.035009 -0.009088 -0.022141
      2010-01-25  0.026903  0.012572 -0.003795  0.004598
      2010-01-26  0.014133  0.006031  0.001111 -0.004203
      2010-01-27  0.009420  0.005994 -0.005708  0.004880
      2010-01-28 -0.041322 -0.017175 -0.008930 -0.011818
      2010-01-29 -0.036279 -0.033880 -0.008045 -0.009829
      2010-02-01  0.013902  0.008121  0.027088  0.014266
      2010-02-02  0.005803  0.001831  0.011845  0.012973
      2010-02-03  0.017206  0.006213 -0.005463 -0.005474
      2010-02-04 -0.036039 -0.027606 -0.028249 -0.031141
      2010-02-05  0.017756  0.006350  0.001292  0.002897
      2010-02-08 -0.006856 -0.010765 -0.000484 -0.008863
      2010-02-09  0.010664  0.010507  0.013232  0.013040
      2010-02-10 -0.005454 -0.000743 -0.005415 -0.002233
      2010-02-11  0.018194  0.004831  0.006085  0.009680
      2010-02-12  0.008607 -0.007027 -0.006844 -0.002745
      2010-02-16  0.015071  0.019739  0.022917  0.018001
      ...              ...       ...       ...       ...
      2010-11-18  0.026389  0.010359  0.018876  0.015357
      2010-11-19 -0.005512 -0.005915  0.003329  0.002540
      2010-11-22  0.021615  0.001587 -0.005049 -0.001575
      2010-11-23 -0.014775 -0.023762 -0.017254 -0.014284
      2010-11-24  0.019661  0.010142  0.013131  0.014923
      2010-11-26  0.000635 -0.004819 -0.009320 -0.007469
      2010-11-29  0.005937  0.002421  0.003234 -0.001379
      2010-11-30 -0.018052 -0.002013  0.001465 -0.006070
      2010-12-01  0.016873  0.031061  0.025457  0.021617
      2010-12-02  0.005531  0.032473  0.002140  0.012818
      2010-12-03 -0.002232  0.004926 -0.003986  0.002603
      2010-12-06  0.008537 -0.006787  0.001572 -0.001298
      2010-12-07 -0.006060  0.001139  0.002141  0.000515
      2010-12-08  0.008799  0.013652  0.005412  0.003702
      2010-12-09 -0.003894 -0.005612  0.002125  0.003843
      2010-12-10  0.002502  0.009406  0.002544  0.006002
      2010-12-13  0.003463 -0.003354  0.002538  0.000048
      2010-12-14 -0.004290  0.013837 -0.002531  0.000911
      2010-12-15  0.000219  0.008115 -0.004652 -0.005122
      2010-12-16  0.002778  0.005123  0.005241  0.006185
      2010-12-17 -0.001992 -0.003276 -0.000705  0.000837
      2010-12-20  0.004990 -0.003287  0.000705  0.002548
      2010-12-21  0.006176  0.009527  0.006904  0.006030
      2010-12-22  0.002961  0.004356  0.001120  0.003380
      2010-12-23 -0.004798  0.003975  0.005451 -0.001644
      2010-12-27  0.003337 -0.008279 -0.002641  0.000613
      2010-12-28  0.002433 -0.002178  0.005715  0.000771
      2010-12-29 -0.000553 -0.001455 -0.000693  0.001009
      2010-12-30 -0.005011 -0.004372 -0.000139 -0.001508
      2010-12-31 -0.003399  0.002195 -0.003329 -0.000191
      
      [252 rows x 4 columns]),
     (2011,                 AAPL      MSFT       XOM       SPX
      2011-01-03  0.021732  0.002556  0.019624  0.011315
      2011-01-04  0.005219  0.004006  0.004641 -0.001313
      2011-01-05  0.008180 -0.003264 -0.002717  0.005007
      2011-01-06 -0.000808  0.029476  0.006540 -0.002123
      2011-01-07  0.007161 -0.007777  0.005414 -0.001845
      2011-01-10  0.018833 -0.013181 -0.006058 -0.001376
      2011-01-11 -0.002365 -0.003971  0.007450  0.003725
      2011-01-12  0.008137  0.015585  0.011697  0.009008
      2011-01-13  0.003658 -0.012491  0.001728 -0.001711
      2011-01-14  0.008100  0.003975  0.014725  0.007385
      2011-01-18 -0.022469  0.012599  0.011243  0.001376
      2011-01-19 -0.005313 -0.006754 -0.006076 -0.010116
      2011-01-20 -0.018180 -0.004295 -0.006243 -0.001295
      2011-01-21 -0.017915 -0.011503  0.015838  0.002414
      2011-01-24  0.032842  0.012727 -0.004896  0.005836
      2011-01-25  0.011705  0.002513  0.001165  0.000263
      2011-01-26  0.007176  0.011819  0.012416  0.004221
      2011-01-27 -0.001861  0.002832  0.002683  0.002244
      2011-01-28 -0.020716 -0.038475 -0.011084 -0.017852
      2011-01-31  0.009580 -0.000734  0.021386  0.007663
      2011-02-01  0.016828  0.009184  0.039985  0.016694
      2011-02-02 -0.002058 -0.001820 -0.005943 -0.002723
      2011-02-03 -0.002556 -0.010212  0.000366  0.002354
      2011-02-04  0.008910  0.004422 -0.001829  0.002884
      2011-02-07  0.015527  0.015407  0.007698  0.006240
      2011-02-08  0.009435  0.002890 -0.006063  0.004185
      2011-02-09  0.008333 -0.011167 -0.005124 -0.002786
      2011-02-10 -0.010107 -0.016758  0.007848  0.000750
      2011-02-11  0.006515 -0.009263 -0.004623  0.005507
      2011-02-14  0.006529 -0.000374  0.025303  0.002385
      ...              ...       ...       ...       ...
      2011-09-02 -0.018319 -0.015643 -0.018370 -0.025282
      2011-09-06  0.015212 -0.011240 -0.013723 -0.007436
      2011-09-07  0.011034  0.019208  0.035137  0.028646
      2011-09-08  0.000547  0.008462 -0.011270 -0.010612
      2011-09-09 -0.017337 -0.018307 -0.024856 -0.026705
      2011-09-12  0.006517  0.005828  0.011688  0.006966
      2011-09-13  0.012318  0.005794 -0.002645  0.009120
      2011-09-14  0.012168  0.017665  0.013817  0.013480
      2011-09-15  0.009401  0.018491  0.018860  0.017187
      2011-09-16  0.019188  0.004817  0.007296  0.005707
      2011-09-19  0.027790  0.003319 -0.011402 -0.009803
      2011-09-20  0.004421 -0.008453  0.004206 -0.001661
      2011-09-21 -0.003168 -0.036694 -0.027564 -0.029390
      2011-09-22 -0.025040 -0.035783 -0.037932 -0.031883
      2011-09-23  0.006172  0.000000  0.001011  0.006082
      2011-09-26 -0.002795  0.015164  0.034771  0.023336
      2011-09-27 -0.009698  0.009041  0.016592  0.010688
      2011-09-28 -0.005635 -0.003506 -0.011521 -0.020691
      2011-09-29 -0.016221 -0.005082  0.025114  0.008114
      2011-09-30 -0.023683 -0.022004 -0.016919 -0.024974
      2011-10-03 -0.017623 -0.014464 -0.020377 -0.028451
      2011-10-04 -0.005606  0.033021  0.023612  0.022488
      2011-10-05  0.015436  0.021705  0.015378  0.017866
      2011-10-06 -0.002327  0.017381 -0.000811  0.018304
      2011-10-07 -0.020060 -0.003417 -0.004466 -0.008163
      2011-10-10  0.051406  0.026286  0.036977  0.034125
      2011-10-11  0.029526  0.002227 -0.000131  0.000544
      2011-10-12  0.004747 -0.001481  0.011669  0.009795
      2011-10-13  0.015515  0.008160 -0.010238 -0.002974
      2011-10-14  0.033225  0.003311  0.022784  0.017380
      
      [199 rows x 4 columns])]
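    Passing a function to groupby calls it once on each index value. An equivalent way (a minimal sketch) is to group directly by the years extracted from the DatetimeIndex:

    by_year_alt=rets.groupby(rets.index.year)  # same groups as rets.groupby(get_year)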
    
    rets.index[0].year
    
    2003
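    spx_corr was defined earlier in the document; judging from the output below (the SPX column is exactly 1.0), it correlates each column of a group with that group's SPX column, i.e. something like the following hypothetical reconstruction:

    # Hypothetical reconstruction of spx_corr: pairwise correlation of each column with SPX.
    spx_corr=lambda x:x.corrwith(x['SPX'])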
    
    by_year.apply(spx_corr)
    
    AAPL MSFT XOM SPX
    2003 0.541124 0.745174 0.661265 1.0
    2004 0.374283 0.588531 0.557742 1.0
    2005 0.467540 0.562374 0.631010 1.0
    2006 0.428267 0.406126 0.518514 1.0
    2007 0.508118 0.658770 0.786264 1.0
    2008 0.681434 0.804626 0.828303 1.0
    2009 0.707103 0.654902 0.797921 1.0
    2010 0.710105 0.730118 0.839057 1.0
    2011 0.691931 0.800996 0.859975 1.0

    We could also compute inter-column correlations. Here we compute the correlation between Apple and Microsoft over the whole sample; a per-year version is sketched after the output below.

    rets['AAPL'].corr(rets['MSFT'])
    
    0.44469678861732237
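    To compute this correlation year by year rather than over the whole sample, apply the same computation inside each yearly group (a sketch using by_year from above):

    # Annual correlation between AAPL and MSFT daily returns.
    by_year.apply(lambda g:g['AAPL'].corr(g['MSFT']))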
    
    df
    
    0 1 2 3
    0 0 1 2 3
    1 4 5 6 7
    2 8 9 10 11
    df[0].corr(df[1])
    
    1.0
    
    help(pd.Series.corr)
    
    Help on function corr in module pandas.core.series:
    
    corr(self, other, method='pearson', min_periods=None)
        Compute correlation with `other` Series, excluding missing values
        
        Parameters
        ----------
        other : Series
        method : {'pearson', 'kendall', 'spearman'}
            * pearson : standard correlation coefficient
            * kendall : Kendall Tau correlation coefficient
            * spearman : Spearman rank correlation
        min_periods : int, optional
            Minimum number of observations needed to have a valid result
        
        
        Returns
        -------
        correlation : float
    

    Example: Group-wise linear regression

    You can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or a scalar value. For example, I can define the following regress function (using the statsmodels econometrics library), which executes an ordinary least squares (OLS) regression on each chunk of data:

    import statsmodels.api as sm
    
    def regress(data,yvar,xvars):
        # Response vector and design matrix for this chunk of data.
        Y=data[yvar]
        X=data[xvars].copy()      # copy so the original DataFrame is not modified
        X['intercept']=1          # add a constant column for the intercept term
        result=sm.OLS(Y,X).fit()  # fit ordinary least squares
        return result.params      # return the estimated coefficients as a Series
    
    by_year.apply(regress,'AAPL',['SPX'])
    
    SPX intercept
    2003 1.195406 0.000710
    2004 1.363463 0.004201
    2005 1.766415 0.003246
    2006 1.645496 0.000080
    2007 1.198761 0.003438
    2008 0.968016 -0.001110
    2009 0.879103 0.002954
    2010 1.052608 0.001261
    2011 0.806605 0.001514
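    The SPX column above is the yearly slope (beta) of AAPL's daily returns on the S&P 500. The same function works for any other column; an illustrative call (not from the original text):

    by_year.apply(regress,'MSFT',['SPX'])  # yearly OLS of Microsoft's returns on the S&P 500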

    Pivot tables and cross-tabulation

    help(pd.DataFrame.pivot_table)
    
    Help on function pivot_table in module pandas.tools.pivot:
    
    pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
        Create a spreadsheet-style pivot table as a DataFrame. The levels in the
        pivot table will be stored in MultiIndex objects (hierarchical indexes) on
        the index and columns of the result DataFrame
        
        Parameters
        ----------
        data : DataFrame
        values : column to aggregate, optional
        index : column, Grouper, array, or list of the previous
            If an array is passed, it must be the same length as the data. The list
            can contain any of the other types (except list).
            Keys to group by on the pivot table index.  If an array is passed, it
            is being used as the same manner as column values.
        columns : column, Grouper, array, or list of the previous
            If an array is passed, it must be the same length as the data. The list
            can contain any of the other types (except list).
            Keys to group by on the pivot table column.  If an array is passed, it
            is being used as the same manner as column values.
        aggfunc : function or list of functions, default numpy.mean
            If list of functions passed, the resulting pivot table will have
            hierarchical columns whose top level are the function names (inferred
            from the function objects themselves)
        fill_value : scalar, default None
            Value to replace missing values with
        margins : boolean, default False
            Add all row / columns (e.g. for subtotal / grand totals)
        dropna : boolean, default True
            Do not include columns whose entries are all NaN
        margins_name : string, default 'All'
            Name of the row / column that will contain the totals
            when margins is True.
        
        Examples
        --------
        >>> df
           A   B   C      D
        0  foo one small  1
        1  foo one large  2
        2  foo one large  2
        3  foo two small  3
        4  foo two small  3
        5  bar one large  4
        6  bar one small  5
        7  bar two small  6
        8  bar two large  7
        
        >>> table = pivot_table(df, values='D', index=['A', 'B'],
        ...                     columns=['C'], aggfunc=np.sum)
        >>> table
                  small  large
        foo  one  1      4
             two  6      NaN
        bar  one  5      4
             two  6      7
        
        Returns
        -------
        table : DataFrame
    
    ex1=pd.read_csv('ex1.csv')
    
    ex1
    
    sex age grade province
    0 m 20 89 H
    1 f 23 90 B
    2 m 21 86 B
    3 m 22 56 H
    4 f 23 78 J
    5 f 22 90 B
    6 m 21 100 H
    7 m 20 98 J
    8 m 25 67 B
    9 f 23 89 J
    ex1.pivot_table(index='sex',aggfunc=[np.mean,np.sum,np.min,np.max],margins=True)
    
    mean sum amin amax
    age grade age grade age grade province age grade province
    sex
    f 22.75 86.750000 91.0 347.0 22 78 B 23 90 J
    m 21.50 82.666667 129.0 496.0 20 56 B 25 100 J
    All 22.00 84.300000 220.0 843.0 20 56 B 25 100 J
    ex1.pivot_table(index='sex')
    
    age grade
    sex
    f 22.75 86.750000
    m 21.50 82.666667

    Attention: mean and sum drop the non-numeric column 'province', whereas min and max keep it (they are also defined for strings).
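    To aggregate only the columns you actually want, pass them explicitly via values (a minimal sketch on the same ex1 data):

    # Restrict the aggregation to the numeric columns.
    ex1.pivot_table(values=['age','grade'],index='sex',aggfunc=[np.min,np.max])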

    df=pd.DataFrame(np.arange(4).reshape((2,2)),index=['a','b'],columns=['one','two']);df
    
    one two
    a 0 1
    b 2 3
    df.pivot_table(index='one',aggfunc=np.sum,margins=True)
    
    two
    one
    0 1.0
    2 3.0
    All 4.0
    tips.head()
    
    total_bill tip smoker day time size tip_pct
    0 16.99 1.01 No Sun Dinner 2 0.059447
    1 10.34 1.66 No Sun Dinner 3 0.160542
    2 21.01 3.50 No Sun Dinner 3 0.166587
    3 23.68 3.31 No Sun Dinner 2 0.139780
    4 24.59 3.61 No Sun Dinner 4 0.146808
    tips.pivot_table(index=['day','smoker'])
    
    size tip tip_pct total_bill
    day smoker
    Fri No 2.250000 2.812500 0.151650 18.420000
    Yes 2.066667 2.714000 0.174783 16.813333
    Sat No 2.555556 3.102889 0.158048 19.661778
    Yes 2.476190 2.875476 0.147906 21.276667
    Sun No 2.929825 3.167895 0.160113 20.506667
    Yes 2.578947 3.516842 0.187250 24.120000
    Thur No 2.488889 2.673778 0.160298 17.113111
    Yes 2.352941 3.030000 0.163863 19.190588
    tips.pivot_table(values=['tip_pct','size'],index=['time','day'],columns='smoker')
    
    tip_pct size
    smoker No Yes No Yes
    time day
    Dinner Fri 0.139622 0.165347 2.000000 2.222222
    Sat 0.158048 0.147906 2.555556 2.476190
    Sun 0.160113 0.187250 2.929825 2.578947
    Thur 0.159744 NaN 2.000000 NaN
    Lunch Fri 0.187735 0.188937 3.000000 1.833333
    Thur 0.160311 0.163863 2.500000 2.352941
    tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker') #The same with above
    
    tip_pct size
    smoker No Yes No Yes
    time day
    Dinner Fri 0.139622 0.165347 2.000000 2.222222
    Sat 0.158048 0.147906 2.555556 2.476190
    Sun 0.160113 0.187250 2.929825 2.578947
    Thur 0.159744 NaN 2.000000 NaN
    Lunch Fri 0.187735 0.188937 3.000000 1.833333
    Thur 0.160311 0.163863 2.500000 2.352941

    We could augment this table to include partial totals by passing margins=True. This has the effect of adding All row and column labels, whose values are the group statistics for all the data within a single tier:

    tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker',margins=True)
    
    tip_pct size
    smoker No Yes All No Yes All
    time day
    Dinner Fri 0.139622 0.165347 0.158916 2.000000 2.222222 2.166667
    Sat 0.158048 0.147906 0.153152 2.555556 2.476190 2.517241
    Sun 0.160113 0.187250 0.166897 2.929825 2.578947 2.842105
    Thur 0.159744 NaN 0.159744 2.000000 NaN 2.000000
    Lunch Fri 0.187735 0.188937 0.188765 3.000000 1.833333 2.000000
    Thur 0.160311 0.163863 0.161301 2.500000 2.352941 2.459016
    All 0.159328 0.163196 0.160803 2.668874 2.408602 2.569672

    Here, the All values are means that ignore the smoker versus non-smoker split (the All columns) and the two levels of grouping on the rows (the All row).
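    As a quick check (a sketch, assuming the tips DataFrame loaded earlier), the bottom-right All entry for tip_pct equals the overall mean:

    tips['tip_pct'].mean()  # about 0.1608, matching the All/All cell above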

    If some combinations are empty, you may wish to pass a fill_value:

    tips.pivot_table('tip_pct',index=['time','size','smoker'],columns='day',aggfunc='mean',fill_value=0)
    
    day Fri Sat Sun Thur
    time size smoker
    Dinner 1 No 0.000000 0.137931 0.000000 0.000000
    Yes 0.000000 0.325733 0.000000 0.000000
    2 No 0.139622 0.162705 0.168859 0.159744
    Yes 0.171297 0.148668 0.207893 0.000000
    3 No 0.000000 0.154661 0.152663 0.000000
    Yes 0.000000 0.144995 0.152660 0.000000
    4 No 0.000000 0.150096 0.148143 0.000000
    Yes 0.117750 0.124515 0.193370 0.000000
    5 No 0.000000 0.000000 0.206928 0.000000
    Yes 0.000000 0.106572 0.065660 0.000000
    6 No 0.000000 0.000000 0.103799 0.000000
    Lunch 1 No 0.000000 0.000000 0.000000 0.181728
    Yes 0.223776 0.000000 0.000000 0.000000
    2 No 0.000000 0.000000 0.000000 0.166005
    Yes 0.181969 0.000000 0.000000 0.158843
    3 No 0.187735 0.000000 0.000000 0.084246
    Yes 0.000000 0.000000 0.000000 0.204952
    4 No 0.000000 0.000000 0.000000 0.138919
    Yes 0.000000 0.000000 0.000000 0.155410
    5 No 0.000000 0.000000 0.000000 0.121389
    6 No 0.000000 0.000000 0.000000 0.173706
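    pivot_table is not limited to means; to count observations instead (a sketch on the same tips data), pass len or 'count' as the aggregation function:

    # Number of observations for each (time, smoker) and day, with margins.
    tips.pivot_table('tip_pct',index=['time','smoker'],columns='day',aggfunc=len,margins=True)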

    Cross-tabulation: Crosstab

    A cross-tabulation is a special case of a pivot table that computes group frequencies.

    data=pd.read_csv('ex2.csv')
    
    data
    
    Sample Nationality Handeness
    0 1 USA R
    1 2 JAPAN L
    2 3 USA R
    3 4 JAPAN R
    4 5 JAPAN L
    5 6 JAPAN R
    6 7 USA R
    7 8 USA L
    8 9 JAPAN R
    9 10 USA R
    pd.crosstab(data.Nationality,data.Handeness,margins=True)
    
    Handeness L R All
    Nationality
    JAPAN 2 3 5
    USA 1 4 5
    All 3 7 10
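    The same frequency table can be produced with pivot_table by counting a column that has no missing values (a minimal sketch using the Sample column):

    data.pivot_table('Sample',index='Nationality',columns='Handeness',aggfunc='count',margins=True)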
    pd.crosstab([tips.time,tips.day],tips.smoker,margins=True)
    
    smoker No Yes All
    time day
    Dinner Fri 3 9 12
    Sat 45 42 87
    Sun 57 19 76
    Thur 1 0 1
    Lunch Fri 1 6 7
    Thur 44 17 61
    All 151 93 244
    help(pd.crosstab)
    
    Help on function crosstab in module pandas.tools.pivot:
    
    crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)
        Compute a simple cross-tabulation of two (or more) factors. By default
        computes a frequency table of the factors unless an array of values and an
        aggregation function are passed
        
        Parameters
        ----------
        index : array-like, Series, or list of arrays/Series
            Values to group by in the rows
        columns : array-like, Series, or list of arrays/Series
            Values to group by in the columns
        values : array-like, optional
            Array of values to aggregate according to the factors.
            Requires `aggfunc` be specified.
        aggfunc : function, optional
            If specified, requires `values` be specified as well
        rownames : sequence, default None
            If passed, must match number of row arrays passed
        colnames : sequence, default None
            If passed, must match number of column arrays passed
        margins : boolean, default False
            Add row/column margins (subtotals)
        dropna : boolean, default True
            Do not include columns whose entries are all NaN
        normalize : boolean, {'all', 'index', 'columns'}, or {0,1}, default False
            Normalize by dividing all values by the sum of values.
        
            - If passed 'all' or `True`, will normalize over all values.
            - If passed 'index' will normalize over each row.
            - If passed 'columns' will normalize over each column.
            - If margins is `True`, will also normalize margin values.
        
            .. versionadded:: 0.18.1
        
        
        Notes
        -----
        Any Series passed will have their name attributes used unless row or column
        names for the cross-tabulation are specified.
        
        Any input passed containing Categorical data will have **all** of its
        categories included in the cross-tabulation, even if the actual data does
        not contain any instances of a particular category.
        
        In the event that there aren't overlapping indexes an empty DataFrame will
        be returned.
        
        Examples
        --------
        >>> a
        array([foo, foo, foo, foo, bar, bar,
               bar, bar, foo, foo, foo], dtype=object)
        >>> b
        array([one, one, one, two, one, one,
               one, two, two, two, one], dtype=object)
        >>> c
        array([dull, dull, shiny, dull, dull, shiny,
               shiny, dull, shiny, shiny, shiny], dtype=object)
        
        >>> crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
        b    one          two
        c    dull  shiny  dull  shiny
        a
        bar  1     2      1     0
        foo  2     2      1     2
        
        >>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
        >>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
        >>> crosstab(foo, bar)  # 'c' and 'f' are not represented in the data,
                                # but they still will be counted in the output
        col_0  d  e  f
        row_0
        a      1  0  0
        b      0  1  0
        c      0  0  0
        
        Returns
        -------
        crosstab : DataFrame
    
    
    