  • pandas

    import pandas as pd 
    import numpy as np
    

    Split-apply-combine

    • analogous to MapReduce in big-data processing

    The most general-purpose GroupBy method is apply, which is the subject of the rest of this section. As illustrated in Figure 10-2, apply splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces together.

    Returning to the tipping dataset from before, suppose you wanted to select the top five tip_pct values by group. First, write a function that selects the rows with the largest values in a particular column:

    tips = pd.read_csv('../examples/tips.csv')
    
    tips.head(2)
    
       total_bill   tip smoker  day    time  size
    0       16.99  1.01     No  Sun  Dinner     2
    1       10.34  1.66     No  Sun  Dinner     3
    tips['tip_pct'] = tips['tip'] / tips['total_bill']
    
    def top(df, n=5, column='tip_pct'):
        """Return the n rows with the largest values in the given column."""
        return df.sort_values(by=column)[-n:]
    
    top(tips, n=6)
    
         total_bill   tip smoker  day    time  size   tip_pct
    109       14.31  4.00    Yes  Sat  Dinner     2  0.279525
    183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
    232       11.61  3.39     No  Sat  Dinner     2  0.291990
    67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
    178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
    172        7.25  5.15    Yes  Sun  Dinner     2  0.710345

    Now, if we group by smoker, say, and call apply with this function, we get the following:

    "先按smoker分组, 然后组内调用top方法"
    tips.groupby('smoker').apply(top)
    
    '先按smoker分组, 然后组内调用top方法'
    
                total_bill   tip smoker   day    time  size   tip_pct
    smoker
    No     88        24.71  5.85     No  Thur   Lunch     2  0.236746
           185       20.69  5.00     No   Sun  Dinner     5  0.241663
           51        10.29  2.60     No   Sun  Dinner     2  0.252672
           149        7.51  2.00     No  Thur   Lunch     2  0.266312
           232       11.61  3.39     No   Sat  Dinner     2  0.291990
    Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
           183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
           67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
           178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
           172        7.25  5.15    Yes   Sun  Dinner     2  0.710345

    What has happened here? The top function is called on each row group from the DataFrame (each group is processed independently, much like an RDD partition in MapReduce), and then the results are glued together using pandas.concat, labeling the pieces with the group names. The result therefore has a hierarchical index whose inner level contains index values from the original DataFrame.
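
    Roughly speaking, the call above is equivalent to splitting manually and gluing the pieces back together with pandas.concat; this is only a sketch of the mechanics, not pandas' actual implementation:

    pieces = {name: top(group) for name, group in tips.groupby('smoker')}
    pd.concat(pieces)  # outer index level: group names; inner level: original row labels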

    If you pass a function to apply that takes other arguments or keywords, you can pass these after the function:

    tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
    
                     total_bill    tip smoker   day    time  size   tip_pct
    smoker day
    No     Fri  94        22.75   3.25     No   Fri  Dinner     2  0.142857
           Sat  212       48.33   9.00     No   Sat  Dinner     4  0.186220
           Sun  156       48.17   5.00     No   Sun  Dinner     6  0.103799
           Thur 142       41.19   5.00     No  Thur   Lunch     5  0.121389
    Yes    Fri  95        40.17   4.73    Yes   Fri  Dinner     4  0.117750
           Sat  170       50.81  10.00    Yes   Sat  Dinner     3  0.196812
           Sun  182       45.35   3.50    Yes   Sun  Dinner     3  0.077178
           Thur 197       43.11   5.00    Yes  Thur   Lunch     4  0.115982

    Beyond these basic usage mechanics, getting the most out of apply may require some creativity. What occurs inside the passed function is up to you; it only needs to return a pandas object or a scalar value. The rest of this chapter will mainly consist of examples showing you how to solve various problems using groupby.

    You can write any custom function: as long as it returns a pandas object (such as a DataFrame) or a scalar, the result can be fed into further groupby operations.
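
    For instance, here is a small sketch (not from the book) showing that the passed function may return a scalar or a Series; pandas assembles the results into a Series or a DataFrame accordingly:

    # a scalar per group -> Series indexed by group
    tips.groupby('smoker').apply(lambda g: g['tip_pct'].max())
    # a Series per group -> DataFrame with groups as rows
    tips.groupby('smoker').apply(lambda g: g['tip_pct'].agg(['mean', 'std']))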

    You may recall that I earlier called describe on a GroupBy object:

    result = tips.groupby('smoker')['tip_pct'].describe()
    
    result
    
            count      mean       std       min       25%       50%       75%       max
    smoker
    No      151.0  0.159328  0.039910  0.056797  0.136906  0.155625  0.185014  0.291990
    Yes      93.0  0.163196  0.085119  0.035638  0.106771  0.153846  0.195059  0.710345
    result.unstack('smoker')
    
           smoker
    count  No        151.000000
           Yes        93.000000
    mean   No          0.159328
           Yes         0.163196
    std    No          0.039910
           Yes         0.085119
    min    No          0.056797
           Yes         0.035638
    25%    No          0.136906
           Yes         0.106771
    50%    No          0.155625
           Yes         0.153846
    75%    No          0.185014
           Yes         0.195059
    max    No          0.291990
           Yes         0.710345
    dtype: float64
    

    Inside GroupBy, when you invoke a method like describe, it's actually just a shortcut for:

    grouped = tips.groupby('smoker')  # any GroupBy object
    f = lambda x: x.describe()
    
    grouped.apply(f)
    

    Suppressing the group keys

    • group_keys=False

    In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object. You can disable this by passing group_keys=False to groupby.

    tips.groupby('smoker', group_keys=False).apply(top)
    
    total_bill tip smoker day time size tip_pct
    88 24.71 5.85 No Thur Lunch 2 0.236746
    185 20.69 5.00 No Sun Dinner 5 0.241663
    51 10.29 2.60 No Sun Dinner 2 0.252672
    149 7.51 2.00 No Thur Lunch 2 0.266312
    232 11.61 3.39 No Sat Dinner 2 0.291990
    109 14.31 4.00 Yes Sat Dinner 2 0.279525
    183 23.17 6.50 Yes Sun Dinner 4 0.280535
    67 3.07 1.00 Yes Sat Dinner 1 0.325733
    178 9.60 4.00 Yes Sun Dinner 2 0.416667
    172 7.25 5.15 Yes Sun Dinner 2 0.710345

    Quantile and bucket analysis

    • cut, qcut

    As you may recall from Chapter 8, pandas has some tools, in particular cut and qcut, for slicing data up into buckets with bins of your choosing or by sample quantiles. Combining these functions with groupby makes it convenient to perform bucket or quantile analysis on a dataset. Consider a simple random dataset and an equal-length bucket categorization using cut:

    frame = pd.DataFrame({
        'data1': np.random.randn(1000),
        'data2': np.random.randn(1000)
    })
    
    quartiles = pd.cut(frame.data1, 4)
    
    quartiles[:10]
    
    0    (-1.672, 0.361]
    1    (-1.672, 0.361]
    2    (-1.672, 0.361]
    3    (-1.672, 0.361]
    4     (0.361, 2.395]
    5    (-1.672, 0.361]
    6    (-1.672, 0.361]
    7     (0.361, 2.395]
    8    (-1.672, 0.361]
    9     (0.361, 2.395]
    Name: data1, dtype: category
    Categories (4, interval[float64]): [(-3.714, -1.672] < (-1.672, 0.361] < (0.361, 2.395] < (2.395, 4.429]]
    

    The Categorical object returned by cut can be passed directly to groupby. So we could compute a set of statistics for the data2 column like so:

    def get_stats(group):
        # Summary statistics for one bucket of data2
        return {'min': group.min(), 'max': group.max(),
                'count': group.count(), 'mean': group.mean()}
    
    
    grouped = frame.data2.groupby(quartiles)
    
    grouped.apply(get_stats).unstack()
    
                      count        mean        min
    data1
    (-3.714, -1.672]     49     -0.2432   -2.16709
    (-1.672, 0.361]     601  -0.0253114   -2.90659
    (0.361, 2.395]      340    0.024466   -3.14779
    (2.395, 4.429]       10   -0.267874  -0.835444

    These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use qcut. I'll pass labels=False to just get quantile numbers:

    grouping = pd.qcut(frame.data1, 10, labels=False)
    
    grouped = frame.data2.groupby(grouping)
    
    grouped.apply(get_stats).unstack()
    
           count         mean       min
    data1
    0        100    -0.069347  -2.25593
    1        100   -0.0408363  -2.75307
    2        100    -0.212456  -2.88498
    3        100    0.0688246  -2.82311
    4        100    0.0401668  -2.69601
    5        100     -0.12863  -2.90659
    6        100     0.108924  -3.14779
    7        100    0.0391474   -1.8324
    8        100  -0.00849982  -2.19997
    9        100   -0.0121871  -2.40748

    Example: Filling missing values with group-specific values

    When cleaning up missing data, in some cases you will filter out data observations using dropna, but in others you may want to impute (fill in) the null (NA) values using a fixed value or some value derived from the data (for example, values predicted by a random forest). fillna is the right tool to use; for example, here I fill in the NA values with the mean:

    s = pd.Series(np.random.randn(6))
    
    s[::2] = np.nan  # set every other value to NaN
    
    s
    
    0         NaN
    1   -0.661528
    2         NaN
    3    0.144512
    4         NaN
    5    1.096004
    dtype: float64
    
    "用均值填充"
    s.fillna(s.mean())
    
    '用均值填充'
    
    0    0.192996
    1   -0.661528
    2    0.192996
    3    0.144512
    4    0.192996
    5    1.096004
    dtype: float64
    

    Suppose you need the fill value to vary by group. One way to do this is to group the data and use apply with a function that calls fillna on each data chunk. Here is some sample data on US states divided into eastern and western regions:

    states = ['Ohio', 'New York', 'Vermont', 'Florida',
        'Oregon', 'Nevada', 'California', 'Idaho']
    
    group_key = ['East'] * 4 + ['West'] * 4 
    
    data = pd.Series(np.random.randn(8), index=states)
    
    data
    
    Ohio          0.508352
    New York     -1.029373
    Vermont      -0.506223
    Florida      -0.128709
    Oregon        0.445320
    Nevada        2.064584
    California   -0.795793
    Idaho        -1.115522
    dtype: float64
    

    Note that the syntax ['East'] * 4 produces a list containing four copies of the elements in ['East']. Adding lists together concatenates them.
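
    For example:

    ['East'] * 4 + ['West'] * 4
    # ['East', 'East', 'East', 'East', 'West', 'West', 'West', 'West']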

    Let's set some values in the data to be missing:

    data[['Vermont', 'Nevada', 'Idaho']] = np.nan 
    
    data
    
    Ohio          0.508352
    New York     -1.029373
    Vermont            NaN
    Florida      -0.128709
    Oregon        0.445320
    Nevada             NaN
    California   -0.795793
    Idaho              NaN
    dtype: float64
    
    data.groupby(group_key).mean()  # missing values are ignored by default
    
    East   -0.216577
    West   -0.175236
    dtype: float64
    

    We can fill the NA values using the group means like so:

    fill_mean = lambda g: g.fillna(g.mean())
    
    data.groupby(group_key).apply(fill_mean)
    
    Ohio          0.508352
    New York     -1.029373
    Vermont      -0.216577
    Florida      -0.128709
    Oregon        0.445320
    Nevada       -0.175236
    California   -0.795793
    Idaho        -0.175236
    dtype: float64
    

    In another case, you might have predefined fill values in your code that vary by group. Since the groups have a name attribute set internally, we can use that:

    fill_values = {'East': 0.5, 'West': -1}
    
    fill_func = lambda g: g.fillna(fill_values[g.name])
    
    data.groupby(group_key).apply(fill_func)
    
    Ohio          0.508352
    New York     -1.029373
    Vermont       0.500000
    Florida      -0.128709
    Oregon        0.445320
    Nevada       -1.000000
    California   -0.795793
    Idaho        -1.000000
    dtype: float64
    

    Example: Random sampling

    Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the "draws"; here we use the sample method for Series.
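
    As a minimal sketch of sample itself (not part of the book's example): sampling is without replacement by default, and passing replace=True allows repeated draws.

    vals = pd.Series([5, -1, 7, 6, 3])
    vals.sample(n=3)                 # three distinct rows, no replacement
    vals.sample(n=10, replace=True)  # with replacement, so draws can repeat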

    To demonstrate, here's a way to construct a deck of English-style playing cards:

    # Hearts, Spades, Clubs, Diamonds
    
    suits = 'H S C D'.split()
    
    card_val = (list(range(1, 11)) + [10]*3) * 4
    
    base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
    
    cards = []
    for suit in suits:
        cards.extend(str(num) + suit for num in base_names)
        
    deck = pd.Series(card_val, index=cards)
    

    So now we have a Series of length 52 whose index contains card names, and whose values are the ones used in Blackjack and other games:

    deck[:13]
    
    AH      1
    2H      2
    3H      3
    4H      4
    5H      5
    6H      6
    7H      7
    8H      8
    9H      9
    10H    10
    JH     10
    KH     10
    QH     10
    dtype: int64
    

    Now, based on what I said before, drawing a hand of five cards from the deck could be written as:

    def draw(deck, n=5):
        return deck.sample(n)
    
    draw(deck)
    
    3H     3
    5C     5
    JD    10
    4H     4
    JH    10
    dtype: int64
    

    Suppose you wanted two random cards from each suit. Because the suit is the last character of each card name, we can group based on this and use apply:

    get_suit = lambda card: card[-1]  # last letter is suit
    
    deck.groupby(get_suit).apply(draw, n=2)
    
    C  3C      3
       8C      8
    D  4D      4
       7D      7
    H  4H      4
       3H      3
    S  2S      2
       10S    10
    dtype: int64
    

    Alternatively, we could write:

    deck.groupby(get_suit, group_keys=False).apply(draw, n=2)
    
    KC     10
    3C      3
    9D      9
    KD     10
    9H      9
    6H      6
    10S    10
    7S      7
    dtype: int64
    

    Example: Group weighted average and correlation

    Under the split-apply-combine paradigm of groupby, operations between columns in a DataFrame or between two Series, such as a group weighted average, are possible. As an example, take this dataset containing group keys, values, and some weights:

    df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',
        'b', 'b', 'b', 'b'],
        'data': np.random.randn(8),
        'weights': np.random.rand(8)})
    
    df
    
      category      data   weights
    0        a  0.434777  0.486455
    1        a -2.414575  0.374778
    2        a -0.682643  0.651142
    3        a  0.538472  0.238194
    4        b  1.001960  0.724147
    5        b -2.006634  0.770404
    6        b  0.162167  0.262188
    7        b  0.924946  0.723322

    The group weighted average by category would then be:

    grouped = df.groupby('category')
    get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
    grouped.apply(get_wavg)
    
    category
    a   -0.576765
    b   -0.043870
    dtype: float64
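
    As a sanity check (a sketch, not from the book), the same quantity can be computed by hand, since a weighted average within each group is sum(weights * data) / sum(weights):

    grouped.apply(lambda g: (g['data'] * g['weights']).sum() / g['weights'].sum())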
    

    As another example, consider a financial dataset originally obtained from Yahoo! Finance containing end-of-day prices for a few stocks and the S&P 500 index.

    close_px = pd.read_csv('../examples/stock_px_2.csv', 
                           parse_dates=True, index_col=0)
    
    close_px.info()
    
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
    Data columns (total 4 columns):
    AAPL    2214 non-null float64
    MSFT    2214 non-null float64
    XOM     2214 non-null float64
    SPX     2214 non-null float64
    dtypes: float64(4)
    memory usage: 86.5 KB
    
    close_px[-4:]  # the last four rows
    
                  AAPL   MSFT    XOM      SPX
    2011-10-11  400.29  27.00  76.27  1195.54
    2011-10-12  402.19  26.96  77.16  1207.25
    2011-10-13  408.43  27.18  76.37  1203.66
    2011-10-14  422.00  27.27  78.11  1224.58

    One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns with SPX. As one way to do this, we first create a function that computes the pairwise correlation of each column with the 'SPX' column:

    spx_corr = lambda x: x.corrwith(x['SPX'])
    

    Next, we compute percent change on close_px using pct_change:

    rets = close_px.pct_change().dropna()
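
    For reference (a small sketch, not from the book), pct_change computes (x_t / x_{t-1}) - 1 column by column, so assuming there are no missing prices the manual version below agrees up to floating-point noise:

    manual_rets = (close_px / close_px.shift(1) - 1).dropna()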
    

    Lastly, we group these percent changes by year, which can be extracted from each row label with a one-line function that returns the year attribute of each datetime label:

    get_year = lambda x: x.year  
    
    by_year = rets.groupby(get_year)  # a function can be used as the grouping key
    
    by_year.apply(spx_corr)
    
              AAPL      MSFT       XOM  SPX
    2003  0.541124  0.745174  0.661265  1.0
    2004  0.374283  0.588531  0.557742  1.0
    2005  0.467540  0.562374  0.631010  1.0
    2006  0.428267  0.406126  0.518514  1.0
    2007  0.508118  0.658770  0.786264  1.0
    2008  0.681434  0.804626  0.828303  1.0
    2009  0.707103  0.654902  0.797921  1.0
    2010  0.710105  0.730118  0.839057  1.0
    2011  0.691931  0.800996  0.859975  1.0
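
    An equivalent grouping, sketched here as an alternative (assuming rets keeps its DatetimeIndex), is to group directly by the year of the index instead of passing a function:

    rets.groupby(rets.index.year).apply(spx_corr)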

    You could also compute inter-column correlations. Here we compute the annual correlation between Apple and Microsoft:

    by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))
    
    2003    0.480868
    2004    0.259024
    2005    0.300093
    2006    0.161735
    2007    0.417738
    2008    0.611901
    2009    0.432738
    2010    0.571946
    2011    0.581987
    dtype: float64
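
    Going one step further (a sketch, not shown in the book), the full pairwise correlation matrix within each year can be computed the same way:

    by_year.apply(lambda g: g.corr())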
    

    Example: Group-wise linear regression

    In the same theme as the previous example, you can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or a scalar value.
    For example, I can define the following regress function, which executes an ordinary least squares (OLS) regression on each chunk of data:

    import statsmodels.api as sm
    
    def regress(data, yvar, xvars):
        """Ordinary least squares (OLS) regression of yvar on xvars."""
        Y = data[yvar]
        X = data[xvars]
        
        X['intercept'] = 1 
        result = sm.OLS(Y, X).fit()
        
        return result.params
    

    Now, to run a yearly linear regression of AAPL on SPX returns, execute:

    %time by_year.apply(regress, 'AAPL', ['SPX'])
    
    Wall time: 277 ms
    
               SPX  intercept
    2003  1.195406   0.000710
    2004  1.363463   0.004201
    2005  1.766415   0.003246
    2006  1.645496   0.000080
    2007  1.198761   0.003438
    2008  0.968016  -0.001110
    2009  0.879103   0.002954
    2010  1.052608   0.001261
    2011  0.806605   0.001514
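
    As a variation on regress (a sketch, not from the book), the passed function can return any pandas object, so it could just as well include a goodness-of-fit measure alongside the coefficients:

    def regress_with_r2(data, yvar, xvars):
        """OLS regression returning the coefficients plus R-squared."""
        Y = data[yvar]
        X = data[xvars].copy()
        X['intercept'] = 1
        result = sm.OLS(Y, X).fit()
        return pd.concat([result.params, pd.Series({'rsquared': result.rsquared})])

    by_year.apply(regress_with_r2, 'AAPL', ['SPX'])
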
  • Original post: https://www.cnblogs.com/chenjieyouge/p/12018895.html