zoukankan      html  css  js  c++  java
  • Advanced pandas

    Advanced pandas

    import numpy as np
    
    import pandas as pd
    

    Categorical Data

    This section introduces the pandas Categorical type.Using it will achieve better performance and memory use in some pandas operations.

    Background and motivation

    Frequently,a column in a table may contain repeated instances of a smaller set of distinct values. We have already seen functions like unique and value_counts,which enable us to extract the distinct values from an array and compute their freuencies respectively.

    values=pd.Series(['apple','orange','apple','apple']*2)
    
    values
    
    0     apple
    1    orange
    2     apple
    3     apple
    4     apple
    5    orange
    6     apple
    7     apple
    dtype: object
    
    pd.unique(values)
    
    array(['apple', 'orange'], dtype=object)
    
    pd.value_counts(values)
    
    apple     6
    orange    2
    dtype: int64
    

    Many data systems have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best pratice is to use so-called dimension tables containing the distinct values and storing the primary observations as integer keys referencing the dimension tatble.

    values=pd.Series([0,1,0,0]*2)
    
    dim=pd.Series(['apple','orange'])
    
    dim.take(values)
    
    0     apple
    1    orange
    0     apple
    0     apple
    0     apple
    1    orange
    0     apple
    0     apple
    dtype: object
    

    We can use the take method to restore the original Series of strings.

    This representation as integers is called the categorical or dictionary-encoded representation.The array of distinct values can be called the categories,dictionary or levels of the data.In this blog, we will use the terms categorical and categories.The integer values that reference the categories are called the category codes or simply codes.

    The categorical representation can yield significant performance improvements when you are doing analytics.You can also perform transformations on the categories while leaving the codes unmodified.Some example transformations that can be made at relatively low cost are:

    • Renaming categories.
    • Appending a new category without changing the order or position of the existing categories.

    Categorical type in pandas

    Pandas has a special Categorical type for holding data that uses the integer-based categorical representation or encoding.

    fruits=['apple','orange','apple','apple']*2
    
    N=len(fruits)
    
    df=pd.DataFrame({'fruit':fruits,
                    'basket_id':np.arange(N),
                    'count':np.random.randint(3,15,size=N),
                    'weight':np.random.uniform(0,4,size=N)},columns=['basket_id','fruit','count','weight'])
    
    df
    
    basket_id fruit count weight
    0 0 apple 3 3.224342
    1 1 orange 8 2.625838
    2 2 apple 4 1.285304
    3 3 apple 11 1.510722
    4 4 apple 4 1.560894
    5 5 orange 13 3.138222
    6 6 apple 12 3.994037
    7 7 apple 10 0.644615

    df['fruit'] is an array of Python string objects.We can convert it to categorical by calling:

    fruit_cat=df['fruit'].astype('category')
    
    fruit_cat
    
    0     apple
    1    orange
    2     apple
    3     apple
    4     apple
    5    orange
    6     apple
    7     apple
    Name: fruit, dtype: category
    Categories (2, object): [apple, orange]
    
    df['fruit']
    
    0     apple
    1    orange
    2     apple
    3     apple
    4     apple
    5    orange
    6     apple
    7     apple
    Name: fruit, dtype: object
    

    The values for fruit_cat are not a Numpy array,but an instance of pandas.Categorical:

    c=fruit_cat.values
    
    c
    
    [apple, orange, apple, apple, apple, orange, apple, apple]
    Categories (2, object): [apple, orange]
    
    type(c)
    
    pandas.core.arrays.categorical.Categorical
    
    df['fruit'].values  # df['fruit']'s  values are Numpy array.
    
    array(['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple',
           'apple'], dtype=object)
    
    df.values # df.values are also Numpy array
    
    array([[0, 'apple', 3, 3.224342140640482],
           [1, 'orange', 8, 2.625837611614019],
           [2, 'apple', 4, 1.2853036490436986],
           [3, 'apple', 11, 1.5107221991867759],
           [4, 'apple', 4, 1.5608944958834634],
           [5, 'orange', 13, 3.1382222188914577],
           [6, 'apple', 12, 3.9940366021092872],
           [7, 'apple', 10, 0.64461515110633]], dtype=object)
    

    The Categorical object has categories and codes attributes:

    c.categories
    
    Index(['apple', 'orange'], dtype='object')
    
    c.codes
    
    array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
    
    c.categories.take(c.codes)
    
    Index(['apple', 'orange', 'apple', 'apple', 'apple', 'orange', 'apple',
           'apple'],
          dtype='object')
    
    • You can also create pandas.Categorical directly from other types of Python sequence:
    my_categories=pd.Categorical(['foo','bar','baz','foo','bar'])
    
    my_categories
    
    [foo, bar, baz, foo, bar]
    Categories (3, object): [bar, baz, foo]
    
    • If you have obtained categorical encoded data from another source,you can use the alternative from_codes constructor:
    categories=['foo','bar','baz','baz']
    
    codes=[0,1,2,0,0,1,3]
    
    my_cat_2=pd.Categorical.from_codes(codes,categories)  # categories have to be unique
    
    ---------------------------------------------------------------------------
    
    ValueError                                Traceback (most recent call last)
    
    <ipython-input-29-5368c7cbd53a> in <module>()
    ----> 1 my_cat_2=pd.Categorical.from_codes(codes,categories)  # categories have to be unique
    
    
    D:Anacondalibsite-packagespandascorearrayscategorical.py in from_codes(cls, codes, categories, ordered)
        584                 "codes need to be convertible to an arrays of integers")
        585 
    --> 586         categories = CategoricalDtype.validate_categories(categories)
        587 
        588         if len(codes) and (codes.max() >= len(categories) or codes.min() < -1):
    
    
    D:Anacondalibsite-packagespandascoredtypesdtypes.py in validate_categories(categories, fastpath)
        322 
        323             if not categories.is_unique:
    --> 324                 raise ValueError('Categorical categories must be unique')
        325 
        326         if isinstance(categories, ABCCategoricalIndex):
    
    
    ValueError: Categorical categories must be unique
    
    categories=['foo','bar','baz','baxz']
    
    codes=[0,1,2,0,0,1,3]
    
    pd.Categorical.from_codes(codes,categories)
    
    [foo, bar, baz, foo, foo, bar, baxz]
    Categories (4, object): [foo, bar, baz, baxz]
    

    Unless explicitly specified,categorical conversion assume no specific ordering of the categories,so the categories array may be in a different order depending on the ordering of the input data.When using from_codes or any of the other constructors,you can indicate that the categories have a meaningful ordering.

    ordered_cat=pd.Categorical.from_codes(codes,categories,ordered=True)
    
    ordered_cat
    
    [foo, bar, baz, foo, foo, bar, baxz]
    Categories (4, object): [foo < bar < baz < baxz]
    
    ordered_cat.codes
    
    array([0, 1, 2, 0, 0, 1, 3], dtype=int8)
    
    ordered_cat.categories
    
    Index(['foo', 'bar', 'baz', 'baxz'], dtype='object')
    

    Unordered categorical instance can be made ordered with as_ordered:

    As a last note,categorical data need not be strings,a categorical array can consist of any immutable value types:

    pd.Categorical([(1,2),'a',3])
    
    [(1, 2), a, 3]
    Categories (3, object): [(1, 2), a, 3]
    

    Computations with Categorical

    Using Categorical in pandas compared with the non-encoded version(like an array of strings) generally behaves the same way.Some parts of pandas,like the groupby function,perform better when working with categories.There are also some functions that can utilize the ordered flag.

    Let's consider some random numeric data,and use the panda.qcut binning function.This return pandas.Categorical;We used pandas.cut ealier but glossed over the details of how categoricals work.

    np.random.seed(12345)
    
    draws=np.random.randn(1000)
    
    bins=pd.qcut(draws,4)
    
    bins
    
    [(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.9499999999999997, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
    Length: 1000
    Categories (4, interval[float64]): [(-2.9499999999999997, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]
    
    bins.value_counts()
    
    (-2.9499999999999997, -0.684]    250
    (-0.684, -0.0101]                250
    (-0.0101, 0.63]                  250
    (0.63, 3.928]                    250
    dtype: int64
    
    pd.value_counts(bins)
    
    (0.63, 3.928]                    250
    (-0.0101, 0.63]                  250
    (-0.684, -0.0101]                250
    (-2.9499999999999997, -0.684]    250
    dtype: int64
    
    type(bins)
    
    pandas.core.arrays.categorical.Categorical
    

    While useful,the exact sample quartiles may be less useful for producing a report than quartile names.We can achieve this with the labels argument to qcut:

    bins=pd.qcut(draws,4,labels=['Q1','Q2','Q3','Q4'])
    
    bins
    
    [Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
    Length: 1000
    Categories (4, object): [Q1 < Q2 < Q3 < Q4]
    
    bins.codes[:10]
    
    array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)
    
    bins.categories
    
    Index(['Q1', 'Q2', 'Q3', 'Q4'], dtype='object')
    

    The labeled bins categorical does not contain information about the bin edges in the data,so we can use groupby to extract some summary statistics:

    bins=pd.Series(bins,name='quartile')
    
    pd.Series(draws).groupby(bins).mean()
    
    quartile
    Q1   -1.215981
    Q2   -0.362423
    Q3    0.307840
    Q4    1.261160
    dtype: float64
    
    results=(pd.Series(draws).groupby(bins).agg(['count','min','max','mean']).reset_index())
    
    results
    
    quartile count min max mean
    0 Q1 250 -2.949343 -0.685484 -1.215981
    1 Q2 250 -0.683066 -0.010115 -0.362423
    2 Q3 250 -0.010032 0.628894 0.307840
    3 Q4 250 0.634238 3.927528 1.261160
    bins
    
    0      Q2
    1      Q3
    2      Q2
    3      Q2
    4      Q4
    5      Q4
    6      Q3
    7      Q3
    8      Q4
    9      Q4
    10     Q4
    11     Q1
    12     Q3
    13     Q3
    14     Q4
    15     Q4
    16     Q1
    17     Q2
    18     Q4
    19     Q2
    20     Q2
    21     Q3
    22     Q4
    23     Q1
    24     Q2
    25     Q3
    26     Q3
    27     Q3
    28     Q3
    29     Q4
           ..
    970    Q2
    971    Q1
    972    Q2
    973    Q4
    974    Q3
    975    Q1
    976    Q1
    977    Q2
    978    Q2
    979    Q3
    980    Q3
    981    Q1
    982    Q3
    983    Q4
    984    Q2
    985    Q4
    986    Q1
    987    Q4
    988    Q1
    989    Q3
    990    Q1
    991    Q4
    992    Q1
    993    Q4
    994    Q2
    995    Q3
    996    Q2
    997    Q1
    998    Q3
    999    Q4
    Name: quartile, Length: 1000, dtype: category
    Categories (4, object): [Q1 < Q2 < Q3 < Q4]
    

    Better performance with categoricals

    If you do a lot of analytics on a particular dataset,converting to categorical can yield substantial overall performance gains.A categorical version of a DataFrame column will often use significantly less memory,too.

    N=10000000
    
    draws=pd.Series(np.random.randn(N))
    
    labels=pd.Series(['foo','bar','baz','qux']*(N//4))
    
    12//5
    
    2
    
    12%5
    
    2
    
    categories=labels.astype('category')
    
    labels.memory_usage()
    
    80000080
    
    categories.memory_usage()
    
    10000272
    

    So we can see that,Categorical ueses less memory than Series.While the conversion to category is not free,of course,but it is a one-time cost:

    %time _=labels.astype('category')
    
    Wall time: 400 ms
    

    GroupBy operations can be significantly faster with categoricals because the underlying algoriths use the integer-based codes array instead of an array of strings.

    Categorical methods

    Series containing categorical data have several special methods similar to the Series.str specialized string methods.This also provides convenient access to the categories and codes.

    s=pd.Series(['a','b','c','d']*2)
    
    cat_s=s.astype('category')
    
    cat_s
    
    0    a
    1    b
    2    c
    3    d
    4    a
    5    b
    6    c
    7    d
    dtype: category
    Categories (4, object): [a, b, c, d]
    

    The special attribute cat provides access to categorical methods:

    cat_s.cat.codes
    
    0    0
    1    1
    2    2
    3    3
    4    0
    5    1
    6    2
    7    3
    dtype: int8
    
    cat_s.values.codes
    
    array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int8)
    
    cat_s.values.categories
    
    Index(['a', 'b', 'c', 'd'], dtype='object')
    
    cat_s.cat.categories
    
    Index(['a', 'b', 'c', 'd'], dtype='object')
    

    Suppose that we know the actual set of categories for this data extends beyond the four values observed in the data.We often use the set_categories method to change them:

    actual_categories=['a','b','c','d','e']
    
    cat_s2=cat_s.cat.set_categories(actual_categories)
    
    cat_s2
    
    0    a
    1    b
    2    c
    3    d
    4    a
    5    b
    6    c
    7    d
    dtype: category
    Categories (5, object): [a, b, c, d, e]
    

    While it appears that the data is unchanged,the new categories will be reflected in operations that use them.For example,value_counts respects the categories,if present:

    cat_s.value_counts()
    
    d    2
    c    2
    b    2
    a    2
    dtype: int64
    
    cat_s2.value_counts()
    
    d    2
    c    2
    b    2
    a    2
    e    0
    dtype: int64
    

    In large datasets,categoricals are often used as a convenient fool for memory savings and better performance.After you filter a large DataFrame or Series,many of the categories may not appear in the data.To help with this,we can use the remove_unused_categories method to trim unobserved categories:

    cat_s
    
    0    a
    1    b
    2    c
    3    d
    4    a
    5    b
    6    c
    7    d
    dtype: category
    Categories (4, object): [a, b, c, d]
    
    cat_s3=cat_s[cat_s.isin(['a','b'])]
    
    cat_s.isin(['a','b'])
    
    0     True
    1     True
    2    False
    3    False
    4     True
    5     True
    6    False
    7    False
    dtype: bool
    
    cat_s3
    
    0    a
    1    b
    4    a
    5    b
    dtype: category
    Categories (4, object): [a, b, c, d]
    
    cat_s[[True,False,False,False,False,False,True,False]]
    
    0    a
    6    c
    dtype: category
    Categories (4, object): [a, b, c, d]
    
    cat_s3.cat.remove_unused_categories()
    
    0    a
    1    b
    4    a
    5    b
    dtype: category
    Categories (2, object): [a, b]
    

    Categorical methods for Series in pandas:

    • add_categories---> Append new(unused) categories at end of existing categories
    • as_ordered-------->Make categories ordered
    • as_unordered----->Make categories unordered
    • remove_categories--> Remove categories,setting any removed values to null
    • remove_unused_categories-->Remove any category values which do not appear in the data
    • rename_categories---->Replace categories with indicated set of new category names;cannot change the number of categories
    • reorder_categories--->Behaves like rename_categories,but can also change the result to have ordered categories
    • set_categories------->Replcace the categories with the indicated set of new categories;can add or remove categories

    Creating dummy variables for modeling

    When you are using statistics or machine learning tools,you will often transform categorical data into dummy variables,also known as one-hot encoding.This involves creating a DataFrame with a column for each distinct category;these columns contain 1s for occurrences of a given category and 0 otherwise;

    cat_s=pd.Series(['a','b','c','d']*2,dtype='category')
    

    The pandas.get_dummies function converts this one-dimensional categorical data into a DataFrame containing the dummy variable:

    pd.get_dummies(cat_s)
    
    a b c d
    0 1 0 0 0
    1 0 1 0 0
    2 0 0 1 0
    3 0 0 0 1
    4 1 0 0 0
    5 0 1 0 0
    6 0 0 1 0
    7 0 0 0 1

    Advanced GroupBy use

    df
    
    time value
    0 2020-05-23 00:00:00 0
    1 2020-05-23 00:01:00 1
    2 2020-05-23 00:02:00 2
    3 2020-05-23 00:03:00 3
    4 2020-05-23 00:04:00 4
    5 2020-05-23 00:05:00 5
    6 2020-05-23 00:06:00 6
    7 2020-05-23 00:07:00 7
    8 2020-05-23 00:08:00 8
    9 2020-05-23 00:09:00 9
    10 2020-05-23 00:10:00 10
    11 2020-05-23 00:11:00 11
    12 2020-05-23 00:12:00 12
    13 2020-05-23 00:13:00 13
    14 2020-05-23 00:14:00 14
    df.groupby('fruit').max()
    
    ---------------------------------------------------------------------------
    
    KeyError                                  Traceback (most recent call last)
    
    <ipython-input-113-79d22643c2e2> in <module>()
    ----> 1 df.groupby('fruit').max()
    
    
    D:Anacondalibsite-packagespandascoregeneric.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, **kwargs)
       6663         return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
       6664                        sort=sort, group_keys=group_keys, squeeze=squeeze,
    -> 6665                        observed=observed, **kwargs)
       6666 
       6667     def asfreq(self, freq, method=None, how=None, normalize=False,
    
    
    D:Anacondalibsite-packagespandascoregroupbygroupby.py in groupby(obj, by, **kwds)
       2150         raise TypeError('invalid type: %s' % type(obj))
       2151 
    -> 2152     return klass(obj, by, **kwds)
       2153 
       2154 
    
    
    D:Anacondalibsite-packagespandascoregroupbygroupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, **kwargs)
        597                                                     sort=sort,
        598                                                     observed=observed,
    --> 599                                                     mutated=self.mutated)
        600 
        601         self.obj = obj
    
    
    D:Anacondalibsite-packagespandascoregroupbygroupby.py in _get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
       3289                 in_axis, name, level, gpr = False, None, gpr, None
       3290             else:
    -> 3291                 raise KeyError(gpr)
       3292         elif isinstance(gpr, Grouper) and gpr.key is not None:
       3293             # Add key to exclusions
    
    
    KeyError: 'fruit'
    

    Groupby transforms and 'unwrapped' groupbys

    There is a built-in method called transform,which is similar to apply but imposes more constraints on the kind of function you can use:

    • It can produce a scalar value to be broadcast to the shape of the group
    • It can produce an object of the same shape as the input group
    • It must not mutate its input
    df=pd.DataFrame({'key':['a','b','c']*4,'value':np.arange(12)})
    
    df
    
    key value
    0 a 0
    1 b 1
    2 c 2
    3 a 3
    4 b 4
    5 c 5
    6 a 6
    7 b 7
    8 c 8
    9 a 9
    10 b 10
    11 c 11
    g=df.groupby('key').value
    
    g
    
    <pandas.core.groupby.groupby.SeriesGroupBy object at 0x00000277954F00B8>
    
    g.apply(np.mean)
    
    key
    a    4.5
    b    5.5
    c    6.5
    Name: value, dtype: float64
    
    g.apply(lambda x:x.mean())
    
    key
    a    4.5
    b    5.5
    c    6.5
    Name: value, dtype: float64
    
    g.apply(lambda x:x.max()-x.min()) # x represents the subgroup in group.
    
    key
    a    9
    b    9
    c    9
    Name: value, dtype: int64
    

    Suppose instead we wanted to produce a Series of the same shape as df['value']but with values replaced by the average grouped by 'key',we can pass the function lambda x:x.mean() to transform:

    g.transform(lambda x:x.mean())
    
    0     4.5
    1     5.5
    2     6.5
    3     4.5
    4     5.5
    5     6.5
    6     4.5
    7     5.5
    8     6.5
    9     4.5
    10    5.5
    11    6.5
    Name: value, dtype: float64
    

    For built-in aggregation functions,we can pass a string alias as with the groupby agg method:

    g.transform('mean')
    
    0     4.5
    1     5.5
    2     6.5
    3     4.5
    4     5.5
    5     6.5
    6     4.5
    7     5.5
    8     6.5
    9     4.5
    10    5.5
    11    6.5
    Name: value, dtype: float64
    

    Like apply,transform works with functions that return Series,but the result must be the same size as the input.

    g.transform(lambda x:x*2)
    
    0      0
    1      2
    2      4
    3      6
    4      8
    5     10
    6     12
    7     14
    8     16
    9     18
    10    20
    11    22
    Name: value, dtype: int32
    
    def normalize(x):
        return (x-x.mean())/x.std()
    
    g.transform(normalize)
    
    0    -1.161895
    1    -1.161895
    2    -1.161895
    3    -0.387298
    4    -0.387298
    5    -0.387298
    6     0.387298
    7     0.387298
    8     0.387298
    9     1.161895
    10    1.161895
    11    1.161895
    Name: value, dtype: float64
    
    g.apply(normalize)
    
    0    -1.161895
    1    -1.161895
    2    -1.161895
    3    -0.387298
    4    -0.387298
    5    -0.387298
    6     0.387298
    7     0.387298
    8     0.387298
    9     1.161895
    10    1.161895
    11    1.161895
    Name: value, dtype: float64
    

    Built-in aggregate functions like 'mean' or 'sum' are often much faster than a general apply function.Thses also have a fast patst when used with transform.This allows us to perform a so-called unwrapped group operation:

    g.transform('mean')
    
    0     4.5
    1     5.5
    2     6.5
    3     4.5
    4     5.5
    5     6.5
    6     4.5
    7     5.5
    8     6.5
    9     4.5
    10    5.5
    11    6.5
    Name: value, dtype: float64
    
    (df['value']-g.transform('mean'))/g.transform('std')
    
    0    -1.161895
    1    -1.161895
    2    -1.161895
    3    -0.387298
    4    -0.387298
    5    -0.387298
    6     0.387298
    7     0.387298
    8     0.387298
    9     1.161895
    10    1.161895
    11    1.161895
    Name: value, dtype: float64
    

    Grouped time resampling

    For time series,the resamplemethod is semantically a group operation based on a time intervalization.

    N=15
    
    times=pd.date_range('2020--5-23 00:00',freq='1min',periods=N)
    
    df=pd.DataFrame({'time':times,
                    'value':np.arange(N)})
    
    df
    
    time value
    0 2020-05-23 00:00:00 0
    1 2020-05-23 00:01:00 1
    2 2020-05-23 00:02:00 2
    3 2020-05-23 00:03:00 3
    4 2020-05-23 00:04:00 4
    5 2020-05-23 00:05:00 5
    6 2020-05-23 00:06:00 6
    7 2020-05-23 00:07:00 7
    8 2020-05-23 00:08:00 8
    9 2020-05-23 00:09:00 9
    10 2020-05-23 00:10:00 10
    11 2020-05-23 00:11:00 11
    12 2020-05-23 00:12:00 12
    13 2020-05-23 00:13:00 13
    14 2020-05-23 00:14:00 14

    Here,we can index by 'time' and then resample:

    df.set_index('time').resample('5min').count()
    
    value
    time
    2020-05-23 00:00:00 5
    2020-05-23 00:05:00 5
    2020-05-23 00:10:00 5
    help(pd.DataFrame.resample)
    
    Help on function resample in module pandas.core.generic:
    
    resample(self, rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)
        Convenience method for frequency conversion and resampling of time
        series.  Object must have a datetime-like index (DatetimeIndex,
        PeriodIndex, or TimedeltaIndex), or pass datetime-like values
        to the on or level keyword.
        
        Parameters
        ----------
        rule : string
            the offset string or object representing target conversion
        axis : int, optional, default 0
        closed : {'right', 'left'}
            Which side of bin interval is closed. The default is 'left'
            for all frequency offsets except for 'M', 'A', 'Q', 'BM',
            'BA', 'BQ', and 'W' which all have a default of 'right'.
        label : {'right', 'left'}
            Which bin edge label to label bucket with. The default is 'left'
            for all frequency offsets except for 'M', 'A', 'Q', 'BM',
            'BA', 'BQ', and 'W' which all have a default of 'right'.
        convention : {'start', 'end', 's', 'e'}
            For PeriodIndex only, controls whether to use the start or end of
            `rule`
        kind: {'timestamp', 'period'}, optional
            Pass 'timestamp' to convert the resulting index to a
            ``DateTimeIndex`` or 'period' to convert it to a ``PeriodIndex``.
            By default the input representation is retained.
        loffset : timedelta
            Adjust the resampled time labels
        base : int, default 0
            For frequencies that evenly subdivide 1 day, the "origin" of the
            aggregated intervals. For example, for '5min' frequency, base could
            range from 0 through 4. Defaults to 0
        on : string, optional
            For a DataFrame, column to use instead of index for resampling.
            Column must be datetime-like.
        
            .. versionadded:: 0.19.0
        
        level : string or int, optional
            For a MultiIndex, level (name or number) to use for
            resampling.  Level must be datetime-like.
        
            .. versionadded:: 0.19.0
        
        Returns
        -------
        Resampler object
        
        Notes
        -----
        See the `user guide
        <http://pandas.pydata.org/pandas-docs/stable/timeseries.html#resampling>`_
        for more.
        
        To learn more about the offset strings, please see `this link
        <http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases>`__.
        
        Examples
        --------
        
        Start by creating a series with 9 one minute timestamps.
        
        >>> index = pd.date_range('1/1/2000', periods=9, freq='T')
        >>> series = pd.Series(range(9), index=index)
        >>> series
        2000-01-01 00:00:00    0
        2000-01-01 00:01:00    1
        2000-01-01 00:02:00    2
        2000-01-01 00:03:00    3
        2000-01-01 00:04:00    4
        2000-01-01 00:05:00    5
        2000-01-01 00:06:00    6
        2000-01-01 00:07:00    7
        2000-01-01 00:08:00    8
        Freq: T, dtype: int64
        
        Downsample the series into 3 minute bins and sum the values
        of the timestamps falling into a bin.
        
        >>> series.resample('3T').sum()
        2000-01-01 00:00:00     3
        2000-01-01 00:03:00    12
        2000-01-01 00:06:00    21
        Freq: 3T, dtype: int64
        
        Downsample the series into 3 minute bins as above, but label each
        bin using the right edge instead of the left. Please note that the
        value in the bucket used as the label is not included in the bucket,
        which it labels. For example, in the original series the
        bucket ``2000-01-01 00:03:00`` contains the value 3, but the summed
        value in the resampled bucket with the label ``2000-01-01 00:03:00``
        does not include 3 (if it did, the summed value would be 6, not 3).
        To include this value close the right side of the bin interval as
        illustrated in the example below this one.
        
        >>> series.resample('3T', label='right').sum()
        2000-01-01 00:03:00     3
        2000-01-01 00:06:00    12
        2000-01-01 00:09:00    21
        Freq: 3T, dtype: int64
        
        Downsample the series into 3 minute bins as above, but close the right
        side of the bin interval.
        
        >>> series.resample('3T', label='right', closed='right').sum()
        2000-01-01 00:00:00     0
        2000-01-01 00:03:00     6
        2000-01-01 00:06:00    15
        2000-01-01 00:09:00    15
        Freq: 3T, dtype: int64
        
        Upsample the series into 30 second bins.
        
        >>> series.resample('30S').asfreq()[0:5] #select first 5 rows
        2000-01-01 00:00:00   0.0
        2000-01-01 00:00:30   NaN
        2000-01-01 00:01:00   1.0
        2000-01-01 00:01:30   NaN
        2000-01-01 00:02:00   2.0
        Freq: 30S, dtype: float64
        
        Upsample the series into 30 second bins and fill the ``NaN``
        values using the ``pad`` method.
        
        >>> series.resample('30S').pad()[0:5]
        2000-01-01 00:00:00    0
        2000-01-01 00:00:30    0
        2000-01-01 00:01:00    1
        2000-01-01 00:01:30    1
        2000-01-01 00:02:00    2
        Freq: 30S, dtype: int64
        
        Upsample the series into 30 second bins and fill the
        ``NaN`` values using the ``bfill`` method.
        
        >>> series.resample('30S').bfill()[0:5]
        2000-01-01 00:00:00    0
        2000-01-01 00:00:30    1
        2000-01-01 00:01:00    1
        2000-01-01 00:01:30    2
        2000-01-01 00:02:00    2
        Freq: 30S, dtype: int64
        
        Pass a custom function via ``apply``
        
        >>> def custom_resampler(array_like):
        ...     return np.sum(array_like)+5
        
        >>> series.resample('3T').apply(custom_resampler)
        2000-01-01 00:00:00     8
        2000-01-01 00:03:00    17
        2000-01-01 00:06:00    26
        Freq: 3T, dtype: int64
        
        For a Series with a PeriodIndex, the keyword `convention` can be
        used to control whether to use the start or end of `rule`.
        
        >>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
                                                        freq='A',
                                                        periods=2))
        >>> s
        2012    1
        2013    2
        Freq: A-DEC, dtype: int64
        
        Resample by month using 'start' `convention`. Values are assigned to
        the first month of the period.
        
        >>> s.resample('M', convention='start').asfreq().head()
        2012-01    1.0
        2012-02    NaN
        2012-03    NaN
        2012-04    NaN
        2012-05    NaN
        Freq: M, dtype: float64
        
        Resample by month using 'end' `convention`. Values are assigned to
        the last month of the period.
        
        >>> s.resample('M', convention='end').asfreq()
        2012-12    1.0
        2013-01    NaN
        2013-02    NaN
        2013-03    NaN
        2013-04    NaN
        2013-05    NaN
        2013-06    NaN
        2013-07    NaN
        2013-08    NaN
        2013-09    NaN
        2013-10    NaN
        2013-11    NaN
        2013-12    2.0
        Freq: M, dtype: float64
        
        For DataFrame objects, the keyword ``on`` can be used to specify the
        column instead of the index for resampling.
        
        >>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])
        >>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')
        >>> df.resample('3T', on='time').sum()
                             a  b  c  d
        time
        2000-01-01 00:00:00  0  3  6  9
        2000-01-01 00:03:00  0  3  6  9
        2000-01-01 00:06:00  0  3  6  9
        
        For a DataFrame with MultiIndex, the keyword ``level`` can be used to
        specify on level the resampling needs to take place.
        
        >>> time = pd.date_range('1/1/2000', periods=5, freq='T')
        >>> df2 = pd.DataFrame(data=10*[range(4)],
                               columns=['a', 'b', 'c', 'd'],
                               index=pd.MultiIndex.from_product([time, [1, 2]])
                               )
        >>> df2.resample('3T', level=0).sum()
                             a  b   c   d
        2000-01-01 00:00:00  0  6  12  18
        2000-01-01 00:03:00  0  4   8  12
        
        See also
        --------
        groupby : Group by mapping, function, label, or list of labels.
    
    help(np.tile)
    
    Help on function tile in module numpy.lib.shape_base:
    
    tile(A, reps)
        Construct an array by repeating A the number of times given by reps.
        
        If `reps` has length ``d``, the result will have dimension of
        ``max(d, A.ndim)``.
        
        If ``A.ndim < d``, `A` is promoted to be d-dimensional by prepending new
        axes. So a shape (3,) array is promoted to (1, 3) for 2-D replication,
        or shape (1, 1, 3) for 3-D replication. If this is not the desired
        behavior, promote `A` to d-dimensions manually before calling this
        function.
        
        If ``A.ndim > d``, `reps` is promoted to `A`.ndim by pre-pending 1's to it.
        Thus for an `A` of shape (2, 3, 4, 5), a `reps` of (2, 2) is treated as
        (1, 1, 2, 2).
        
        Note : Although tile may be used for broadcasting, it is strongly
        recommended to use numpy's broadcasting operations and functions.
        
        Parameters
        ----------
        A : array_like
            The input array.
        reps : array_like
            The number of repetitions of `A` along each axis.
        
        Returns
        -------
        c : ndarray
            The tiled output array.
        
        See Also
        --------
        repeat : Repeat elements of an array.
        broadcast_to : Broadcast an array to a new shape
        
        Examples
        --------
        >>> a = np.array([0, 1, 2])
        >>> np.tile(a, 2)
        array([0, 1, 2, 0, 1, 2])
        >>> np.tile(a, (2, 2))
        array([[0, 1, 2, 0, 1, 2],
               [0, 1, 2, 0, 1, 2]])
        >>> np.tile(a, (2, 1, 2))
        array([[[0, 1, 2, 0, 1, 2]],
               [[0, 1, 2, 0, 1, 2]]])
        
        >>> b = np.array([[1, 2], [3, 4]])
        >>> np.tile(b, 2)
        array([[1, 2, 1, 2],
               [3, 4, 3, 4]])
        >>> np.tile(b, (2, 1))
        array([[1, 2],
               [3, 4],
               [1, 2],
               [3, 4]])
        
        >>> c = np.array([1,2,3,4])
        >>> np.tile(c,(4,1))
        array([[1, 2, 3, 4],
               [1, 2, 3, 4],
               [1, 2, 3, 4],
               [1, 2, 3, 4]])
    
    df2=pd.DataFrame({'time':times.repeat(3),
                     'key':np.tile(['a','b','c'],N),
                     'value':np.arange(N*3)})
    
    df2
    
    time key value
    0 2020-05-23 00:00:00 a 0
    1 2020-05-23 00:00:00 b 1
    2 2020-05-23 00:00:00 c 2
    3 2020-05-23 00:01:00 a 3
    4 2020-05-23 00:01:00 b 4
    5 2020-05-23 00:01:00 c 5
    6 2020-05-23 00:02:00 a 6
    7 2020-05-23 00:02:00 b 7
    8 2020-05-23 00:02:00 c 8
    9 2020-05-23 00:03:00 a 9
    10 2020-05-23 00:03:00 b 10
    11 2020-05-23 00:03:00 c 11
    12 2020-05-23 00:04:00 a 12
    13 2020-05-23 00:04:00 b 13
    14 2020-05-23 00:04:00 c 14
    15 2020-05-23 00:05:00 a 15
    16 2020-05-23 00:05:00 b 16
    17 2020-05-23 00:05:00 c 17
    18 2020-05-23 00:06:00 a 18
    19 2020-05-23 00:06:00 b 19
    20 2020-05-23 00:06:00 c 20
    21 2020-05-23 00:07:00 a 21
    22 2020-05-23 00:07:00 b 22
    23 2020-05-23 00:07:00 c 23
    24 2020-05-23 00:08:00 a 24
    25 2020-05-23 00:08:00 b 25
    26 2020-05-23 00:08:00 c 26
    27 2020-05-23 00:09:00 a 27
    28 2020-05-23 00:09:00 b 28
    29 2020-05-23 00:09:00 c 29
    30 2020-05-23 00:10:00 a 30
    31 2020-05-23 00:10:00 b 31
    32 2020-05-23 00:10:00 c 32
    33 2020-05-23 00:11:00 a 33
    34 2020-05-23 00:11:00 b 34
    35 2020-05-23 00:11:00 c 35
    36 2020-05-23 00:12:00 a 36
    37 2020-05-23 00:12:00 b 37
    38 2020-05-23 00:12:00 c 38
    39 2020-05-23 00:13:00 a 39
    40 2020-05-23 00:13:00 b 40
    41 2020-05-23 00:13:00 c 41
    42 2020-05-23 00:14:00 a 42
    43 2020-05-23 00:14:00 b 43
    44 2020-05-23 00:14:00 c 44

    To do the same resampling for each value of 'key' for each value of 'key',we introduce the pandas.TimeGrouper object:

    resample=df2.groupby(['key','time']).sum()
    
    resample
    
    value
    key time
    a 2020-05-23 00:00:00 0
    2020-05-23 00:01:00 3
    2020-05-23 00:02:00 6
    2020-05-23 00:03:00 9
    2020-05-23 00:04:00 12
    2020-05-23 00:05:00 15
    2020-05-23 00:06:00 18
    2020-05-23 00:07:00 21
    2020-05-23 00:08:00 24
    2020-05-23 00:09:00 27
    2020-05-23 00:10:00 30
    2020-05-23 00:11:00 33
    2020-05-23 00:12:00 36
    2020-05-23 00:13:00 39
    2020-05-23 00:14:00 42
    b 2020-05-23 00:00:00 1
    2020-05-23 00:01:00 4
    2020-05-23 00:02:00 7
    2020-05-23 00:03:00 10
    2020-05-23 00:04:00 13
    2020-05-23 00:05:00 16
    2020-05-23 00:06:00 19
    2020-05-23 00:07:00 22
    2020-05-23 00:08:00 25
    2020-05-23 00:09:00 28
    2020-05-23 00:10:00 31
    2020-05-23 00:11:00 34
    2020-05-23 00:12:00 37
    2020-05-23 00:13:00 40
    2020-05-23 00:14:00 43
    c 2020-05-23 00:00:00 2
    2020-05-23 00:01:00 5
    2020-05-23 00:02:00 8
    2020-05-23 00:03:00 11
    2020-05-23 00:04:00 14
    2020-05-23 00:05:00 17
    2020-05-23 00:06:00 20
    2020-05-23 00:07:00 23
    2020-05-23 00:08:00 26
    2020-05-23 00:09:00 29
    2020-05-23 00:10:00 32
    2020-05-23 00:11:00 35
    2020-05-23 00:12:00 38
    2020-05-23 00:13:00 41
    2020-05-23 00:14:00 44
    time_key=pd.TimeGrouper('5min')
    
    D:Anacondalibsite-packagesipykernel_launcher.py:1: FutureWarning: pd.TimeGrouper is deprecated and will be removed; Please use pd.Grouper(freq=...)
      """Entry point for launching an IPython kernel.
    
    resample=(df2.set_index('time').groupby(['key',time_key]).sum())
    
    resample
    
    value
    key time
    a 2020-05-23 00:00:00 30
    2020-05-23 00:05:00 105
    2020-05-23 00:10:00 180
    b 2020-05-23 00:00:00 35
    2020-05-23 00:05:00 110
    2020-05-23 00:10:00 185
    c 2020-05-23 00:00:00 40
    2020-05-23 00:05:00 115
    2020-05-23 00:10:00 190

    Techniques for method chainning

    When applying a sequence of transformations to a dataset,you may find yourself creatingf numerous temporary variables that are never used in your analysis.


    pseudocode here:

    df=load_data()

    df2=df[df['col2']<0]

    df2['col1_demeaned']=df2['col1']-df2['col1'].mean()

    result=df2.groupby('key').col1_demeaned.std()


    While we are not using any real data here,this example highlights some new methods.First,the DataFrame.assign method is a functional alternative to column assignments of the form df[k]=v.Rather than modifying the object in-place,it returns a new DataFrame with the indicated modifications.So these statements are equivalent:

    Usual non-functional way

    df2=df.copy()
    df2['k']=v

    Functional assign way

    df2=df.assign(k=v)

    Assigning in-place may execute faster than using assign,but assign enables easier method chaining:

    result=(df2.assign(col1_demeqaned=df2.col1-df2.col2.mean())

      .groupby('key')
      
      .col1_demeaned.std())
    

    Version2

    result=(load_data()[load_data()['col2']<0]

      .assign(col1_demeaned=df2.col1-df2.col2.mean())   # have to refering to df2....
      
      .groupby('key')
      
      .col1_demeaned.std()
      
      )
    

    We use the outer parentheses to make it more convenient to add line breaks.

    One thing to keep in mind when doing method chaining is that you may need to refer to temporary objects.In the preceding example, we cannot refer to the result of load_data until it has been assigning to the temporary variable df.To help with this assign and many other pandas functions accept function-like arguments,also known as callables.

    To show callables in action, consider a fragment of the example from before:

    df=load_data()
    df2=df[df['col2']<0]

    version 2

    df=(load_data()
    [lambda x:x['col2']<0]) # alternative method of refering to temporary objects using callables

    result=(load_data()
    [lambda x:x.col2<0]

    .assign(col1_demeaned=lambda x:x.col1-x.col1.mean())
    
    .groupby('key')
    
    .col1_demeaned.std())
    

    The pipe method

    You can accomplish a lot with built-in pandas functions and the approaches to method chaining with callables taht we just looked at.However,sometimes you need to use your own functions or funcitons from third-party libraries.This is where the pipe method comes in.

    Consider a sequence of functions calls:

    a=f(df,arg1=v1)

    b=g(a,v2,arg3=v3)

    c=h(b,arg4=v4)

    When using functions that accept and returns Series or DataFrame objects,you can rewrite this using calls to pipe:

    result=(df.pipe(f,arg1=v1)

    .pipe(g,v2,arg3=v3)

    .pipe(h,arg4=v4))

    The statement f(df) and df.pipe(f) are equivalent,but pipe makes chained inocation easier.

    def group_demean(df,by,cols):
        result=df.copy()
        g=df.groupby(by)
        for c in cols:
            result[c]=df[c]-g[c].transform('mean')
        return result
    

    result=(df[df.col1[<0]

    .pipe(group_demean,['key1','key2'],['col1']])

    
    
    ##### 愿你一寸一寸地攻城略地,一点一点地焕然一新 #####
  • 相关阅读:
    json 拼二维json数组
    lnmp 环境搭建
    lnmp git ruby sass 安装
    制定clone的用户名
    git branch 重命名
    sql命令创建数据库
    layerui
    弹出确认框,打开新窗口
    Ajax请求小结
    .net core注入服务
  • 原文地址:https://www.cnblogs.com/johnyang/p/12945329.html
Copyright © 2011-2022 走看看