zoukankan      html  css  js  c++  java
  • Time Series in pandas

    Time Series

    import pandas as pd
    
    import numpy as np
    

    Date and Time data types and tools

    from datetime import datetime
    from datetime import date
    from datetime import timedelta
    
    now=datetime.now()
    
    now
    
    datetime.datetime(2020, 5, 5, 9, 51, 27, 686891)
    
    now.year,now.month,now.day
    
    (2020, 5, 5)
    
    now.time()
    
    datetime.time(9, 51, 27, 686891)
    

    datetime stores both the date and time down to the microsecond.timedelta represents the temporal difference between two datetime objects.

    delta=datetime(2020,4,29)-datetime(2018,8,19)
    
    delta
    
    datetime.timedelta(619)
    
    delta.days
    
    619
    
    delta.seconds
    
    0
    

    You can add(or subtract) a timedelta or multiple thereof to a datetime object to yield a new shifted object:

    start=datetime(2020,4,30)
    
    start+timedelta(100)
    
    datetime.datetime(2020, 8, 8, 0, 0)
    
    datetime.now()
    
    datetime.datetime(2020, 5, 5, 9, 51, 27, 888352)
    
    date(2008,8,8)-date(2000,8,8)
    
    datetime.timedelta(2922)
    

    Converting between string and datetime

    You can format datetime and pandas Timestamp objects ,as string using str or strftime method,passing a format specification.

    stamp=datetime(2011,1,3)
    
    str(stamp)
    
    '2011-01-03 00:00:00'
    
    stamp.strftime('%Y--%m-!-%d') # can be meomorized by 'str from time', datetime object instance method
    
    '2011--01-!-03'
    
    description={'%Y':'four-digit year',
               '%y':'two-digit year',
               '%m':'two-digit month[01,12]',
               '%d':'two-digit day[01,31]',
               '%H':'Hour(24-hour clock)[00,23]',
               '%I':'Hour(12-hour clock)[01,12]',
               '%M':'two-digit minute[00,59]',
               '%S':'second[00,61](seconds60,61 account for leap seconds)',
               '%w':'weekday as integer[0(sunday),6]',
               '%U':'week number of the year[00,53];sunday is considered the first day of the first day of the week,and days before the first sunday of the year are "week 0"',
               '%W':'week number of the year[00,53];monday is considered the first day of the week and days before the first monday of the year are "week 0"',
               '%z':'UTC time zone offset as +HHMM or -HHMM;empty if time zone naive',
               '%F':'shortcut for %Y-%m-%d',
               '%D':'shortcut for %m/%d/%y(e.g04/18/12)'}
    
    pd.DataFrame(description,index=[0])
    
    %Y %y %m %d %H %I %M %S %w %U %W %z %F %D
    0 four-digit year two-digit year two-digit month[01,12] two-digit day[01,31] Hour(24-hour clock)[00,23] Hour(12-hour clock)[01,12] two-digit minute[00,59] second[00,61](seconds60,61 account for leap se... weekday as integer[0(sunday),6] week number of the year[00,53];sunday is consi... week number of the year[00,53];monday is consi... UTC time zone offset as +HHMM or -HHMM;empty i... shortcut for %Y-%m-%d shortcut for %m/%d/%y(e.g04/18/12)
    pd.DataFrame(description,index=[0]).stack()
    
    0  %Y                                      four-digit year
       %y                                       two-digit year
       %m                               two-digit month[01,12]
       %d                                 two-digit day[01,31]
       %H                           Hour(24-hour clock)[00,23]
       %I                           Hour(12-hour clock)[01,12]
       %M                              two-digit minute[00,59]
       %S    second[00,61](seconds60,61 account for leap se...
       %w                      weekday as integer[0(sunday),6]
       %U    week number of the year[00,53];sunday is consi...
       %W    week number of the year[00,53];monday is consi...
       %z    UTC time zone offset as +HHMM or -HHMM;empty i...
       %F                                shortcut for %Y-%m-%d
       %D                   shortcut for %m/%d/%y(e.g04/18/12)
    dtype: object
    

    You can use these same format codes to convert strings to dates using datetime.strptime:

    value='2011-01-03'
    
    datetime.strptime(value,'%Y-%m-%d')  # can be memorized by 'str produce time'  datetime model method
    
    datetime.datetime(2011, 1, 3, 0, 0)
    
    datestrs=['7/6/2011','8/6/2011']
    
    [datetime.strptime(x,'%m/%d/%Y') for x in datestrs]
    
    [datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
    

    datetime.strptime is a good way to parse a date with a known format.However,it can be a bit annoyning to have to write a format spec each time,especially for common date formats.In this case,you can use the parser.parse method in the third-party dateutil package(this is installed automatically when you install pandas):

    from dateutil.parser import parse  # notice that, it is `parse` not `parser`!
    
    parse('2020--4--30')
    
    datetime.datetime(2020, 4, 30, 0, 0)
    

    dateutil is capable of parsing most human-intelligible date representation:

    parse('Jan 31,1997 10:45PM')  #? what happened here?
    
    datetime.datetime(2020, 1, 31, 22, 45)
    

    In international locales,day appearing before month is very common,so you can pass dayfirst=True to indicate this:

    parse('6/12/2022',dayfirst=True)
    
    datetime.datetime(2022, 12, 6, 0, 0)
    

    pandas is generally oriented toward working with arrays of date,whether used as an axis index or a column in a DataFrame.The to_datetime method parses many different kinds of date representations.Standard date formats like ISO 8601 can be parsed very quickly.

    datestrs=['2011-07-16 12:00:00','2011-08-06 00:00:00']
    
    pd.to_datetime(['2011-07-16 ','2011-08-06 '])
    
    DatetimeIndex(['2011-07-16', '2011-08-06'], dtype='datetime64[ns]', freq=None)
    
    pd.to_datetime(datestrs)
    
    DatetimeIndex(['2011-07-16 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
    

    It also handles values that should be considered missing (None,empty string,etc.):

    idx=pd.to_datetime(datestrs+[None]);idx
    
    DatetimeIndex(['2011-07-16 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)
    

    NaT(not a time) is pandas's null value for timestamp data.

    Notice
    dateutil.parse is a useful but imperfect tool.

    Time series basics

    A basic kind of time series object in pandas is a Series indexed by timestamps,which is often represented external to pandas as Python strings or datetime objects.

    dates=[datetime(2020,1,2),datetime(2020,1,5),datetime(2020,1,7),datetime(2020,1,8),datetime(2020,1,10),datetime(2020,1,12)]
    
    ts=pd.Series(np.random.randn(6),index=dates);ts
    
    2020-01-02   -1.140949
    2020-01-05   -0.328999
    2020-01-07   -0.046164
    2020-01-08   -0.783714
    2020-01-10   -0.126047
    2020-01-12   -0.848602
    dtype: float64
    
    pd.Series(np.random.randn(3),index=idx)
    
    2011-07-16 12:00:00    2.259805
    2011-08-06 00:00:00   -0.877063
    NaT                   -0.697678
    dtype: float64
    

    Under the hood,these datetime objects have been put in a DatetimeIndex:

    ts.index
    
    DatetimeIndex(['2020-01-02', '2020-01-05', '2020-01-07', '2020-01-08',
                   '2020-01-10', '2020-01-12'],
                  dtype='datetime64[ns]', freq=None)
    

    Like other Series,arithmetic operations between differently indexed time series automatically align on the dates:

    ts[::2]
    
    2020-01-02   -1.140949
    2020-01-07   -0.046164
    2020-01-10   -0.126047
    dtype: float64
    
    ts+ts[::2]
    
    2020-01-02   -2.281898
    2020-01-05         NaN
    2020-01-07   -0.092328
    2020-01-08         NaN
    2020-01-10   -0.252095
    2020-01-12         NaN
    dtype: float64
    
    pd.Series([1,2,3])+pd.Series([3,4])
    
    0    4.0
    1    6.0
    2    NaN
    dtype: float64
    

    pandas stores timestamps using Numpy's datetime64 data type at the nanosecond resolution.

    ts.index.dtype
    
    dtype('<M8[ns]')
    

    Scalar values from a DatetimeIndex are pandas Timestamp objects:

    stamp=ts.index[0];stamp
    
    Timestamp('2020-01-02 00:00:00')
    

    A Timestamp can be substituted anywhere you would use a datetime object.Addtionally,it can store frequency information(if any) and understands how to do time zone conversions and other kinds of manipulations.

    Indexing,selection,subsetting

    ts
    
    2020-01-02   -1.140949
    2020-01-05   -0.328999
    2020-01-07   -0.046164
    2020-01-08   -0.783714
    2020-01-10   -0.126047
    2020-01-12   -0.848602
    dtype: float64
    

    Time series behaves like any other pandas.Series when you are indexing and selecting data based on label:

    stamp=ts.index[2]
    
    ts[stamp]
    
    -0.04616414830843706
    

    As a convenience,you can also pass a string that is interpretable as a date:

    ts['1/10/2020']
    
    -0.12604738036158042
    
    ts['2020-1-10']
    
    -0.12604738036158042
    

    For longer time series,a year or only a year and month can be passed to easily select slices of data:

    help(pd.date_range)
    
    Help on function date_range in module pandas.core.indexes.datetimes:
    
    date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
        Return a fixed frequency DatetimeIndex.
        
        Parameters
        ----------
        start : str or datetime-like, optional
            Left bound for generating dates.
        end : str or datetime-like, optional
            Right bound for generating dates.
        periods : integer, optional
            Number of periods to generate.
        freq : str or DateOffset, default 'D'
            Frequency strings can have multiples, e.g. '5H'. See
            :ref:`here <timeseries.offset_aliases>` for a list of
            frequency aliases.
        tz : str or tzinfo, optional
            Time zone name for returning localized DatetimeIndex, for example
            'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is
            timezone-naive.
        normalize : bool, default False
            Normalize start/end dates to midnight before generating date range.
        name : str, default None
            Name of the resulting DatetimeIndex.
        closed : {None, 'left', 'right'}, optional
            Make the interval closed with respect to the given frequency to
            the 'left', 'right', or both sides (None, the default).
        **kwargs
            For compatibility. Has no effect on the result.
        
        Returns
        -------
        rng : DatetimeIndex
        
        See Also
        --------
        DatetimeIndex : An immutable container for datetimes.
        timedelta_range : Return a fixed frequency TimedeltaIndex.
        period_range : Return a fixed frequency PeriodIndex.
        interval_range : Return a fixed frequency IntervalIndex.
        
        Notes
        -----
        Of the four parameters ``start``, ``end``, ``periods``, and ``freq``,
        exactly three must be specified. If ``freq`` is omitted, the resulting
        ``DatetimeIndex`` will have ``periods`` linearly spaced elements between
        ``start`` and ``end`` (closed on both sides).
        
        To learn more about the frequency strings, please see `this link
        <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
        
        Examples
        --------
        **Specifying the values**
        
        The next four examples generate the same `DatetimeIndex`, but vary
        the combination of `start`, `end` and `periods`.
        
        Specify `start` and `end`, with the default daily frequency.
        
        >>> pd.date_range(start='1/1/2018', end='1/08/2018')
        DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                       '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                      dtype='datetime64[ns]', freq='D')
        
        Specify `start` and `periods`, the number of periods (days).
        
        >>> pd.date_range(start='1/1/2018', periods=8)
        DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                       '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                      dtype='datetime64[ns]', freq='D')
        
        Specify `end` and `periods`, the number of periods (days).
        
        >>> pd.date_range(end='1/1/2018', periods=8)
        DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
                       '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
                      dtype='datetime64[ns]', freq='D')
        
        Specify `start`, `end`, and `periods`; the frequency is generated
        automatically (linearly spaced).
        
        >>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
        DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
                       '2018-04-27 00:00:00'],
                      dtype='datetime64[ns]', freq=None)
        
        **Other Parameters**
        
        Changed the `freq` (frequency) to ``'M'`` (month end frequency).
        
        >>> pd.date_range(start='1/1/2018', periods=5, freq='M')
        DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
                       '2018-05-31'],
                      dtype='datetime64[ns]', freq='M')
        
        Multiples are allowed
        
        >>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
        DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                       '2019-01-31'],
                      dtype='datetime64[ns]', freq='3M')
        
        `freq` can also be specified as an Offset object.
        
        >>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
        DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                       '2019-01-31'],
                      dtype='datetime64[ns]', freq='3M')
        
        Specify `tz` to set the timezone.
        
        >>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
        DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
                       '2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
                       '2018-01-05 00:00:00+09:00'],
                      dtype='datetime64[ns, Asia/Tokyo]', freq='D')
        
        `closed` controls whether to include `start` and `end` that are on the
        boundary. The default includes boundary points on either end.
        
        >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed=None)
        DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
                      dtype='datetime64[ns]', freq='D')
        
        Use ``closed='left'`` to exclude `end` if it falls on the boundary.
        
        >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='left')
        DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
                      dtype='datetime64[ns]', freq='D')
        
        Use ``closed='right'`` to exclude `start` if it falls on the boundary.
        
        >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='right')
        DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
                      dtype='datetime64[ns]', freq='D')
    
    pd.date_range('2020-05-01',periods=5)
    
    DatetimeIndex(['2020-05-01', '2020-05-02', '2020-05-03', '2020-05-04',
                   '2020-05-05'],
                  dtype='datetime64[ns]', freq='D')
    
    longer_ts=pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))
    
    longer_ts['2001']
    
    2001-01-01   -2.081715
    2001-01-02    1.425891
    2001-01-03    0.314430
    2001-01-04    0.153332
    2001-01-05    0.282888
                    ...   
    2001-12-27   -0.299994
    2001-12-28    1.852834
    2001-12-29    1.847192
    2001-12-30    0.592563
    2001-12-31    0.519122
    Freq: D, Length: 365, dtype: float64
    

    Here,the string '2001' is interpreted as a year and selects that time period.This also works if you specify the month:

    longer_ts['2001-05']
    
    2001-05-01   -0.408720
    2001-05-02   -0.958682
    2001-05-03   -0.424746
    2001-05-04    0.771404
    2001-05-05    1.959182
    2001-05-06    0.287984
    2001-05-07   -0.199789
    2001-05-08   -0.369938
    2001-05-09    0.309950
    2001-05-10   -1.649661
    2001-05-11    0.119676
    2001-05-12    0.205413
    2001-05-13    0.416938
    2001-05-14   -0.305450
    2001-05-15   -0.126385
    2001-05-16    1.665036
    2001-05-17    0.627492
    2001-05-18   -1.317637
    2001-05-19   -2.734170
    2001-05-20   -0.163745
    2001-05-21   -0.784528
    2001-05-22    0.151304
    2001-05-23    0.583916
    2001-05-24    0.571195
    2001-05-25   -1.498402
    2001-05-26   -1.485187
    2001-05-27    0.411882
    2001-05-28    0.323999
    2001-05-29    0.627545
    2001-05-30   -2.054165
    2001-05-31   -1.493494
    Freq: D, dtype: float64
    
    longer_ts['2000-05-01']
    
    -0.08042675710861961
    

    Slicing with datetime objects works as well:

    ts[datetime(2020,1,7)]
    
    -0.04616414830843706
    

    Because most time series data is ordered chronologically,you can slice with timestamps not contained in a time series to perform a range query:

    ts
    
    2020-01-02   -1.140949
    2020-01-05   -0.328999
    2020-01-07   -0.046164
    2020-01-08   -0.783714
    2020-01-10   -0.126047
    2020-01-12   -0.848602
    dtype: float64
    
    ts['2020-1-6':'2020-1-14'] # Notice that,'2020-1-6' and '2020-1-14' are not in ts.
    
    2020-01-07   -0.046164
    2020-01-08   -0.783714
    2020-01-10   -0.126047
    2020-01-12   -0.848602
    dtype: float64
    

    As before,you can pass either a string date,datetime,or timestamp.Remeber that slicing in this manner produces views on the source time series like slicing Numpy arrays.This means that no data is copied and modifications on the slice will be reflected in the original data.

    There is an equivalent instance method,truncate,that slices a Series between tow-dates:

    ts.truncate(after='1/9/2020')
    
    2020-01-02   -1.140949
    2020-01-05   -0.328999
    2020-01-07   -0.046164
    2020-01-08   -0.783714
    dtype: float64
    

    All of this holds true for DataFrame as well,indexing on its rows:

    dates=pd.date_range('1/1/2000',periods=100,freq='w-wed')
    
    long_df=pd.DataFrame(np.random.randn(100,4),index=dates,columns=['Colorado','Texas','New York','Ohio']) # Attention,type disscussed above is Series,and here,changed to be DataFrame
    
    long_df.loc['5-2001']
    
    Colorado Texas New York Ohio
    2001-05-02 -0.157680 -0.398869 0.399008 -0.109813
    2001-05-09 -0.475721 -0.550544 0.406308 0.290822
    2001-05-16 -1.101315 2.469916 -0.062604 -0.409562
    2001-05-23 0.972402 1.035692 -0.594960 1.255631
    2001-05-30 -1.823161 -0.327003 -0.294791 0.795953

    Time series with duplicate indices

    In some applications,there may be multiple data observations failling on a particular timestamp.

    dates=pd.DatetimeIndex(['1/1/2000','1/2/2000','1/2/2000','1/2/2000','1/3/2000'])
    
    dup_ts=pd.Series(np.arange(5),index=dates)
    
    dup_ts
    
    2000-01-01    0
    2000-01-02    1
    2000-01-02    2
    2000-01-02    3
    2000-01-03    4
    dtype: int32
    
    dup_ts.index.is_unique
    
    False
    

    Indexing into this time series will now either produce scalar values or slices depending on whether a timestamp is duplicated:

    dup_ts['1/3/2000']
    
    4
    
    dup_ts['1/2/2000']
    
    2000-01-02    1
    2000-01-02    2
    2000-01-02    3
    dtype: int32
    

    Suppose you wanted to aggregate the data having non-unique timestamps.One way to do this is to use groupby and pass level=0:

    grouped=dup_ts.groupby(level=0)
    
    grouped.mean()
    
    2000-01-01    0
    2000-01-02    2
    2000-01-03    4
    dtype: int32
    
    grouped.count()
    
    2000-01-01    1
    2000-01-02    3
    2000-01-03    1
    dtype: int64
    

    Date ranges,frequencies,and shifting

    Generic time series in pandas are assumed to be irregular;that is,thet have no fixed frequency.For many applications this is sufficient.However,it's often desirable to work relative to a fixed frequency,such as daily,monthly,or every 15 minutes,even if that means introducing missing values into a time series. Fortunately pandas has a full suite of standard time series frequencies and tools for resampling,inferring frequencies,and generating fixed-frequency date ranges.

    ts
    
    2020-01-02   -1.140949
    2020-01-05   -0.328999
    2020-01-07   -0.046164
    2020-01-08   -0.783714
    2020-01-10   -0.126047
    2020-01-12   -0.848602
    dtype: float64
    
    resampler=ts.resample('D');resampler
    
    <pandas.core.resample.DatetimeIndexResampler object at 0x0000014F20270320>
    

    The string 'D' is interpreted as a daily frequency.Conversion between frequencies or resampling is a big enough topic to have its own section.

    Generating date ranges

    pd.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency.

    index=pd.date_range('2012-04-01','2012-06-01')
    
    index
    
    DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
                   '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
                   '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
                   '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
                   '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
                   '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
                   '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
                   '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
                   '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
                   '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
                   '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
                   '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
                   '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
                   '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
                   '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
                   '2012-05-31', '2012-06-01'],
                  dtype='datetime64[ns]', freq='D')
    

    By default,date_range generates daily timestamps.If you pass only a start or end date,you must pass a number of periods to generate:

    pd.date_range(start='2012-04-01',periods=20)
    
    DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
                   '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
                   '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
                   '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
                   '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
                  dtype='datetime64[ns]', freq='D')
    
    pd.date_range(end='2012-06-1',periods=20)
    
    DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
                   '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
                   '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
                   '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
                   '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
                  dtype='datetime64[ns]', freq='D')
    

    The start and end dates define strict boundaries for the generated date index.For example,if you wanted a date index containing the last business day of each month,you would pass the BM frequency(business end of month) and only dates falling on or inside the date interval will be included:

    pd.date_range('2020-04-01','2020-07-01',freq='BM') #'BM'-->BusinessMonthEnd
    
    DatetimeIndex(['2020-04-30', '2020-05-29', '2020-06-30'], dtype='datetime64[ns]', freq='BM')
    
    pd.date_range('2020-05-01',periods=10,freq='w-fri') # 'w-mon,w-tue...'--->Weekly on given day of week
    
    DatetimeIndex(['2020-05-01', '2020-05-08', '2020-05-15', '2020-05-22',
                   '2020-05-29', '2020-06-05', '2020-06-12', '2020-06-19',
                   '2020-06-26', '2020-07-03'],
                  dtype='datetime64[ns]', freq='W-FRI')
    
    pd.date_range('2020-05-01',periods=3,freq='b') # 'b'--> BusinessDay
    
    DatetimeIndex(['2020-05-01', '2020-05-04', '2020-05-05'], dtype='datetime64[ns]', freq='B')
    
    pd.date_range('2020-05-01',periods=4,freq='q-feb')# quarterly dates anchored on last calendar day of each month.
    
    DatetimeIndex(['2020-05-31', '2020-08-31', '2020-11-30', '2021-02-28'], dtype='datetime64[ns]', freq='Q-FEB')
    
    pd.date_range('2020-05-01',periods=4,freq='q-mar')# to-->march for the last item
    
    DatetimeIndex(['2020-06-30', '2020-09-30', '2020-12-31', '2021-03-31'], dtype='datetime64[ns]', freq='Q-MAR')
    

    date_range by default preserves the time(if any) of the start or end timestamp.

    pd.date_range('2012-05-02 12:56:31',periods=5)
    
    DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
                   '2012-05-04 12:56:31', '2012-05-05 12:56:31',
                   '2012-05-06 12:56:31'],
                  dtype='datetime64[ns]', freq='D')
    

    Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention.To do this ,there is a normalize option:

    pd.date_range('2012-05-02 12:56:31',periods=5,normalize=True)
    
    DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
                   '2012-05-06'],
                  dtype='datetime64[ns]', freq='D')
    

    Frequencies and date offsets

    Frequencies in pandas are composed of a base frequency and a multiplier.Base frequencies are typically referred to by a string alias,like M for monthly or H for hourly.For each base frequency, there is an object defined generally referred to as a date offset.

    from pandas.tseries.offsets import Hour,Minute
    
    hour=Hour()
    
    hour
    
    <Hour>
    
    four_hour=Hour(4)
    
    four_hour
    
    <4 * Hours>
    

    In most applications,you would never need to explicitly create one of these objects,instead using a string alias like 'H' or '4H'.Putting an integer before the base frequency creates a multiple:

    pd.date_range('2000-01-01','2000-01-03 23:59',freq='4H')
    
    DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
                   '2000-01-01 08:00:00', '2000-01-01 12:00:00',
                   '2000-01-01 16:00:00', '2000-01-01 20:00:00',
                   '2000-01-02 00:00:00', '2000-01-02 04:00:00',
                   '2000-01-02 08:00:00', '2000-01-02 12:00:00',
                   '2000-01-02 16:00:00', '2000-01-02 20:00:00',
                   '2000-01-03 00:00:00', '2000-01-03 04:00:00',
                   '2000-01-03 08:00:00', '2000-01-03 12:00:00',
                   '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
                  dtype='datetime64[ns]', freq='4H')
    

    Many offsets can be combined together by addition:

    Hour(2)+Minute(30)
    
    <150 * Minutes>
    

    Similarly,you can pass frequency strings,like '1h30min',that will effectively be parsed to the same expression:

    pd.date_range('2000-01-01',periods=10,freq='1h30min')
    
    DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
                   '2000-01-01 03:00:00', '2000-01-01 04:30:00',
                   '2000-01-01 06:00:00', '2000-01-01 07:30:00',
                   '2000-01-01 09:00:00', '2000-01-01 10:30:00',
                   '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
                  dtype='datetime64[ns]', freq='90T')
    

    Some frequencies describe points in time that are not evenly spaced.For example,'M'(calendar month end) and 'BM'(last business/weekday of month) depend on the number of days in a month and, in the latter case ,whether the month ends on weekend or not.We refer to these as anchor offset.

    week of month dates

    One useful frequency class is 'week of month',starting with WOM.This enables you to get dates like the third Fridy of each month.

    rng=pd.date_range('2012-01-01','2012-09-01',freq='WOM-3FRI')
    
    list(rng)
    
    [Timestamp('2012-01-20 00:00:00', freq='WOM-3FRI'),
     Timestamp('2012-02-17 00:00:00', freq='WOM-3FRI'),
     Timestamp('2012-03-16 00:00:00', freq='WOM-3FRI'),
     Timestamp('2012-04-20 00:00:00', freq='WOM-3FRI'),
     Timestamp('2012-05-18 00:00:00', freq='WOM-3FRI'),
     Timestamp('2012-06-15 00:00:00', freq='WOM-3FRI'),
     Timestamp('2012-07-20 00:00:00', freq='WOM-3FRI'),
     Timestamp('2012-08-17 00:00:00', freq='WOM-3FRI')]
    

    Shifting(Leading and lagging)data

    Shifting means moving data backward and forward through time.Both Series and DataFrame have a shift method for doing naive shifts forward,leaving the index unmodified:

    ts=pd.Series(np.random.randn(4),index=pd.date_range('1/1/2020',periods=4,freq='M'))
    
    help(ts.shift)
    
    Help on method shift in module pandas.core.series:
    
    shift(periods=1, freq=None, axis=0, fill_value=None) method of pandas.core.series.Series instance
        Shift index by desired number of periods with an optional time `freq`.
        
        When `freq` is not passed, shift the index without realigning the data.
        If `freq` is passed (in this case, the index must be date or datetime,
        or it will raise a `NotImplementedError`), the index will be
        increased using the periods and the `freq`.
        
        Parameters
        ----------
        periods : int
            Number of periods to shift. Can be positive or negative.
        freq : DateOffset, tseries.offsets, timedelta, or str, optional
            Offset to use from the tseries module or time rule (e.g. 'EOM').
            If `freq` is specified then the index values are shifted but the
            data is not realigned. That is, use `freq` if you would like to
            extend the index when shifting and preserve the original data.
        axis : {0 or 'index', 1 or 'columns', None}, default None
            Shift direction.
        fill_value : object, optional
            The scalar value to use for newly introduced missing values.
            the default depends on the dtype of `self`.
            For numeric data, ``np.nan`` is used.
            For datetime, timedelta, or period data, etc. :attr:`NaT` is used.
            For extension dtypes, ``self.dtype.na_value`` is used.
        
            .. versionchanged:: 0.24.0
        
        Returns
        -------
        Series
            Copy of input object, shifted.
        
        See Also
        --------
        Index.shift : Shift values of Index.
        DatetimeIndex.shift : Shift values of DatetimeIndex.
        PeriodIndex.shift : Shift values of PeriodIndex.
        tshift : Shift the time index, using the index's frequency if
            available.
        
        Examples
        --------
        >>> df = pd.DataFrame({'Col1': [10, 20, 15, 30, 45],
        ...                    'Col2': [13, 23, 18, 33, 48],
        ...                    'Col3': [17, 27, 22, 37, 52]})
        
        >>> df.shift(periods=3)
           Col1  Col2  Col3
        0   NaN   NaN   NaN
        1   NaN   NaN   NaN
        2   NaN   NaN   NaN
        3  10.0  13.0  17.0
        4  20.0  23.0  27.0
        
        >>> df.shift(periods=1, axis='columns')
           Col1  Col2  Col3
        0   NaN  10.0  13.0
        1   NaN  20.0  23.0
        2   NaN  15.0  18.0
        3   NaN  30.0  33.0
        4   NaN  45.0  48.0
        
        >>> df.shift(periods=3, fill_value=0)
           Col1  Col2  Col3
        0     0     0     0
        1     0     0     0
        2     0     0     0
        3    10    13    17
        4    20    23    27
    
    ts
    
    2020-01-31    0.225376
    2020-02-29   -0.024710
    2020-03-31    0.117686
    2020-04-30    1.513727
    Freq: M, dtype: float64
    
    ts.shift(2)
    
    2020-01-31         NaN
    2020-02-29         NaN
    2020-03-31    0.225376
    2020-04-30   -0.024710
    Freq: M, dtype: float64
    
    ts.shift(-2)
    
    2020-01-31    0.117686
    2020-02-29    1.513727
    2020-03-31         NaN
    2020-04-30         NaN
    Freq: M, dtype: float64
    

    When we shift like this,missing data is introduced either at the start or the end of the time Series.

    A common use of shift is computing percent changes in a time series or multiple time series as DataFrame columns.

    Because naive shift leave the index unmodified,some data is discard.Thus if the frequency is known ,it can be passed to shift to advance the timestamps instead of simply the data:
    Shift index by desired number of periods with an optional time freq.

    When freq is not passed, shift the index without realigning the data.
    If freq is passed (in this case, the index must be date or datetime,
    or it will raise a NotImplementedError), the index will be
    increased using the periods and the freq.

    ts.shift(2,freq='M') 
    
    2020-03-31    0.225376
    2020-04-30   -0.024710
    2020-05-31    0.117686
    2020-06-30    1.513727
    Freq: M, dtype: float64
    
    ts
    
    2020-01-31    0.225376
    2020-02-29   -0.024710
    2020-03-31    0.117686
    2020-04-30    1.513727
    Freq: M, dtype: float64
    

    Other frequencies can be passed too,giving you some flexibility in how to lead and lag the data:

    ts.shift(3,freq='D') # shift every timestamp in ts forward 3 days
    
    2020-02-03    0.225376
    2020-03-03   -0.024710
    2020-04-03    0.117686
    2020-05-03    1.513727
    dtype: float64
    
    ts.shift(1,freq='90T')  # shift every timestamp in ts forward '90T'
    
    2020-01-31 01:30:00    0.225376
    2020-02-29 01:30:00   -0.024710
    2020-03-31 01:30:00    0.117686
    2020-04-30 01:30:00    1.513727
    Freq: M, dtype: float64
    

    Shift dates with offsets

    The pandas date offsets can also be used with datetime or Timestamp objects:

    from pandas.tseries.offsets import Day,MonthEnd
    
    now=datetime(2020,11,17)
    
    now+3*Day()
    
    Timestamp('2020-11-20 00:00:00')
    
    now+MonthEnd(2)
    
    Timestamp('2020-12-31 00:00:00')
    

    Anchored offsets can explicitly 'roll' dates forward or backward by simply using their rollforward and rollback methods,respectively:

    offset=MonthEnd()
    
    offset.rollforward(now)
    
    Timestamp('2020-11-30 00:00:00')
    
    offset.rollback(now)
    
    Timestamp('2020-10-31 00:00:00')
    

    A creative use of date offset is to use these methods with groupby:

    ts=pd.Series(np.random.randn(20),index=pd.date_range('1/15/2000',periods=20,freq='4d'))
    

    ts

    list(ts.groupby(offset.rollforward))
    
    [(Timestamp('2000-01-31 00:00:00'),
      2000-01-15   -0.209800
      2000-01-19   -2.189881
      2000-01-23   -1.779681
      2000-01-27    0.437441
      2000-01-31    1.054685
      Freq: 4D, dtype: float64),
     (Timestamp('2000-02-29 00:00:00'),
      2000-02-04   -0.506648
      2000-02-08    0.484109
      2000-02-12   -0.385587
      2000-02-16   -0.732983
      2000-02-20   -1.459167
      2000-02-24   -1.133808
      2000-02-28    0.097860
      Freq: 4D, dtype: float64),
     (Timestamp('2000-03-31 00:00:00'),
      2000-03-03    0.480492
      2000-03-07    1.040105
      2000-03-11    0.634999
      2000-03-15    0.621187
      2000-03-19   -1.410100
      2000-03-23    0.319765
      2000-03-27   -1.079803
      2000-03-31   -1.292514
      Freq: 4D, dtype: float64)]
    
    ts.groupby(offset.rollforward).mean()
    
    2000-01-31   -0.537447
    2000-02-29   -0.519461
    2000-03-31   -0.085734
    dtype: float64
    

    Of course ,an easier and faster way to do this is using resample

    ts.resample('M').mean()
    
    2000-01-31   -0.537447
    2000-02-29   -0.519461
    2000-03-31   -0.085734
    Freq: M, dtype: float64
    

    Time zone handling

    Working with time zone is generally considered one of the most unpleasant parts of time series manipulation.As a result,many time series users choose to work with time series in coordinated universal time or UTC,which is the successor to Greenwich Mean Time and is the current internaational standard.Time zones are expressed as offsets from UTC;

    In Python,time zone information comes from the third-party pytz library,which exposes the Olson database,a compilation of world time zone information. Pandas wraps pytz's funtionality so you can ignore its API outside of the time zone names. Time zone names can be found interactively and in the docs.

    import pytz
    
    pytz.common_timezones[-5:]
    
    ['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']
    
    len(pytz.common_timezones)
    
    440
    

    To get a time zone object from pytz,use pytz.timezone:

    tz=pytz.timezone('America/New_York')
    
    tz
    
    <DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>
    

    Time zone localization and conversion

    By default,time series in pandas are time zone naive, for example,consider the following time series:

    rng=pd.date_range('3/9/2020 9:00',periods=6,freq='D')
    
    ts=pd.Series(np.random.randn(len(rng)),index=rng)
    
    ts
    
    2020-03-09 09:00:00   -0.384109
    2020-03-10 09:00:00   -0.195272
    2020-03-11 09:00:00   -0.473277
    2020-03-12 09:00:00    1.430223
    2020-03-13 09:00:00   -0.222399
    2020-03-14 09:00:00   -0.844174
    Freq: D, dtype: float64
    
    print(ts.index.tz) # print None,indicating that ts.index is naive time zone
    
    None
    

    The index's field is None. Date ranges can be generated with a time zone set:

    pd.date_range('3/9/2020 9:30',periods=10,freq='D',tz='UTC')
    
    DatetimeIndex(['2020-03-09 09:30:00+00:00', '2020-03-10 09:30:00+00:00',
                   '2020-03-11 09:30:00+00:00', '2020-03-12 09:30:00+00:00',
                   '2020-03-13 09:30:00+00:00', '2020-03-14 09:30:00+00:00',
                   '2020-03-15 09:30:00+00:00', '2020-03-16 09:30:00+00:00',
                   '2020-03-17 09:30:00+00:00', '2020-03-18 09:30:00+00:00'],
                  dtype='datetime64[ns, UTC]', freq='D')
    

    Conversion from naive to localized is handled by tz_localize:

    ts
    
    2020-03-09 09:00:00   -0.384109
    2020-03-10 09:00:00   -0.195272
    2020-03-11 09:00:00   -0.473277
    2020-03-12 09:00:00    1.430223
    2020-03-13 09:00:00   -0.222399
    2020-03-14 09:00:00   -0.844174
    Freq: D, dtype: float64
    
    ts_utc=ts.tz_localize('utc')
    
    ts_utc
    
    2020-03-09 09:00:00+00:00   -0.384109
    2020-03-10 09:00:00+00:00   -0.195272
    2020-03-11 09:00:00+00:00   -0.473277
    2020-03-12 09:00:00+00:00    1.430223
    2020-03-13 09:00:00+00:00   -0.222399
    2020-03-14 09:00:00+00:00   -0.844174
    Freq: D, dtype: float64
    
    ts_utc.index # dtype has been changed to be UTC
    
    DatetimeIndex(['2020-03-09 09:00:00+00:00', '2020-03-10 09:00:00+00:00',
                   '2020-03-11 09:00:00+00:00', '2020-03-12 09:00:00+00:00',
                   '2020-03-13 09:00:00+00:00', '2020-03-14 09:00:00+00:00'],
                  dtype='datetime64[ns, UTC]', freq='D')
    

    Operations with Time zone --Aware Timestamp objects

    Similar to time series and date ranges,individual Timestamp objects similarly can be localized from naive to time-aware and converted from one time zone to another:

    stamp=pd.Timestamp('2011-03-12 04:00')
    
    print(stamp.tz)
    
    None
    
    stamp_utc=stamp.tz_localize('utc')
    
    stamp_utc.tz_convert('America/New_York')
    
    Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')
    
    ts.index.tz_localize('utc').tz_convert('America/New_York') # naive shall be localized and then can be converted to anoter time zone
    
    DatetimeIndex(['2020-03-09 05:00:00-04:00', '2020-03-10 05:00:00-04:00',
                   '2020-03-11 05:00:00-04:00', '2020-03-12 05:00:00-04:00',
                   '2020-03-13 05:00:00-04:00', '2020-03-14 05:00:00-04:00'],
                  dtype='datetime64[ns, America/New_York]', freq='D')
    
    ts.index.tz_localize('utc').tz
    
    <UTC>
    

    You can also pass a time zone when creating the Timestamp:

    stamp_mscow=pd.Timestamp('2011-03-12 04:00',tz='Europe/Moscow')
    
    stamp_mscow.tz
    
    <DstTzInfo 'Europe/Moscow' MSK+3:00:00 STD>
    

    Time zone-aware Timestamp objects internally store a UTC timestamp value as nano-seconds since the Unix-epoch(January 1,1970);this UTC value is invariant between time zone conversions:

    stamp_mscow.value
    
    1299891600000000000
    
    stamp_mscow.tz_convert('America/New_York')
    
    Timestamp('2011-03-11 20:00:00-0500', tz='America/New_York')
    

    When performing time arithmetic using pandas's DateOffset objects,pandas respects daylight saving time transitions where possible.Here we construct timestamps that occur right before DST transitions.First,30 minutes before transitioning to DST:

    from pandas.tseries.offsets import Hour
    
    stamp=pd.Timestamp('2012-03-12 01:30',tz='US/Eastern')
    
    stamp
    
    Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')
    
    stamp+Hour()
    
    Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')
    

    Then ,90 minutes before transitioning out of DST:

    stamp=pd.Timestamp('2012-10-04 00:30',tz='US/Eastern')
    
    stamp
    
    Timestamp('2012-10-04 00:30:00-0400', tz='US/Eastern')
    
    stamp+2*Hour()  ### 02:30
    
    Timestamp('2012-10-04 02:30:00-0400', tz='US/Eastern')
    
    stamp=pd.Timestamp('2012-11-04 00:30',tz='US/Eastern')
    
    stamp
    
    Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')
    
    stamp+2*Hour()  #01:30
    
    Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')
    

    Operations between different time zones

    If two time series with different time zones are combined,the result will be UTC.Since the timestamps are stored under the hood in UTC,this is a straightforward operation and requires no conversion to happen:

    rng=pd.date_range('3/7/2012 9:30',periods=10,freq='B')
    
    ts=pd.Series(np.random.randn(len(rng)),index=rng)
    
    ts1=ts[:7].tz_localize('Europe/London')
    
    ts2=ts1[2:].tz_convert('Europe/Moscow')
    
    result=ts1+ts2
    
    result.index
    
    DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
                   '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
                   '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
                   '2012-03-15 09:30:00+00:00'],
                  dtype='datetime64[ns, UTC]', freq='B')
    

    Periods and period arithmetic

    Periods represent timespans,like days,months,quarters,or years.The Periodclass represents this data type,requiring a string or integer.

    p=pd.Period(2007,freq='A-DEC') #Annual dates anchored on last calendar day of given month,yearEnd
    
    p
    
    Period('2007', 'A-DEC')
    

    In this case,the Period object represents the full timespan from January1,2007,to December 31,2007,inclusive.Conveniently,adding integers from periods has the effect of shiftting by their frequency:

    p+5
    
    Period('2012', 'A-DEC')
    
    p-2
    
    Period('2005', 'A-DEC')
    

    If two periods have the same frequency,their difference is the number of units between them:

    pd.Period('2014','A-DEC')-p
    
    <7 * YearEnds: month=12>
    

    Regular ranges of periods can be constructed with the period_range function:

    rng=pd.period_range('2000-01-01','2000-06-30',freq='M')
    
    rng
    
    PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')
    

    The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure:

    pd.Series(np.random.randn(6),index=rng)
    
    2000-01    1.337147
    2000-02   -0.201512
    2000-03   -0.261829
    2000-04    0.124229
    2000-05   -0.723703
    2000-06   -2.130917
    Freq: M, dtype: float64
    

    If you have an array of strings,you can also use the PeriodIndex class:

    values=['2001Q3','2002Q2','2003Q1']
    
    index=pd.PeriodIndex(values,freq='Q-DEC') # Quarterly dates anchored on last calendar day of each month,for year ending in indicated month.
    
    index
    
    PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')
    

    Period frequency convension

    Periods and PeriodIndex objects can be converted to another frequency with their asfreq method.As an example,suppose we had an annual period and wanted to convert it into a monthly period either at the start or end of the year.This is fairly straightforward.

    p=pd.Period('2007',freq='A-DEC') # Annual period,ending with the last day of 2007-12
    
    p
    
    Period('2007', 'A-DEC')
    
    p.asfreq('M',how='start')
    
    Period('2007-01', 'M')
    
    p.asfreq('M',how='end')
    
    Period('2007-12', 'M')
    

    You can think of Period('2007','A-DEC') as being a sort of cursor pointing to a span of time,subdivided by monthly periods.

    p=pd.Period('2007','A-JUN') #Annual period,ending with the last day of 2007-06
    
    p.asfreq('M','start')
    
    Period('2006-07', 'M')
    
    p.asfreq('M','end')
    
    Period('2007-06', 'M')
    

    Whole PeriodIndex objects or time series can be similarly converted with the same semantics:

    rng=pd.period_range('2006','2009',freq='A-DEC')
    
    ts=pd.Series(np.random.randn(len(rng)),index=rng)
    
    ts
    
    2006   -0.297145
    2007   -0.304496
    2008    0.705818
    2009   -1.829369
    Freq: A-DEC, dtype: float64
    
    ts.asfreq('M',how='start')
    
    2006-01   -0.297145
    2007-01   -0.304496
    2008-01    0.705818
    2009-01   -1.829369
    Freq: M, dtype: float64
    

    Here,the annual periods are replaced with monthly periods corresponding to the first month falling within each annual period.If we instead wanted the last business day of each year, we can use 'B' frequency and indicate that we want the end of the period:

    ts.asfreq('B',how='end')
    
    2006-12-29   -0.297145
    2007-12-31   -0.304496
    2008-12-31    0.705818
    2009-12-31   -1.829369
    Freq: B, dtype: float64
    

    Quarterly period frequencies

    Quarterly data is stanard in accounting,finance,and other fields.Much quarterly data is reported relative to a fiscal year end,typically the last calender or business day of one of the 12 months of the year.Thus,the period 2012Q4 has a different meaning depending on fiscal year end.pandas supports all 12 possible quarterly frequencies as Q-JAN through Q-DEC.

    p=pd.Period('2012Q4',freq='Q-JAN')
    
    p
    
    Period('2012Q4', 'Q-JAN')
    

    In the case of fiscal year ending in January. 2012Q4 runs from November through January,whcin you can check by converting to daily frequency.

    p.asfreq('D','start') #check p by converting to daily frequency.
    
    Period('2011-11-01', 'D')
    
    p.asfreq('D','end')
    
    Period('2012-01-31', 'D')
    

    Thus,it's possible to do easy period arithmetic,for example,to get the timestamp at 4PM on the second-to-last business day of the quarter,

    p4pm=(p.asfreq('B','e')-1).asfreq('T','s')+16*60  #'T' means 'Minute'
    
    p4pm
    
    Period('2012-01-30 16:00', 'T')
    
    p4pm.to_timestamp()
    
    Timestamp('2012-01-30 16:00:00')
    

    You can generate quarterly ranges using period_range.

    rng=pd.period_range('2011Q3','2012Q4',freq='Q-JAN')
    
    ts=pd.Series(np.arange(len(rng)),index=rng)
    
    ts
    
    2011Q3    0
    2011Q4    1
    2012Q1    2
    2012Q2    3
    2012Q3    4
    2012Q4    5
    Freq: Q-JAN, dtype: int32
    
    new_rng=(rng.asfreq('B','e')-1).asfreq('T','s')+16*60
    
    ts.index=new_rng.to_timestamp()
    
    ts
    
    2010-10-28 16:00:00    0
    2011-01-28 16:00:00    1
    2011-04-28 16:00:00    2
    2011-07-28 16:00:00    3
    2011-10-28 16:00:00    4
    2012-01-30 16:00:00    5
    dtype: int32
    

    Converting Timestamps to Periods(and Back

    Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period method:

    rng=pd.date_range('2000-01-01',periods=3,freq='M')
    
    ts=pd.Series(np.random.randn(3),index=rng);ts
    
    2000-01-31   -0.115457
    2000-02-29   -0.318769
    2000-03-31    0.166398
    Freq: M, dtype: float64
    
    pts=ts.to_period();pts
    
    2000-01   -0.115457
    2000-02   -0.318769
    2000-03    0.166398
    Freq: M, dtype: float64
    
    type(pts)
    
    pandas.core.series.Series
    

    Since periods refer to non-overlapping timespans,a timestamp can only belong to a single period for a given frequency.While the frequency of the new PeriodIndex is inferred from the timestamp by default,you can specify any frequency you want.

    rng=pd.date_range('1/29/2000',periods=6,freq='D')
    
    ts2=pd.Series(np.random.randn(6),index=rng)
    
    ts2
    
    2000-01-29    0.511537
    2000-01-30    2.661260
    2000-01-31    0.954388
    2000-02-01   -0.903825
    2000-02-02   -0.399345
    2000-02-03    1.160727
    Freq: D, dtype: float64
    
    ts2.to_period('M')
    
    2000-01    0.511537
    2000-01    2.661260
    2000-01    0.954388
    2000-02   -0.903825
    2000-02   -0.399345
    2000-02    1.160727
    Freq: M, dtype: float64
    

    To convert back to timestamps,use to_timestamp:

    pts=ts2.to_period()
    
    pts.to_timestamp(how='end')
    
    2000-01-29 23:59:59.999999999    0.511537
    2000-01-30 23:59:59.999999999    2.661260
    2000-01-31 23:59:59.999999999    0.954388
    2000-02-01 23:59:59.999999999   -0.903825
    2000-02-02 23:59:59.999999999   -0.399345
    2000-02-03 23:59:59.999999999    1.160727
    Freq: D, dtype: float64
    

    Creating a PeriodIndex from Arrays

    data=pd.read_csv('.pydata-book-2nd-editionexamplesmacrodata.csv')
    
    data.head(5)
    
    year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
    0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
    1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
    2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
    3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
    4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
    data.year
    
    0      1959.0
    1      1959.0
    2      1959.0
    3      1959.0
    4      1960.0
            ...  
    198    2008.0
    199    2008.0
    200    2009.0
    201    2009.0
    202    2009.0
    Name: year, Length: 203, dtype: float64
    
    data.quarter
    
    0      1.0
    1      2.0
    2      3.0
    3      4.0
    4      1.0
          ... 
    198    3.0
    199    4.0
    200    1.0
    201    2.0
    202    3.0
    Name: quarter, Length: 203, dtype: float64
    

    By passing these arrays to PeriodIndex with a frequency,you can combine them to form an index for the DataFrame:

    index=pd.PeriodIndex(year=data.year,quarter=data.quarter,freq='Q-DEC')
    
    index
    
    PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
                 '1960Q3', '1960Q4', '1961Q1', '1961Q2',
                 ...
                 '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
                 '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
                dtype='period[Q-DEC]', length=203, freq='Q-DEC')
    
    data.index=index
    
    
    data.infl
    
    1959Q1    0.00
    1959Q2    2.34
    1959Q3    2.74
    1959Q4    0.27
    1960Q1    2.31
              ... 
    2008Q3   -3.16
    2008Q4   -8.79
    2009Q1    0.94
    2009Q2    3.37
    2009Q3    3.56
    Freq: Q-DEC, Name: infl, Length: 203, dtype: float64
    

    Resampling and frequency conversion

    Resampling refers to the process of converting a time series from one frequency to another.Aggregating higher frequency data to lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling. Not all resampling falls into either of these categories;

    pandas objects are equipped with a resample method,which is the workhourse function for all frequency conversion.resample has a similar API to groupby;you can call resemple to group the data,then call an aggregate functions:

    rng=pd.date_range('2000-01-01',periods=5,freq='D')
    
    ts=pd.Series(np.arange(5),index=rng)
    
    ts.head()
    
    2000-01-01    0
    2000-01-02    1
    2000-01-03    2
    2000-01-04    3
    2000-01-05    4
    Freq: D, dtype: int32
    
    ts.resample('M').mean()
    
    2000-01-31    2
    Freq: M, dtype: int32
    
    ts.resample('M',kind='period').mean()
    
    2000-01    2
    Freq: M, dtype: int32
    
    help(pd.Series.resample)
    
    Help on function resample in module pandas.core.generic:
    
    resample(self, rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)
        Resample time-series data.
        
        Convenience method for frequency conversion and resampling of time
        series. Object must have a datetime-like index (`DatetimeIndex`,
        `PeriodIndex`, or `TimedeltaIndex`), or pass datetime-like values
        to the `on` or `level` keyword.
        
        Parameters
        ----------
        rule : DateOffset, Timedelta or str
            The offset string or object representing target conversion.
        how : str
            Method for down/re-sampling, default to 'mean' for downsampling.
        
            .. deprecated:: 0.18.0
               The new syntax is ``.resample(...).mean()``, or
               ``.resample(...).apply(<func>)``
        axis : {0 or 'index', 1 or 'columns'}, default 0
            Which axis to use for up- or down-sampling. For `Series` this
            will default to 0, i.e. along the rows. Must be
            `DatetimeIndex`, `TimedeltaIndex` or `PeriodIndex`.
        fill_method : str, default None
            Filling method for upsampling.
        
            .. deprecated:: 0.18.0
               The new syntax is ``.resample(...).<func>()``,
               e.g. ``.resample(...).pad()``
        closed : {'right', 'left'}, default None
            Which side of bin interval is closed. The default is 'left'
            for all frequency offsets except for 'M', 'A', 'Q', 'BM',
            'BA', 'BQ', and 'W' which all have a default of 'right'.
        label : {'right', 'left'}, default None
            Which bin edge label to label bucket with. The default is 'left'
            for all frequency offsets except for 'M', 'A', 'Q', 'BM',
            'BA', 'BQ', and 'W' which all have a default of 'right'.
        convention : {'start', 'end', 's', 'e'}, default 'start'
            For `PeriodIndex` only, controls whether to use the start or
            end of `rule`.
        kind : {'timestamp', 'period'}, optional, default None
            Pass 'timestamp' to convert the resulting index to a
            `DateTimeIndex` or 'period' to convert it to a `PeriodIndex`.
            By default the input representation is retained.
        loffset : timedelta, default None
            Adjust the resampled time labels.
        limit : int, default None
            Maximum size gap when reindexing with `fill_method`.
        
            .. deprecated:: 0.18.0
        base : int, default 0
            For frequencies that evenly subdivide 1 day, the "origin" of the
            aggregated intervals. For example, for '5min' frequency, base could
            range from 0 through 4. Defaults to 0.
        on : str, optional
            For a DataFrame, column to use instead of index for resampling.
            Column must be datetime-like.
        
            .. versionadded:: 0.19.0
        
        level : str or int, optional
            For a MultiIndex, level (name or number) to use for
            resampling. `level` must be datetime-like.
        
            .. versionadded:: 0.19.0
        
        Returns
        -------
        Resampler object
        
        See Also
        --------
        groupby : Group by mapping, function, label, or list of labels.
        Series.resample : Resample a Series.
        DataFrame.resample: Resample a DataFrame.
        
        Notes
        -----
        See the `user guide
        <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling>`_
        for more.
        
        To learn more about the offset strings, please see `this link
        <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects>`__.
        
        Examples
        --------
        
        Start by creating a series with 9 one minute timestamps.
        
        >>> index = pd.date_range('1/1/2000', periods=9, freq='T')
        >>> series = pd.Series(range(9), index=index)
        >>> series
        2000-01-01 00:00:00    0
        2000-01-01 00:01:00    1
        2000-01-01 00:02:00    2
        2000-01-01 00:03:00    3
        2000-01-01 00:04:00    4
        2000-01-01 00:05:00    5
        2000-01-01 00:06:00    6
        2000-01-01 00:07:00    7
        2000-01-01 00:08:00    8
        Freq: T, dtype: int64
        
        Downsample the series into 3 minute bins and sum the values
        of the timestamps falling into a bin.
        
        >>> series.resample('3T').sum()
        2000-01-01 00:00:00     3
        2000-01-01 00:03:00    12
        2000-01-01 00:06:00    21
        Freq: 3T, dtype: int64
        
        Downsample the series into 3 minute bins as above, but label each
        bin using the right edge instead of the left. Please note that the
        value in the bucket used as the label is not included in the bucket,
        which it labels. For example, in the original series the
        bucket ``2000-01-01 00:03:00`` contains the value 3, but the summed
        value in the resampled bucket with the label ``2000-01-01 00:03:00``
        does not include 3 (if it did, the summed value would be 6, not 3).
        To include this value close the right side of the bin interval as
        illustrated in the example below this one.
        
        >>> series.resample('3T', label='right').sum()
        2000-01-01 00:03:00     3
        2000-01-01 00:06:00    12
        2000-01-01 00:09:00    21
        Freq: 3T, dtype: int64
        
        Downsample the series into 3 minute bins as above, but close the right
        side of the bin interval.
        
        >>> series.resample('3T', label='right', closed='right').sum()
        2000-01-01 00:00:00     0
        2000-01-01 00:03:00     6
        2000-01-01 00:06:00    15
        2000-01-01 00:09:00    15
        Freq: 3T, dtype: int64
        
        Upsample the series into 30 second bins.
        
        >>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows
        2000-01-01 00:00:00   0.0
        2000-01-01 00:00:30   NaN
        2000-01-01 00:01:00   1.0
        2000-01-01 00:01:30   NaN
        2000-01-01 00:02:00   2.0
        Freq: 30S, dtype: float64
        
        Upsample the series into 30 second bins and fill the ``NaN``
        values using the ``pad`` method.
        
        >>> series.resample('30S').pad()[0:5]
        2000-01-01 00:00:00    0
        2000-01-01 00:00:30    0
        2000-01-01 00:01:00    1
        2000-01-01 00:01:30    1
        2000-01-01 00:02:00    2
        Freq: 30S, dtype: int64
        
        Upsample the series into 30 second bins and fill the
        ``NaN`` values using the ``bfill`` method.
        
        >>> series.resample('30S').bfill()[0:5]
        2000-01-01 00:00:00    0
        2000-01-01 00:00:30    1
        2000-01-01 00:01:00    1
        2000-01-01 00:01:30    2
        2000-01-01 00:02:00    2
        Freq: 30S, dtype: int64
        
        Pass a custom function via ``apply``
        
        >>> def custom_resampler(array_like):
        ...     return np.sum(array_like) + 5
        ...
        >>> series.resample('3T').apply(custom_resampler)
        2000-01-01 00:00:00     8
        2000-01-01 00:03:00    17
        2000-01-01 00:06:00    26
        Freq: 3T, dtype: int64
        
        For a Series with a PeriodIndex, the keyword `convention` can be
        used to control whether to use the start or end of `rule`.
        
        Resample a year by quarter using 'start' `convention`. Values are
        assigned to the first quarter of the period.
        
        >>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
        ...                                             freq='A',
        ...                                             periods=2))
        >>> s
        2012    1
        2013    2
        Freq: A-DEC, dtype: int64
        >>> s.resample('Q', convention='start').asfreq()
        2012Q1    1.0
        2012Q2    NaN
        2012Q3    NaN
        2012Q4    NaN
        2013Q1    2.0
        2013Q2    NaN
        2013Q3    NaN
        2013Q4    NaN
        Freq: Q-DEC, dtype: float64
        
        Resample quarters by month using 'end' `convention`. Values are
        assigned to the last month of the period.
        
        >>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',
        ...                                                   freq='Q',
        ...                                                   periods=4))
        >>> q
        2018Q1    1
        2018Q2    2
        2018Q3    3
        2018Q4    4
        Freq: Q-DEC, dtype: int64
        >>> q.resample('M', convention='end').asfreq()
        2018-03    1.0
        2018-04    NaN
        2018-05    NaN
        2018-06    2.0
        2018-07    NaN
        2018-08    NaN
        2018-09    3.0
        2018-10    NaN
        2018-11    NaN
        2018-12    4.0
        Freq: M, dtype: float64
        
        For DataFrame objects, the keyword `on` can be used to specify the
        column instead of the index for resampling.
        
        >>> d = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
        ...           'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
        >>> df = pd.DataFrame(d)
        >>> df['week_starting'] = pd.date_range('01/01/2018',
        ...                                     periods=8,
        ...                                     freq='W')
        >>> df
           price  volume week_starting
        0     10      50    2018-01-07
        1     11      60    2018-01-14
        2      9      40    2018-01-21
        3     13     100    2018-01-28
        4     14      50    2018-02-04
        5     18     100    2018-02-11
        6     17      40    2018-02-18
        7     19      50    2018-02-25
        >>> df.resample('M', on='week_starting').mean()
                       price  volume
        week_starting
        2018-01-31     10.75    62.5
        2018-02-28     17.00    60.0
        
        For a DataFrame with MultiIndex, the keyword `level` can be used to
        specify on which level the resampling needs to take place.
        
        >>> days = pd.date_range('1/1/2000', periods=4, freq='D')
        >>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
        ...            'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
        >>> df2 = pd.DataFrame(d2,
        ...                    index=pd.MultiIndex.from_product([days,
        ...                                                     ['morning',
        ...                                                      'afternoon']]
        ...                                                     ))
        >>> df2
                              price  volume
        2000-01-01 morning       10      50
                   afternoon     11      60
        2000-01-02 morning        9      40
                   afternoon     13     100
        2000-01-03 morning       14      50
                   afternoon     18     100
        2000-01-04 morning       17      40
                   afternoon     19      50
        >>> df2.resample('D', level=0).sum()
                    price  volume
        2000-01-01     21     110
        2000-01-02     22     140
        2000-01-03     32     150
        2000-01-04     36      90
    
    help(pd.MultiIndex.from_product)
    
    Help on method from_product in module pandas.core.indexes.multi:
    
    from_product(iterables, sortorder=None, names=None) method of builtins.type instance
        Make a MultiIndex from the cartesian product of multiple iterables.
        
        Parameters
        ----------
        iterables : list / sequence of iterables
            Each iterable has unique labels for each level of the index.
        sortorder : int or None
            Level of sortedness (must be lexicographically sorted by that
            level).
        names : list / sequence of str, optional
            Names for the levels in the index.
        
        Returns
        -------
        index : MultiIndex
        
        See Also
        --------
        MultiIndex.from_arrays : Convert list of arrays to MultiIndex.
        MultiIndex.from_tuples : Convert list of tuples to MultiIndex.
        MultiIndex.from_frame : Make a MultiIndex from a DataFrame.
        
        Examples
        --------
        >>> numbers = [0, 1, 2]
        >>> colors = ['green', 'purple']
        >>> pd.MultiIndex.from_product([numbers, colors],
        ...                            names=['number', 'color'])
        MultiIndex([(0,  'green'),
                    (0, 'purple'),
                    (1,  'green'),
                    (1, 'purple'),
                    (2,  'green'),
                    (2, 'purple')],
                   names=['number', 'color'])
    
    tngt=pd.date_range('1/1/2020',periods=9,freq='T')
    
    tngt
    
    DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:01:00',
                   '2020-01-01 00:02:00', '2020-01-01 00:03:00',
                   '2020-01-01 00:04:00', '2020-01-01 00:05:00',
                   '2020-01-01 00:06:00', '2020-01-01 00:07:00',
                   '2020-01-01 00:08:00'],
                  dtype='datetime64[ns]', freq='T')
    
    sert=pd.Series(np.arange(9),index=tngt)
    
    sert
    
    2020-01-01 00:00:00    0
    2020-01-01 00:01:00    1
    2020-01-01 00:02:00    2
    2020-01-01 00:03:00    3
    2020-01-01 00:04:00    4
    2020-01-01 00:05:00    5
    2020-01-01 00:06:00    6
    2020-01-01 00:07:00    7
    2020-01-01 00:08:00    8
    Freq: T, dtype: int32
    
    sert.resample('30S')
    
    <pandas.core.resample.DatetimeIndexResampler object at 0x0000014F20423860>
    
    help(pd.core.resample.DatetimeIndexResampler.asfreq)
    
    Help on function asfreq in module pandas.core.resample:
    
    asfreq(self, fill_value=None)
        Return the values at the new freq, essentially a reindex.
        
        Parameters
        ----------
        fill_value : scalar, optional
            Value to use for missing values, applied during upsampling (note
            this does not fill NaNs that already were present).
        
            .. versionadded:: 0.20.0
        
        Returns
        -------
        DataFrame or Series
            Values at the specified freq.
        
        See Also
        --------
        Series.asfreq
        DataFrame.asfreq
    
    sert.resample('30S').asfreq()
    
    2020-01-01 00:00:00    0.0
    2020-01-01 00:00:30    NaN
    2020-01-01 00:01:00    1.0
    2020-01-01 00:01:30    NaN
    2020-01-01 00:02:00    2.0
    2020-01-01 00:02:30    NaN
    2020-01-01 00:03:00    3.0
    2020-01-01 00:03:30    NaN
    2020-01-01 00:04:00    4.0
    2020-01-01 00:04:30    NaN
    2020-01-01 00:05:00    5.0
    2020-01-01 00:05:30    NaN
    2020-01-01 00:06:00    6.0
    2020-01-01 00:06:30    NaN
    2020-01-01 00:07:00    7.0
    2020-01-01 00:07:30    NaN
    2020-01-01 00:08:00    8.0
    Freq: 30S, dtype: float64
    
    ts.resample('M',kind='period').mean()
    
    2000-01    2
    Freq: M, dtype: int32
    
    ts
    
    2000-01-01    0
    2000-01-02    1
    2000-01-03    2
    2000-01-04    3
    2000-01-05    4
    Freq: D, dtype: int32
    
    ts.resample('M',kind='timestamp').mean()
    
    2000-01-31    2
    Freq: M, dtype: int32
    
    ts.resample('M').mean()
    
    2000-01-31    2
    Freq: M, dtype: int32
    

    resample is a flexible and high-performance method that can be used to process very large time series.

    The argument and its meaning:

    • freq: string or DateOffset indicating desired resampled frequency(e.g 'M','5min' ,or Second(15)
    • axis : axis to resample on;default axis=0
    • fill_method: how to interpolate when upsampling,as in 'ffill' or 'bfill'; by default does not interpolation
    • closed: in downsampling,which end of each interval is closed(inclusive),'right' or 'left'
    • label: in downsampling,how to label the aggregated result,with the 'right' or 'left' bin edge(e.g, the 9:30 to 9:35 five-minute interval could be labeled 9:30 or 9:35)
    • loffset:Time adjustment to the bin labels,such as '-1s'/Second(-1) to shift the aggregate labels one second erlier
    • limit:when forward or backward filling,the maxinum number of periods to fill
    • kind:Aggregate to periods('period') or timestamps('timestamp');default to the type of index the time series has
    • convention: when resampling periods,the convention('start' or 'end') for converting the low-frequency period to high frequency;defaults to 'end'.

    Downsampling

    Aggregating data to a regular,lower frequency is a pretty normal time series task.The data you are aggregating does not need to be fixed frequently;the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate.For example, to convert to monthly,'M' or 'BM',you need to chop up the data into one-month interval,and the union of the intervals must make up the whole time frame.There are a couple things to think about when using resemple to downsample data:

    • Which side of each interval is closed
    • How to label each aggregated bin,either with the start of the interval or the end
    rng=pd.date_range('2000-01-01',periods=12,freq='T')
    
    ts=pd.Series(np.arange(12),index=rng)
    
    ts
    
    2000-01-01 00:00:00     0
    2000-01-01 00:01:00     1
    2000-01-01 00:02:00     2
    2000-01-01 00:03:00     3
    2000-01-01 00:04:00     4
    2000-01-01 00:05:00     5
    2000-01-01 00:06:00     6
    2000-01-01 00:07:00     7
    2000-01-01 00:08:00     8
    2000-01-01 00:09:00     9
    2000-01-01 00:10:00    10
    2000-01-01 00:11:00    11
    Freq: T, dtype: int32
    
    ts.resample('5min',closed='right').sum() # when closed='right',bin is the form of (],the index of ts is the right side of bin
    #(23:55:00],(00:00,00:05],(00:05,00:10],(00:10,00:15]
    
    1999-12-31 23:55:00     0
    2000-01-01 00:00:00    15
    2000-01-01 00:05:00    40
    2000-01-01 00:10:00    11
    Freq: 5T, dtype: int32
    
    ts.resample('5min',closed='right',label='right').sum() #The same as above,but showing labels
    #(00:00,00:05],(00:05,00:10],(00:10,00:15]
    
    2000-01-01 00:00:00     0
    2000-01-01 00:05:00    15
    2000-01-01 00:10:00    40
    2000-01-01 00:15:00    11
    Freq: 5T, dtype: int32
    
    ts.resample('5min',closed='left').sum()# [00:00,00:05),[00:05,00:10),[00:10,00:15), when closed='left',bin is the form of [),and 
    # the index of ts is the left side of bin.
    
    2000-01-01 00:00:00    10
    2000-01-01 00:05:00    35
    2000-01-01 00:10:00    21
    Freq: 5T, dtype: int32
    
    ts.resample('5min',closed='left',label='right').sum()# The same as above,but showing labels.
    
    2000-01-01 00:05:00    10
    2000-01-01 00:10:00    35
    2000-01-01 00:15:00    21
    Freq: 5T, dtype: int32
    

    Lastly,you might want to shift the result index by some amount,say subtracting one second from the right edge to make it more clear which interval the timestamp refers to.To do this,pass a string or date offset to loffset:

    ts.resample('5min',closed='right',label='right',loffset='-1s').sum()
    
    1999-12-31 23:59:59     0
    2000-01-01 00:04:59    15
    2000-01-01 00:09:59    40
    2000-01-01 00:14:59    11
    Freq: 5T, dtype: int32
    

    You also could have accomplished the effect of loffset by calling the shift method on the result without the loffset.

    ts.resample('5min',closed='right').sum().shift(1,'-1s')
    
    1999-12-31 23:54:59     0
    1999-12-31 23:59:59    15
    2000-01-01 00:04:59    40
    2000-01-01 00:09:59    11
    Freq: 5T, dtype: int32
    

    Open-High-Low-Close(OHLC) resampling

    In finance,a popular way to aggregate a time series is to compute four values for each bucket:the first(open),last(close),maximum(high),minimal(low) values.By using the ohlc aggregate function you will obtain a DataFrame having columns containing these four aggregates,which are effectively computed in a single sweep of the data:

    ts.resample('5min').ohlc()
    
    open high low close
    2000-01-01 00:00:00 0 4 0 4
    2000-01-01 00:05:00 5 9 5 9
    2000-01-01 00:10:00 10 11 10 11

    Upsampling and interpolation

    When converting from a low frequency to a higher frequency,no aggregation is needed.

    frame=pd.DataFrame(np.random.randn(2,4),index=pd.date_range('1/1/2000',periods=2,freq='W-WED'),columns=['Colorado','Texa','New York','Ohio'])
    
    frame
    
    Colorado Texa New York Ohio
    2000-01-05 -0.739109 0.781223 0.570884 -1.616556
    2000-01-12 -0.179902 2.531560 -2.658008 0.946870

    When you are using an aggregation function with this data,there is only one value per group,and missinig values result in the gaps.We use the
    asfreq mehtond to convert the higher frequency without any aggregation:

    df_daily=frame.resample('D').asfreq()
    
    df_daily
    
    Colorado Texa New York Ohio
    2000-01-05 -0.739109 0.781223 0.570884 -1.616556
    2000-01-06 NaN NaN NaN NaN
    2000-01-07 NaN NaN NaN NaN
    2000-01-08 NaN NaN NaN NaN
    2000-01-09 NaN NaN NaN NaN
    2000-01-10 NaN NaN NaN NaN
    2000-01-11 NaN NaN NaN NaN
    2000-01-12 -0.179902 2.531560 -2.658008 0.946870

    Suppose you wanted to fill forward each weekly value on the non-Wednesdays.The same filling or interpolation methods available in the fillna and reindex methods are available for resampling:

    frame.resample('D').ffill()
    
    Colorado Texa New York Ohio
    2000-01-05 -0.739109 0.781223 0.570884 -1.616556
    2000-01-06 -0.739109 0.781223 0.570884 -1.616556
    2000-01-07 -0.739109 0.781223 0.570884 -1.616556
    2000-01-08 -0.739109 0.781223 0.570884 -1.616556
    2000-01-09 -0.739109 0.781223 0.570884 -1.616556
    2000-01-10 -0.739109 0.781223 0.570884 -1.616556
    2000-01-11 -0.739109 0.781223 0.570884 -1.616556
    2000-01-12 -0.179902 2.531560 -2.658008 0.946870

    You can similarly choose to only fill a certain number of periods forward to limit how far to continue using an observed value:

    frame.resample('D').ffill(limit=2)
    
    Colorado Texa New York Ohio
    2000-01-05 -0.739109 0.781223 0.570884 -1.616556
    2000-01-06 -0.739109 0.781223 0.570884 -1.616556
    2000-01-07 -0.739109 0.781223 0.570884 -1.616556
    2000-01-08 NaN NaN NaN NaN
    2000-01-09 NaN NaN NaN NaN
    2000-01-10 NaN NaN NaN NaN
    2000-01-11 NaN NaN NaN NaN
    2000-01-12 -0.179902 2.531560 -2.658008 0.946870

    Notably,the new date index need not overlap with the old one at all:

    frame.resample('W-THU').ffill()
    
    Colorado Texa New York Ohio
    2000-01-06 -0.739109 0.781223 0.570884 -1.616556
    2000-01-13 -0.179902 2.531560 -2.658008 0.946870

    Resample with periods

    Resampling data indexed by periods is similar to timestamps:

    frame=pd.DataFrame(np.random.randn(24,4),index=pd.period_range('1-2000','12-2001',freq='M'),columns=['Colorado','Texas','New York','Ohio'])
    
    frame[:5]
    
    Colorado Texas New York Ohio
    2000-01 -0.829993 -1.129430 1.320036 0.275144
    2000-02 2.511115 -0.306541 0.472983 0.220395
    2000-03 -0.037656 0.776638 0.428096 -0.274698
    2000-04 -1.116895 -0.353303 -0.642274 1.469136
    2000-05 0.975105 -1.160983 0.459956 -0.834690
    annual_frame=frame.resample('A-DEC').mean()
    
    annual_frame
    
    Colorado Texas New York Ohio
    2000 0.002307 -0.277951 -0.030337 -0.273060
    2001 0.126581 0.275503 -0.209550 0.241073

    Upsampling is more nuance ,as you must make a decision about which end of the timespan in the new frequency to place the values before resampling,just like the asfreq method.The convention argument defaults to start but can also be end.

    annual_frame.resample('Q-DEC').ffill()
    
    Colorado Texas New York Ohio
    2000Q1 0.002307 -0.277951 -0.030337 -0.273060
    2000Q2 0.002307 -0.277951 -0.030337 -0.273060
    2000Q3 0.002307 -0.277951 -0.030337 -0.273060
    2000Q4 0.002307 -0.277951 -0.030337 -0.273060
    2001Q1 0.126581 0.275503 -0.209550 0.241073
    2001Q2 0.126581 0.275503 -0.209550 0.241073
    2001Q3 0.126581 0.275503 -0.209550 0.241073
    2001Q4 0.126581 0.275503 -0.209550 0.241073
    annual_frame.resample('Q-DEC',convention='end').ffill()
    
    Colorado Texas New York Ohio
    2000Q4 0.002307 -0.277951 -0.030337 -0.273060
    2001Q1 0.002307 -0.277951 -0.030337 -0.273060
    2001Q2 0.002307 -0.277951 -0.030337 -0.273060
    2001Q3 0.002307 -0.277951 -0.030337 -0.273060
    2001Q4 0.126581 0.275503 -0.209550 0.241073

    Moving window functions

    An important class of array transformations used for time Series operations are statistics and other functinos evaluated over a sliding window or with exponentially decaying weights.This can be useful for smoothing noisy or gappy data.

    close_px_all=pd.read_csv(r'./pydata-book-2nd-edition/examples/stock_px_2.csv',parse_dates=True,index_col=0)
    
    close_px_all.head()
    
    AAPL MSFT XOM SPX
    2003-01-02 7.40 21.11 29.22 909.03
    2003-01-03 7.45 21.14 29.24 908.59
    2003-01-06 7.45 21.52 29.96 929.01
    2003-01-07 7.43 21.93 28.95 922.93
    2003-01-08 7.28 21.31 28.83 909.93
    close_px=close_px_all[['AAPL','MSFT','XOM']]
    
    close_px=close_px.resample('B').ffill()
    

    rollingoperator behaves similarly to resample and groupby.It can be called on a Series or DataFrame along with a window:

    close_px.AAPL.plot()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x14f2085ccf8>
    

    png

    close_px.AAPL.rolling(300).mean().plot()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x14f20abec50>
    

    png

    close_px.AAPL.rolling(250).mean().plot()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x14f20b7fb00>
    

    png

    help(close_px.AAPL.rolling)
    
    Help on method rolling in module pandas.core.generic:
    
    rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None) method of pandas.core.series.Series instance
        Provide rolling window calculations.
        
        .. versionadded:: 0.18.0
        
        Parameters
        ----------
        window : int, or offset
            Size of the moving window. This is the number of observations used for
            calculating the statistic. Each window will be a fixed size.
        
            If its an offset then this will be the time period of each window. Each
            window will be a variable sized based on the observations included in
            the time-period. This is only valid for datetimelike indexes. This is
            new in 0.19.0
        min_periods : int, default None
            Minimum number of observations in window required to have a value
            (otherwise result is NA). For a window that is specified by an offset,
            `min_periods` will default to 1. Otherwise, `min_periods` will default
            to the size of the window.
        center : bool, default False
            Set the labels at the center of the window.
        win_type : str, default None
            Provide a window type. If ``None``, all points are evenly weighted.
            See the notes below for further information.
        on : str, optional
            For a DataFrame, a datetime-like column on which to calculate the rolling
            window, rather than the DataFrame's index. Provided integer column is
            ignored and excluded from result since an integer index is not used to
            calculate the rolling window.
        axis : int or str, default 0
        closed : str, default None
            Make the interval closed on the 'right', 'left', 'both' or
            'neither' endpoints.
            For offset-based windows, it defaults to 'right'.
            For fixed windows, defaults to 'both'. Remaining cases not implemented
            for fixed windows.
        
            .. versionadded:: 0.20.0
        
        Returns
        -------
        a Window or Rolling sub-classed for the particular operation
        
        See Also
        --------
        expanding : Provides expanding transformations.
        ewm : Provides exponential weighted functions.
        
        Notes
        -----
        By default, the result is set to the right edge of the window. This can be
        changed to the center of the window by setting ``center=True``.
        
        To learn more about the offsets & frequency strings, please see `this link
        <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
        
        The recognized win_types are:
        
        * ``boxcar``
        * ``triang``
        * ``blackman``
        * ``hamming``
        * ``bartlett``
        * ``parzen``
        * ``bohman``
        * ``blackmanharris``
        * ``nuttall``
        * ``barthann``
        * ``kaiser`` (needs beta)
        * ``gaussian`` (needs std)
        * ``general_gaussian`` (needs power, width)
        * ``slepian`` (needs width)
        * ``exponential`` (needs tau), center is set to None.
        
        If ``win_type=None`` all points are evenly weighted. To learn more about
        different window types see `scipy.signal window functions
        <https://docs.scipy.org/doc/scipy/reference/signal.html#window-functions>`__.
        
        Examples
        --------
        
        >>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
        >>> df
             B
        0  0.0
        1  1.0
        2  2.0
        3  NaN
        4  4.0
        
        Rolling sum with a window length of 2, using the 'triang'
        window type.
        
        >>> df.rolling(2, win_type='triang').sum()
             B
        0  NaN
        1  0.5
        2  1.5
        3  NaN
        4  NaN
        
        Rolling sum with a window length of 2, min_periods defaults
        to the window length.
        
        >>> df.rolling(2).sum()
             B
        0  NaN
        1  1.0
        2  3.0
        3  NaN
        4  NaN
        
        Same as above, but explicitly set the min_periods
        
        >>> df.rolling(2, min_periods=1).sum()
             B
        0  0.0
        1  1.0
        2  3.0
        3  2.0
        4  4.0
        
        A ragged (meaning not-a-regular frequency), time-indexed DataFrame
        
        >>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
        ...                   index = [pd.Timestamp('20130101 09:00:00'),
        ...                            pd.Timestamp('20130101 09:00:02'),
        ...                            pd.Timestamp('20130101 09:00:03'),
        ...                            pd.Timestamp('20130101 09:00:05'),
        ...                            pd.Timestamp('20130101 09:00:06')])
        
        >>> df
                               B
        2013-01-01 09:00:00  0.0
        2013-01-01 09:00:02  1.0
        2013-01-01 09:00:03  2.0
        2013-01-01 09:00:05  NaN
        2013-01-01 09:00:06  4.0
        
        Contrasting to an integer rolling window, this will roll a variable
        length window corresponding to the time period.
        The default for min_periods is 1.
        
        >>> df.rolling('2s').sum()
                               B
        2013-01-01 09:00:00  0.0
        2013-01-01 09:00:02  1.0
        2013-01-01 09:00:03  3.0
        2013-01-01 09:00:05  NaN
        2013-01-01 09:00:06  4.0
    

    To illustrate the meaning of rolling,look at the following example:

    index=pd.date_range('2020-5-5',periods=20)
    
    df=pd.DataFrame(np.arange(20),index=index,columns=['test']);df
    
    test
    2020-05-05 0
    2020-05-06 1
    2020-05-07 2
    2020-05-08 3
    2020-05-09 4
    2020-05-10 5
    2020-05-11 6
    2020-05-12 7
    2020-05-13 8
    2020-05-14 9
    2020-05-15 10
    2020-05-16 11
    2020-05-17 12
    2020-05-18 13
    2020-05-19 14
    2020-05-20 15
    2020-05-21 16
    2020-05-22 17
    2020-05-23 18
    2020-05-24 19
    df['sum']=df.test.rolling(3).sum()
    
    df['mean']=df.test.rolling(3).mean()
    
    
    df['mean1'] = df.test.rolling(3,min_periods=2).mean()
    
    df['expanding']=df.test.expanding().mean()
    
    df
    
    test sum mean mean1 expanding
    2020-05-05 0 NaN NaN NaN 0.0
    2020-05-06 1 NaN NaN 0.5 0.5
    2020-05-07 2 3.0 1.0 1.0 1.0
    2020-05-08 3 6.0 2.0 2.0 1.5
    2020-05-09 4 9.0 3.0 3.0 2.0
    2020-05-10 5 12.0 4.0 4.0 2.5
    2020-05-11 6 15.0 5.0 5.0 3.0
    2020-05-12 7 18.0 6.0 6.0 3.5
    2020-05-13 8 21.0 7.0 7.0 4.0
    2020-05-14 9 24.0 8.0 8.0 4.5
    2020-05-15 10 27.0 9.0 9.0 5.0
    2020-05-16 11 30.0 10.0 10.0 5.5
    2020-05-17 12 33.0 11.0 11.0 6.0
    2020-05-18 13 36.0 12.0 12.0 6.5
    2020-05-19 14 39.0 13.0 13.0 7.0
    2020-05-20 15 42.0 14.0 14.0 7.5
    2020-05-21 16 45.0 15.0 15.0 8.0
    2020-05-22 17 48.0 16.0 16.0 8.5
    2020-05-23 18 51.0 17.0 17.0 9.0
    2020-05-24 19 54.0 18.0 18.0 9.5
    df['test'].rolling(2).agg([np.sum,np.mean]) #Using agg function to return several results.
    
    sum mean
    2020-05-05 NaN NaN
    2020-05-06 1.0 0.5
    2020-05-07 3.0 1.5
    2020-05-08 5.0 2.5
    2020-05-09 7.0 3.5
    2020-05-10 9.0 4.5
    2020-05-11 11.0 5.5
    2020-05-12 13.0 6.5
    2020-05-13 15.0 7.5
    2020-05-14 17.0 8.5
    2020-05-15 19.0 9.5
    2020-05-16 21.0 10.5
    2020-05-17 23.0 11.5
    2020-05-18 25.0 12.5
    2020-05-19 27.0 13.5
    2020-05-20 29.0 14.5
    2020-05-21 31.0 15.5
    2020-05-22 33.0 16.5
    2020-05-23 35.0 17.5
    2020-05-24 37.0 18.5

    Look at the result above,since window=3,so the first two datas are Nan.In terms of 'sum',3=0+1+2,6=1+2+3,9=2+3+4...

    The expression rolling(250) is similar in behaviour to groupby,but instead of grouping it, it creates an object that grouping over a 250-day sliding window.By default rolling functions require all of the values in the window to be non-Na.This behaviour can be changed to account for missing and ,in particular ,the fact that you will have fewer than window periods of data at the begining of the time series:

    appl_std250=close_px.AAPL.rolling(250,min_periods=10).std()
    
    appl_std250[5:12]
    
    2003-01-09         NaN
    2003-01-10         NaN
    2003-01-13         NaN
    2003-01-14         NaN
    2003-01-15    0.077496
    2003-01-16    0.074760
    2003-01-17    0.112368
    Freq: B, Name: AAPL, dtype: float64
    
    appl_std250.plot()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x14f23f70cf8>
    

    png

    In order to compute an expanding window mean,use the expanding operator instead of rolling.The expanding mean starts the time window from the begining of the time series and increases the size of the window until it encompasses the whole series.

    To illustrate expanding,look at the following example:

    df.test
    
    2020-05-05     0
    2020-05-06     1
    2020-05-07     2
    2020-05-08     3
    2020-05-09     4
    2020-05-10     5
    2020-05-11     6
    2020-05-12     7
    2020-05-13     8
    2020-05-14     9
    2020-05-15    10
    2020-05-16    11
    2020-05-17    12
    2020-05-18    13
    2020-05-19    14
    2020-05-20    15
    2020-05-21    16
    2020-05-22    17
    2020-05-23    18
    2020-05-24    19
    Freq: D, Name: test, dtype: int32
    
    df.test.expanding().mean()
    
    2020-05-05    0.0
    2020-05-06    0.5
    2020-05-07    1.0
    2020-05-08    1.5
    2020-05-09    2.0
    2020-05-10    2.5
    2020-05-11    3.0
    2020-05-12    3.5
    2020-05-13    4.0
    2020-05-14    4.5
    2020-05-15    5.0
    2020-05-16    5.5
    2020-05-17    6.0
    2020-05-18    6.5
    2020-05-19    7.0
    2020-05-20    7.5
    2020-05-21    8.0
    2020-05-22    8.5
    2020-05-23    9.0
    2020-05-24    9.5
    Freq: D, Name: test, dtype: float64
    

    Compared with rolling,expanding's window is varialble like cumsum, and rolling's window is fixed.

    expanding_mean=appl_std250.expanding().mean()
    
    expanding_mean
    
    2003-01-02          NaN
    2003-01-03          NaN
    2003-01-06          NaN
    2003-01-07          NaN
    2003-01-08          NaN
                    ...    
    2011-10-10    18.521201
    2011-10-11    18.524272
    2011-10-12    18.527385
    2011-10-13    18.530554
    2011-10-14    18.533823
    Freq: B, Name: AAPL, Length: 2292, dtype: float64
    
    close_px.rolling(60).mean().plot(logy=True)
    
    <matplotlib.axes._subplots.AxesSubplot at 0x14f240f87f0>
    

    png

    The rolling function also accepts a string indicating a fixed-size time offset rather than a set number of period.Using this notation can be useful for irregular time series.These are the same string that can be passed to resample.

    ser=pd.Series(np.arange(6),pd.date_range('2020-05-05',periods=6));ser
    
    2020-05-05    0
    2020-05-06    1
    2020-05-07    2
    2020-05-08    3
    2020-05-09    4
    2020-05-10    5
    Freq: D, dtype: int32
    
    ser.rolling('2D').mean()
    
    2020-05-05    0.0
    2020-05-06    0.5
    2020-05-07    1.5
    2020-05-08    2.5
    2020-05-09    3.5
    2020-05-10    4.5
    Freq: D, dtype: float64
    
    ser.rolling('D').mean()
    
    2020-05-05    0.0
    2020-05-06    1.0
    2020-05-07    2.0
    2020-05-08    3.0
    2020-05-09    4.0
    2020-05-10    5.0
    Freq: D, dtype: float64
    
    close_px.rolling('20D').mean()
    
    AAPL MSFT XOM
    2003-01-02 7.400000 21.110000 29.220000
    2003-01-03 7.425000 21.125000 29.230000
    2003-01-06 7.433333 21.256667 29.473333
    2003-01-07 7.432500 21.425000 29.342500
    2003-01-08 7.402000 21.402000 29.240000
    ... ... ... ...
    2011-10-10 389.351429 25.602143 72.527857
    2011-10-11 388.505000 25.674286 72.835000
    2011-10-12 388.531429 25.810000 73.400714
    2011-10-13 388.826429 25.961429 73.905000
    2011-10-14 391.038000 26.048667 74.185333

    2292 rows × 3 columns

    Exponentially weighted functions

    An alternative to using a static window size with equally weighted observations is to sepcify a constant decay factor to give more weight to more recent observations.There are a couple of ways to specify the decay factor. A popular one is using a span,which makes the result comparable to a simple moving window function with window size equal to the span.

    Pandas has the ewmoperator to go along with rolling and expanding.Here is an example comparing a 60-day moving average of Apple's stock price with an EW moving average with span=60:

    appl_px=close_px.AAPL['2006':'2007']
    
    ma60=appl_px.rolling(30,min_periods=20).mean()
    
    ewma60=appl_px.ewm(span=30).mean()
    
    ma60.plot(label='simple MA')
    
    <matplotlib.axes._subplots.AxesSubplot at 0x14f24293390>
    

    png

    ewma60.plot(label='EW MA')
    
    <matplotlib.axes._subplots.AxesSubplot at 0x14f24201128>
    

    png

    User-defined moving window functions

    The apply method on rolling and related methods provides a means to apply an array function of your own devising over a moving window.The only requirement is that the funciton produce a single value(a reduction) from each piece of the array.For example,while we can compute sample quantiles using rolling(...)quantile(q),we might be interested in the percentile rank of a particular value over the sample.

    from scipy.stats import percentileofscore
    
    score_at_2percent=lambda x:percentileofscore(x,0.02)
    
    results=close_px.pct_change()
    
    result=results.AAPL.rolling(250).apply(score_at_2percent)
    
    C:Users旺仔QQ糖AppDataRoamingPythonPython36site-packagesipykernel\__main__.py:1: FutureWarning: Currently, 'apply' passes the values as ndarrays to the applied function. In the future, this will change to passing it as Series objects. You need to specify 'raw=True' to keep the current behaviour, and you can pass 'raw=False' to silence this warning
      if __name__ == '__main__':
    
    result.plot()
    
    <matplotlib.axes._subplots.AxesSubplot at 0x14f2608ec88>
    

    png

    
    
    ##### 愿你一寸一寸地攻城略地,一点一点地焕然一新 #####
  • 相关阅读:
    Manjaro i3wm替换默认程序配置
    我的Alacritty配置
    我的compton配置
    我的i3config配置
    我的Manjaro rofi配置
    我的Manjaro i3status配置说明
    我的Manjaro i3自用软件记录
    我的Manjaro i3wm安装记录
    数组介绍
    进制、位运算笔记
  • 原文地址:https://www.cnblogs.com/johnyang/p/12830402.html
Copyright © 2011-2022 走看看