
Time Series in pandas

Time Series

import pandas as pd

import numpy as np

Date and Time data types and tools

from datetime import datetime
from datetime import date
from datetime import timedelta

now=datetime.now()

now

datetime.datetime(2020, 5, 5, 9, 51, 27, 686891)

now.year,now.month,now.day

(2020, 5, 5)

now.time()

datetime.time(9, 51, 27, 686891)

datetime stores both the date and the time down to the microsecond. timedelta represents the temporal difference between two datetime objects.

delta=datetime(2020,4,29)-datetime(2018,8,19)

delta

datetime.timedelta(619)

delta.days

delta.seconds

You can add (or subtract) a timedelta, or a multiple of one, to a datetime object to yield a new, shifted object:

start=datetime(2020,4,30)

start+timedelta(100)

datetime.datetime(2020, 8, 8, 0, 0)

datetime.now()

datetime.datetime(2020, 5, 5, 9, 51, 27, 888352)

date(2008,8,8)-date(2000,8,8)

datetime.timedelta(2922)
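
As a small sketch of the attributes used above (the half-day time component and the variable name shifted are illustrative assumptions, not part of the original session), note that days and seconds are separate fields, while total_seconds() gives the whole span:

```python
from datetime import datetime, timedelta

# Same dates as above, but with a time component added to the first one.
delta = datetime(2020, 4, 29, 12, 30) - datetime(2018, 8, 19)

# A timedelta stores days, seconds, and microseconds as separate fields.
print(delta.days)             # whole days in the difference
print(delta.seconds)          # leftover seconds within the final partial day
print(delta.total_seconds())  # the entire span expressed in seconds

# A multiple of a timedelta shifts a datetime by that many intervals.
start = datetime(2020, 4, 30)
shifted = start + 2 * timedelta(days=50)
print(shifted)
```

Because seconds only covers the remainder after whole days, total_seconds() is usually the safer choice when the length of the full interval is needed.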

Converting between string and datetime

You can format datetime objects and pandas Timestamp objects as strings using str or the strftime method, passing a format specification:

stamp=datetime(2011,1,3)

str(stamp)

'2011-01-03 00:00:00'

stamp.strftime('%Y--%m-!-%d') # mnemonic: 'string from time'; an instance method of datetime objects

'2011--01-!-03'

description={'%Y':'four-digit year',
           '%y':'two-digit year',
           '%m':'two-digit month[01,12]',
           '%d':'two-digit day[01,31]',
           '%H':'Hour(24-hour clock)[00,23]',
           '%I':'Hour(12-hour clock)[01,12]',
           '%M':'two-digit minute[00,59]',
           '%S':'second[00,61](seconds60,61 account for leap seconds)',
           '%w':'weekday as integer[0(sunday),6]',
           '%U':'week number of the year[00,53];sunday is considered the first day of the week,and days before the first sunday of the year are "week 0"',
           '%W':'week number of the year[00,53];monday is considered the first day of the week and days before the first monday of the year are "week 0"',
           '%z':'UTC time zone offset as +HHMM or -HHMM;empty if time zone naive',
           '%F':'shortcut for %Y-%m-%d',
           '%D':'shortcut for %m/%d/%y(e.g04/18/12)'}

pd.DataFrame(description,index=[0])

	%Y	%y	%m	%d	%H	%I	%M	%S	%w	%U	%W	%z	%F	%D
0	four-digit year	two-digit year	two-digit month[01,12]	two-digit day[01,31]	Hour(24-hour clock)[00,23]	Hour(12-hour clock)[01,12]	two-digit minute[00,59]	second[00,61](seconds60,61 account for leap se...	weekday as integer[0(sunday),6]	week number of the year[00,53];sunday is consi...	week number of the year[00,53];monday is consi...	UTC time zone offset as +HHMM or -HHMM;empty i...	shortcut for %Y-%m-%d	shortcut for %m/%d/%y(e.g04/18/12)

pd.DataFrame(description,index=[0]).stack()

0  %Y                                      four-digit year
   %y                                       two-digit year
   %m                               two-digit month[01,12]
   %d                                 two-digit day[01,31]
   %H                           Hour(24-hour clock)[00,23]
   %I                           Hour(12-hour clock)[01,12]
   %M                              two-digit minute[00,59]
   %S    second[00,61](seconds60,61 account for leap se...
   %w                      weekday as integer[0(sunday),6]
   %U    week number of the year[00,53];sunday is consi...
   %W    week number of the year[00,53];monday is consi...
   %z    UTC time zone offset as +HHMM or -HHMM;empty i...
   %F                                shortcut for %Y-%m-%d
   %D                   shortcut for %m/%d/%y(e.g04/18/12)
dtype: object

You can use these same format codes to convert strings to dates using datetime.strptime:

value='2011-01-03'

datetime.strptime(value,'%Y-%m-%d')  # mnemonic: 'string parse time'; a class method of datetime

datetime.datetime(2011, 1, 3, 0, 0)

datestrs=['7/6/2011','8/6/2011']

[datetime.strptime(x,'%m/%d/%Y') for x in datestrs]

[datetime.datetime(2011, 7, 6, 0, 0), datetime.datetime(2011, 8, 6, 0, 0)]
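
To illustrate how the two methods relate, here is a minimal round-trip sketch. The format string and variable names are chosen for illustration only. It also shows that strptime raises ValueError when the text does not match the format:

```python
from datetime import datetime

fmt = '%Y-%m-%d %H:%M'
stamp = datetime(2011, 1, 3, 14, 5)

# strftime: datetime -> string; strptime: string -> datetime.
text = stamp.strftime(fmt)
parsed = datetime.strptime(text, fmt)
print(text)
print(parsed == stamp)  # the round trip preserves every field named in fmt

# A string that does not match the format raises ValueError.
try:
    datetime.strptime('03/01/2011', fmt)
except ValueError as exc:
    print('could not parse:', exc)
```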

datetime.strptime is a good way to parse a date with a known format. However, it can be a bit annoying to have to write a format spec each time, especially for common date formats. In this case, you can use the parser.parse method in the third-party dateutil package (this is installed automatically when you install pandas):

from dateutil.parser import parse  # notice that, it is `parse` not `parser`!

parse('2020--4--30')

datetime.datetime(2020, 4, 30, 0, 0)

dateutil is capable of parsing most human-intelligible date representations:

parse('Jan 31,1997 10:45PM')  # no space after the comma: the year is apparently not recognized, so the current year (2020 here) is filled in

datetime.datetime(2020, 1, 31, 22, 45)

In international locales, the day appearing before the month is very common, so you can pass dayfirst=True to indicate this:

parse('6/12/2022',dayfirst=True)

datetime.datetime(2022, 12, 6, 0, 0)

pandas is generally oriented toward working with arrays of dates, whether used as an axis index or a column in a DataFrame. The to_datetime method parses many different kinds of date representations. Standard date formats like ISO 8601 can be parsed very quickly.

datestrs=['2011-07-16 12:00:00','2011-08-06 00:00:00']

pd.to_datetime(['2011-07-16 ','2011-08-06 '])

DatetimeIndex(['2011-07-16', '2011-08-06'], dtype='datetime64[ns]', freq=None)

pd.to_datetime(datestrs)

DatetimeIndex(['2011-07-16 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)

It also handles values that should be considered missing (None, empty string, etc.):

idx=pd.to_datetime(datestrs+[None]);idx

DatetimeIndex(['2011-07-16 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)

NaT (Not a Time) is pandas's null value for timestamp data.
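
Unparseable strings can also be turned into NaT instead of raising an error. The sketch below assumes the errors='coerce' option of pd.to_datetime. The sample values and variable names are made up for illustration:

```python
import pandas as pd

raw = ['2011-07-06', 'not a date', None]

# errors='coerce' replaces anything that cannot be parsed with NaT.
parsed = pd.to_datetime(raw, errors='coerce')
print(parsed)

# isna() flags the NaT entries; dropna() removes them.
print(parsed.isna())
clean = parsed.dropna()
print(clean)
```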

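Notice
dateutil.parser is a useful but imperfect tool. Notably, it will recognize some strings as dates that you might prefer it didn't; for example, '42' is parsed as the year 2042 with today's calendar date.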

Time series basics

A basic kind of time series object in pandas is a Series indexed by timestamps, which are often represented outside pandas as Python strings or datetime objects.

dates=[datetime(2020,1,2),datetime(2020,1,5),datetime(2020,1,7),datetime(2020,1,8),datetime(2020,1,10),datetime(2020,1,12)]

ts=pd.Series(np.random.randn(6),index=dates);ts

2020-01-02   -1.140949
2020-01-05   -0.328999
2020-01-07   -0.046164
2020-01-08   -0.783714
2020-01-10   -0.126047
2020-01-12   -0.848602
dtype: float64

pd.Series(np.random.randn(3),index=idx)

2011-07-16 12:00:00    2.259805
2011-08-06 00:00:00   -0.877063
NaT                   -0.697678
dtype: float64

Under the hood, these datetime objects have been put in a DatetimeIndex:

ts.index

DatetimeIndex(['2020-01-02', '2020-01-05', '2020-01-07', '2020-01-08',
               '2020-01-10', '2020-01-12'],
              dtype='datetime64[ns]', freq=None)

Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:

ts[::2]

2020-01-02   -1.140949
2020-01-07   -0.046164
2020-01-10   -0.126047
dtype: float64

ts+ts[::2]

2020-01-02   -2.281898
2020-01-05         NaN
2020-01-07   -0.092328
2020-01-08         NaN
2020-01-10   -0.252095
2020-01-12         NaN
dtype: float64

pd.Series([1,2,3])+pd.Series([3,4])

0    4.0
1    6.0
2    NaN
dtype: float64
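
If the NaN values produced by alignment are not wanted, one option is Series.add with fill_value, which treats a label missing from one side as the given value. This is a minimal sketch with made-up data, not part of the original session:

```python
import pandas as pd

dates = pd.to_datetime(['2020-01-02', '2020-01-05', '2020-01-07'])
a = pd.Series([1.0, 2.0, 3.0], index=dates)
b = a[::2]  # only the first and third dates

# Plain addition aligns on the index and yields NaN where b has no entry.
summed = a + b
print(summed)

# fill_value=0 treats the missing entry in b as 0 instead.
filled = a.add(b, fill_value=0)
print(filled)
```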

pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution.

ts.index.dtype

dtype('<M8[ns]')

Scalar values from a DatetimeIndex are pandas Timestamp objects:

stamp=ts.index[0];stamp

Timestamp('2020-01-02 00:00:00')

A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations.

Indexing, selection, subsetting

ts

2020-01-02   -1.140949
2020-01-05   -0.328999
2020-01-07   -0.046164
2020-01-08   -0.783714
2020-01-10   -0.126047
2020-01-12   -0.848602
dtype: float64

A time series behaves like any other pandas.Series when you are indexing and selecting data based on label:

stamp=ts.index[2]

ts[stamp]

-0.04616414830843706

As a convenience, you can also pass a string that is interpretable as a date:

ts['1/10/2020']

-0.12604738036158042

ts['2020-1-10']

-0.12604738036158042

For longer time series, you can pass just a year, or a year and month, to easily select slices of data. The examples build their index with pd.date_range, whose documentation is shown first:

help(pd.date_range)

Help on function date_range in module pandas.core.indexes.datetimes:

date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)
    Return a fixed frequency DatetimeIndex.
    
    Parameters
    ----------
    start : str or datetime-like, optional
        Left bound for generating dates.
    end : str or datetime-like, optional
        Right bound for generating dates.
    periods : integer, optional
        Number of periods to generate.
    freq : str or DateOffset, default 'D'
        Frequency strings can have multiples, e.g. '5H'. See
        :ref:`here <timeseries.offset_aliases>` for a list of
        frequency aliases.
    tz : str or tzinfo, optional
        Time zone name for returning localized DatetimeIndex, for example
        'Asia/Hong_Kong'. By default, the resulting DatetimeIndex is
        timezone-naive.
    normalize : bool, default False
        Normalize start/end dates to midnight before generating date range.
    name : str, default None
        Name of the resulting DatetimeIndex.
    closed : {None, 'left', 'right'}, optional
        Make the interval closed with respect to the given frequency to
        the 'left', 'right', or both sides (None, the default).
    **kwargs
        For compatibility. Has no effect on the result.
    
    Returns
    -------
    rng : DatetimeIndex
    
    See Also
    --------
    DatetimeIndex : An immutable container for datetimes.
    timedelta_range : Return a fixed frequency TimedeltaIndex.
    period_range : Return a fixed frequency PeriodIndex.
    interval_range : Return a fixed frequency IntervalIndex.
    
    Notes
    -----
    Of the four parameters ``start``, ``end``, ``periods``, and ``freq``,
    exactly three must be specified. If ``freq`` is omitted, the resulting
    ``DatetimeIndex`` will have ``periods`` linearly spaced elements between
    ``start`` and ``end`` (closed on both sides).
    
    To learn more about the frequency strings, please see `this link
    <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
    
    Examples
    --------
    **Specifying the values**
    
    The next four examples generate the same `DatetimeIndex`, but vary
    the combination of `start`, `end` and `periods`.
    
    Specify `start` and `end`, with the default daily frequency.
    
    >>> pd.date_range(start='1/1/2018', end='1/08/2018')
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start` and `periods`, the number of periods (days).
    
    >>> pd.date_range(start='1/1/2018', periods=8)
    DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
                   '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `end` and `periods`, the number of periods (days).
    
    >>> pd.date_range(end='1/1/2018', periods=8)
    DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
                   '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
                  dtype='datetime64[ns]', freq='D')
    
    Specify `start`, `end`, and `periods`; the frequency is generated
    automatically (linearly spaced).
    
    >>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
    DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
                   '2018-04-27 00:00:00'],
                  dtype='datetime64[ns]', freq=None)
    
    **Other Parameters**
    
    Changed the `freq` (frequency) to ``'M'`` (month end frequency).
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='M')
    DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
                   '2018-05-31'],
                  dtype='datetime64[ns]', freq='M')
    
    Multiples are allowed
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    `freq` can also be specified as an Offset object.
    
    >>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
    DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
                   '2019-01-31'],
                  dtype='datetime64[ns]', freq='3M')
    
    Specify `tz` to set the timezone.
    
    >>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
    DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
                   '2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
                   '2018-01-05 00:00:00+09:00'],
                  dtype='datetime64[ns, Asia/Tokyo]', freq='D')
    
    `closed` controls whether to include `start` and `end` that are on the
    boundary. The default includes boundary points on either end.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed=None)
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='left'`` to exclude `end` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='left')
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
                  dtype='datetime64[ns]', freq='D')
    
    Use ``closed='right'`` to exclude `start` if it falls on the boundary.
    
    >>> pd.date_range(start='2017-01-01', end='2017-01-04', closed='right')
    DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
                  dtype='datetime64[ns]', freq='D')

pd.date_range('2020-05-01',periods=5)

DatetimeIndex(['2020-05-01', '2020-05-02', '2020-05-03', '2020-05-04',
               '2020-05-05'],
              dtype='datetime64[ns]', freq='D')

longer_ts=pd.Series(np.random.randn(1000),index=pd.date_range('1/1/2000',periods=1000))

longer_ts['2001']

2001-01-01   -2.081715
2001-01-02    1.425891
2001-01-03    0.314430
2001-01-04    0.153332
2001-01-05    0.282888
                ...   
2001-12-27   -0.299994
2001-12-28    1.852834
2001-12-29    1.847192
2001-12-30    0.592563
2001-12-31    0.519122
Freq: D, Length: 365, dtype: float64

Here, the string '2001' is interpreted as a year and selects that time period. This also works if you specify the month:

longer_ts['2001-05']

2001-05-01   -0.408720
2001-05-02   -0.958682
2001-05-03   -0.424746
2001-05-04    0.771404
2001-05-05    1.959182
2001-05-06    0.287984
2001-05-07   -0.199789
2001-05-08   -0.369938
2001-05-09    0.309950
2001-05-10   -1.649661
2001-05-11    0.119676
2001-05-12    0.205413
2001-05-13    0.416938
2001-05-14   -0.305450
2001-05-15   -0.126385
2001-05-16    1.665036
2001-05-17    0.627492
2001-05-18   -1.317637
2001-05-19   -2.734170
2001-05-20   -0.163745
2001-05-21   -0.784528
2001-05-22    0.151304
2001-05-23    0.583916
2001-05-24    0.571195
2001-05-25   -1.498402
2001-05-26   -1.485187
2001-05-27    0.411882
2001-05-28    0.323999
2001-05-29    0.627545
2001-05-30   -2.054165
2001-05-31   -1.493494
Freq: D, dtype: float64

longer_ts['2000-05-01']

-0.08042675710861961
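
Partial date strings can also serve as slice endpoints. In the sketch below (the data and names are illustrative, and .loc is used to make the label-based intent explicit), the end label '2001-06' covers the whole of June:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1000),
              index=pd.date_range('2000-01-01', periods=1000))

# Both endpoints are partial strings; the slice spans all of May and June.
may_june = s.loc['2001-05':'2001-06']
print(may_june.index[0], may_june.index[-1])
print(len(may_june))
```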

Indexing with datetime objects works as well:

ts[datetime(2020,1,7)]

-0.04616414830843706

Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:

ts

2020-01-02   -1.140949
2020-01-05   -0.328999
2020-01-07   -0.046164
2020-01-08   -0.783714
2020-01-10   -0.126047
2020-01-12   -0.848602
dtype: float64

ts['2020-1-6':'2020-1-14'] # note that '2020-1-6' and '2020-1-14' are not in ts

2020-01-07   -0.046164
2020-01-08   -0.783714
2020-01-10   -0.126047
2020-01-12   -0.848602
dtype: float64

As before, you can pass a string date, a datetime, or a Timestamp. Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays. This means that no data is copied and modifications on the slice will be reflected in the original data, at least in the pandas version used here.

There is an equivalent instance method, truncate, that slices a Series between two dates:

ts.truncate(after='1/9/2020')

2020-01-02   -1.140949
2020-01-05   -0.328999
2020-01-07   -0.046164
2020-01-08   -0.783714
dtype: float64
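
As a quick check of that equivalence, the sketch below compares truncate with a label slice on a small, sorted series. The values and variable names are made up for illustration:

```python
import pandas as pd

dates = pd.to_datetime(['2020-01-02', '2020-01-05', '2020-01-07',
                        '2020-01-08', '2020-01-10', '2020-01-12'])
ts = pd.Series(range(6), index=dates)

# truncate drops everything before `before` and after `after`.
by_truncate = ts.truncate(before='2020-01-06', after='2020-01-09')

# The same window expressed as a label slice.
by_slice = ts.loc['2020-01-06':'2020-01-09']

print(by_truncate)
print(by_truncate.equals(by_slice))
```

Both forms rely on the index being sorted.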

All of this holds true for DataFrame as well, indexing on its rows:

dates=pd.date_range('1/1/2000',periods=100,freq='w-wed')

long_df=pd.DataFrame(np.random.randn(100,4),index=dates,columns=['Colorado','Texas','New York','Ohio']) # note: the objects discussed above were Series; this one is a DataFrame

long_df.loc['5-2001']

	Colorado	Texas	New York	Ohio
2001-05-02	-0.157680	-0.398869	0.399008	-0.109813
2001-05-09	-0.475721	-0.550544	0.406308	0.290822
2001-05-16	-1.101315	2.469916	-0.062604	-0.409562
2001-05-23	0.972402	1.035692	-0.594960	1.255631
2001-05-30	-1.823161	-0.327003	-0.294791	0.795953

Time series with duplicate indices

In some applications, there may be multiple data observations falling on a particular timestamp.

dates=pd.DatetimeIndex(['1/1/2000','1/2/2000','1/2/2000','1/2/2000','1/3/2000'])

dup_ts=pd.Series(np.arange(5),index=dates)

dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int32

dup_ts.index.is_unique

False

Indexing into this time series will now either produce scalar values or slices depending on whether a timestamp is duplicated:

dup_ts['1/3/2000']

dup_ts['1/2/2000']

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int32

Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0:

grouped=dup_ts.groupby(level=0)

grouped.mean()

2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int32

grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
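
When aggregation is not needed and one observation per timestamp is enough, another option is to filter with Index.duplicated. This is a sketch of an alternative technique, not something the original session ran:

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)

# duplicated() marks repeats; keep= decides which occurrence is left unmarked.
first_only = dup_ts[~dup_ts.index.duplicated(keep='first')]
last_only = dup_ts[~dup_ts.index.duplicated(keep='last')]

print(first_only)
print(last_only)
```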

Date ranges, frequencies, and shifting

Generic time series in pandas are assumed to be irregular; that is, they have no fixed frequency. For many applications this is sufficient. However, it's often desirable to work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if that means introducing missing values into a time series. Fortunately, pandas has a full suite of standard time series frequencies and tools for resampling, inferring frequencies, and generating fixed-frequency date ranges.

ts

2020-01-02   -1.140949
2020-01-05   -0.328999
2020-01-07   -0.046164
2020-01-08   -0.783714
2020-01-10   -0.126047
2020-01-12   -0.848602
dtype: float64

resampler=ts.resample('D');resampler

<pandas.core.resample.DatetimeIndexResampler object at 0x0000014F20270320>

The string 'D' is interpreted as a daily frequency. Conversion between frequencies, or resampling, is a big enough topic to have its own section.
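
To see what converting to a fixed frequency does to irregular data, the sketch below calls asfreq() on the resampler for a small made-up series. Days with no observation appear as NaN:

```python
import pandas as pd

irregular = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.to_datetime(['2020-01-02', '2020-01-05', '2020-01-07']))

# Reindex onto every calendar day between the first and last timestamp.
daily = irregular.resample('D').asfreq()

print(daily)
print(daily.isna().sum())  # number of days that had no observation
```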

Generating date ranges

pd.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency.

index=pd.date_range('2012-04-01','2012-06-01')

index

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

By default, date_range generates daily timestamps. If you pass only a start or end date, you must pass a number of periods to generate:

pd.date_range(start='2012-04-01',periods=20)

DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')

pd.date_range(end='2012-06-1',periods=20)

DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')

The start and end dates define strict boundaries for the generated date index. For example, if you wanted a date index containing the last business day of each month, you would pass the BM frequency (business end of month), and only dates falling on or inside the date interval will be included:

pd.date_range('2020-04-01','2020-07-01',freq='BM') #'BM'-->BusinessMonthEnd

DatetimeIndex(['2020-04-30', '2020-05-29', '2020-06-30'], dtype='datetime64[ns]', freq='BM')

pd.date_range('2020-05-01',periods=10,freq='w-fri') # 'w-mon,w-tue...'--->Weekly on given day of week

DatetimeIndex(['2020-05-01', '2020-05-08', '2020-05-15', '2020-05-22',
               '2020-05-29', '2020-06-05', '2020-06-12', '2020-06-19',
               '2020-06-26', '2020-07-03'],
              dtype='datetime64[ns]', freq='W-FRI')

pd.date_range('2020-05-01',periods=3,freq='b') # 'b'--> BusinessDay

DatetimeIndex(['2020-05-01', '2020-05-04', '2020-05-05'], dtype='datetime64[ns]', freq='B')

pd.date_range('2020-05-01',periods=4,freq='q-feb') # 'q-feb'--> quarterly dates on the last calendar day of the month, for a year ending in February

DatetimeIndex(['2020-05-31', '2020-08-31', '2020-11-30', '2021-02-28'], dtype='datetime64[ns]', freq='Q-FEB')

pd.date_range('2020-05-01',periods=4,freq='q-mar') # 'q-mar'--> the same, for a year ending in March

DatetimeIndex(['2020-06-30', '2020-09-30', '2020-12-31', '2021-03-31'], dtype='datetime64[ns]', freq='Q-MAR')

By default, date_range preserves the time (if any) of the start or end timestamp:

pd.date_range('2012-05-02 12:56:31',periods=5)

DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')

Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention. To do this, there is a normalize option:

pd.date_range('2012-05-02 12:56:31',periods=5,normalize=True)

DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')

Frequencies and date offsets

Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like 'M' for monthly or 'H' for hourly. For each base frequency, there is an object defined, generally referred to as a date offset.

from pandas.tseries.offsets import Hour,Minute

hour=Hour()

hour

<Hour>

four_hour=Hour(4)

four_hour

<4 * Hours>

In most applications, you would never need to explicitly create one of these objects, instead using a string alias like 'H' or '4H'. Putting an integer before the base frequency creates a multiple:

pd.date_range('2000-01-01','2000-01-03 23:59',freq='4H')

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')

Many offsets can be combined together by addition:

Hour(2)+Minute(30)

<150 * Minutes>

Similarly, you can pass frequency strings, like '1h30min', that will effectively be parsed to the same kind of expression:

pd.date_range('2000-01-01',periods=10,freq='1h30min')

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')
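
To check how such a string relates to offset objects, the sketch below parses it with to_offset from pandas.tseries.frequencies and compares the result with an explicit sum. The variable names are illustrative:

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset
from pandas.tseries.offsets import Hour, Minute

# Build the offset two ways: from objects and from a frequency string.
combined = Hour(1) + Minute(30)
parsed = to_offset('1h30min')

print(combined)
print(parsed)
print(combined == parsed)

# Fixed-length offsets convert cleanly to a Timedelta.
print(pd.Timedelta(parsed))
```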

Some frequencies describe points in time that are not evenly spaced. For example, 'M' (calendar month end) and 'BM' (last business/weekday of month) depend on the number of days in a month and, in the latter case, whether the month ends on a weekend or not. We refer to these as anchored offsets.

Week of month dates

One useful frequency class is 'week of month', starting with WOM. This enables you to get dates like the third Friday of each month:

rng=pd.date_range('2012-01-01','2012-09-01',freq='WOM-3FRI')

list(rng)

[Timestamp('2012-01-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-02-17 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-03-16 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-04-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-05-18 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-06-15 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-07-20 00:00:00', freq='WOM-3FRI'),
 Timestamp('2012-08-17 00:00:00', freq='WOM-3FRI')]

Shifting (leading and lagging) data

Shifting means moving data backward and forward through time. Both Series and DataFrame have a shift method for doing naive shifts forward or backward, leaving the index unmodified:

ts=pd.Series(np.random.randn(4),index=pd.date_range('1/1/2020',periods=4,freq='M'))

help(ts.shift)

Help on method shift in module pandas.core.series:

shift(periods=1, freq=None, axis=0, fill_value=None) method of pandas.core.series.Series instance
    Shift index by desired number of periods with an optional time `freq`.
    
    When `freq` is not passed, shift the index without realigning the data.
    If `freq` is passed (in this case, the index must be date or datetime,
    or it will raise a `NotImplementedError`), the index will be
    increased using the periods and the `freq`.
    
    Parameters
    ----------
    periods : int
        Number of periods to shift. Can be positive or negative.
    freq : DateOffset, tseries.offsets, timedelta, or str, optional
        Offset to use from the tseries module or time rule (e.g. 'EOM').
        If `freq` is specified then the index values are shifted but the
        data is not realigned. That is, use `freq` if you would like to
        extend the index when shifting and preserve the original data.
    axis : {0 or 'index', 1 or 'columns', None}, default None
        Shift direction.
    fill_value : object, optional
        The scalar value to use for newly introduced missing values.
        the default depends on the dtype of `self`.
        For numeric data, ``np.nan`` is used.
        For datetime, timedelta, or period data, etc. :attr:`NaT` is used.
        For extension dtypes, ``self.dtype.na_value`` is used.
    
        .. versionchanged:: 0.24.0
    
    Returns
    -------
    Series
        Copy of input object, shifted.
    
    See Also
    --------
    Index.shift : Shift values of Index.
    DatetimeIndex.shift : Shift values of DatetimeIndex.
    PeriodIndex.shift : Shift values of PeriodIndex.
    tshift : Shift the time index, using the index's frequency if
        available.
    
    Examples
    --------
    >>> df = pd.DataFrame({'Col1': [10, 20, 15, 30, 45],
    ...                    'Col2': [13, 23, 18, 33, 48],
    ...                    'Col3': [17, 27, 22, 37, 52]})
    
    >>> df.shift(periods=3)
       Col1  Col2  Col3
    0   NaN   NaN   NaN
    1   NaN   NaN   NaN
    2   NaN   NaN   NaN
    3  10.0  13.0  17.0
    4  20.0  23.0  27.0
    
    >>> df.shift(periods=1, axis='columns')
       Col1  Col2  Col3
    0   NaN  10.0  13.0
    1   NaN  20.0  23.0
    2   NaN  15.0  18.0
    3   NaN  30.0  33.0
    4   NaN  45.0  48.0
    
    >>> df.shift(periods=3, fill_value=0)
       Col1  Col2  Col3
    0     0     0     0
    1     0     0     0
    2     0     0     0
    3    10    13    17
    4    20    23    27

ts

2020-01-31    0.225376
2020-02-29   -0.024710
2020-03-31    0.117686
2020-04-30    1.513727
Freq: M, dtype: float64

ts.shift(2)

2020-01-31         NaN
2020-02-29         NaN
2020-03-31    0.225376
2020-04-30   -0.024710
Freq: M, dtype: float64

ts.shift(-2)

2020-01-31    0.117686
2020-02-29    1.513727
2020-03-31         NaN
2020-04-30         NaN
Freq: M, dtype: float64

When we shift like this,missing data is introduced either at the start or the end of the time Series.

A common use of shift is computing percent changes in a time series or multiple time series as DataFrame columns.
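
A minimal sketch of that percent-change computation, using made-up prices. The comparison with the built-in pct_change method is added for illustration:

```python
import numpy as np
import pandas as pd

prices = pd.Series(
    [100.0, 110.0, 99.0, 121.0],
    index=pd.date_range('2020-01-31', periods=4, freq=pd.offsets.MonthEnd()))

# Divide each value by the previous one; the first entry has no predecessor.
manual = prices / prices.shift(1) - 1
print(manual)

# pct_change performs the same computation.
builtin = prices.pct_change()
print(np.allclose(manual.dropna(), builtin.dropna()))
```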

Because naive shifts leave the index unmodified, some data is discarded. Thus, if the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data:

ts.shift(2,freq='M')

2020-03-31    0.225376
2020-04-30   -0.024710
2020-05-31    0.117686
2020-06-30    1.513727
Freq: M, dtype: float64

ts

2020-01-31    0.225376
2020-02-29   -0.024710
2020-03-31    0.117686
2020-04-30    1.513727
Freq: M, dtype: float64

Other frequencies can be passed too, giving you some flexibility in how to lead and lag the data:

ts.shift(3,freq='D') # shift every timestamp in ts forward 3 days

2020-02-03    0.225376
2020-03-03   -0.024710
2020-04-03    0.117686
2020-05-03    1.513727
dtype: float64

ts.shift(1,freq='90T')  # shift every timestamp in ts forward by 90 minutes

2020-01-31 01:30:00    0.225376
2020-02-29 01:30:00   -0.024710
2020-03-31 01:30:00    0.117686
2020-04-30 01:30:00    1.513727
Freq: M, dtype: float64

Shift dates with offsets

The pandas date offsets can also be used with datetime or Timestamp objects:

from pandas.tseries.offsets import Day,MonthEnd

now=datetime(2020,11,17)

now+3*Day()

Timestamp('2020-11-20 00:00:00')

now+MonthEnd(2)

Timestamp('2020-12-31 00:00:00')

Anchored offsets can explicitly 'roll' dates forward or backward by simply using their rollforward and rollback methods, respectively:

offset=MonthEnd()

offset.rollforward(now)

Timestamp('2020-11-30 00:00:00')

offset.rollback(now)

Timestamp('2020-10-31 00:00:00')
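
One detail worth seeing side by side is how rolling differs from adding when a date already sits on the anchor. The sketch below uses illustrative dates and names:

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

offset = MonthEnd()
mid_month = pd.Timestamp('2020-11-17')
on_anchor = pd.Timestamp('2020-11-30')

# From mid-month, rolling forward and adding the offset agree.
print(offset.rollforward(mid_month))
print(mid_month + offset)

# A date already on a month end is left alone by rollforward,
# while addition always moves on to the next month end.
print(offset.rollforward(on_anchor))
print(on_anchor + offset)
```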

A creative use of date offsets is to use these methods with groupby:

ts=pd.Series(np.random.randn(20),index=pd.date_range('1/15/2000',periods=20,freq='4d'))

list(ts.groupby(offset.rollforward))

[(Timestamp('2000-01-31 00:00:00'),
  2000-01-15   -0.209800
  2000-01-19   -2.189881
  2000-01-23   -1.779681
  2000-01-27    0.437441
  2000-01-31    1.054685
  Freq: 4D, dtype: float64),
 (Timestamp('2000-02-29 00:00:00'),
  2000-02-04   -0.506648
  2000-02-08    0.484109
  2000-02-12   -0.385587
  2000-02-16   -0.732983
  2000-02-20   -1.459167
  2000-02-24   -1.133808
  2000-02-28    0.097860
  Freq: 4D, dtype: float64),
 (Timestamp('2000-03-31 00:00:00'),
  2000-03-03    0.480492
  2000-03-07    1.040105
  2000-03-11    0.634999
  2000-03-15    0.621187
  2000-03-19   -1.410100
  2000-03-23    0.319765
  2000-03-27   -1.079803
  2000-03-31   -1.292514
  Freq: 4D, dtype: float64)]

ts.groupby(offset.rollforward).mean()

2000-01-31   -0.537447
2000-02-29   -0.519461
2000-03-31   -0.085734
dtype: float64

Of course, an easier and faster way to do this is to use resample:

ts.resample('M').mean()

2000-01-31   -0.537447
2000-02-29   -0.519461
2000-03-31   -0.085734
Freq: M, dtype: float64
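The equivalence of the two approaches can be checked directly. The sketch below uses its own seeded random data and passes a MonthEnd offset object to resample instead of the 'M' string, which is an assumption made here to avoid depending on alias spellings:

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Illustrative data: 20 observations spaced four days apart.
rng = np.random.default_rng(0)
idx = pd.date_range('2000-01-15', periods=20, freq='4D')
s = pd.Series(rng.standard_normal(20), index=idx)

# Group each timestamp under the month end it rolls forward to.
by_group = s.groupby(MonthEnd().rollforward).mean()

# Let resample build the same month-end bins.
by_resample = s.resample(MonthEnd()).mean()

print(np.allclose(by_group.values, by_resample.values))
```

The two agree here because every month in the range contains data; resample would also emit empty bins for months with no observations, which groupby would not.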

Time zone handling

Working with time zones is generally considered one of the most unpleasant parts of time series manipulation. As a result, many time series users choose to work with time series in coordinated universal time or UTC, which is the successor to Greenwich Mean Time and is the current international standard. Time zones are expressed as offsets from UTC; for example, New York is four hours behind UTC during daylight saving time and five hours behind the rest of the year.

In Python, time zone information comes from the third-party pytz library, which exposes the Olson database, a compilation of world time zone information. pandas wraps pytz's functionality so you can ignore its API outside of the time zone names. Time zone names can be found interactively and in the docs.

import pytz

pytz.common_timezones[-5:]

['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']

len(pytz.common_timezones)

To get a time zone object from pytz, use pytz.timezone:

tz=pytz.timezone('America/New_York')

tz

<DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>

Time zone localization and conversion

By default, time series in pandas are time zone naive. For example, consider the following time series:

rng=pd.date_range('3/9/2020 9:00',periods=6,freq='D')

ts=pd.Series(np.random.randn(len(rng)),index=rng)

ts

2020-03-09 09:00:00   -0.384109
2020-03-10 09:00:00   -0.195272
2020-03-11 09:00:00   -0.473277
2020-03-12 09:00:00    1.430223
2020-03-13 09:00:00   -0.222399
2020-03-14 09:00:00   -0.844174
Freq: D, dtype: float64

print(ts.index.tz) # prints None, indicating that ts.index is time zone naive

None

The index's tz field is None. Date ranges can be generated with a time zone set:

pd.date_range('3/9/2020 9:30',periods=10,freq='D',tz='UTC')

DatetimeIndex(['2020-03-09 09:30:00+00:00', '2020-03-10 09:30:00+00:00',
               '2020-03-11 09:30:00+00:00', '2020-03-12 09:30:00+00:00',
               '2020-03-13 09:30:00+00:00', '2020-03-14 09:30:00+00:00',
               '2020-03-15 09:30:00+00:00', '2020-03-16 09:30:00+00:00',
               '2020-03-17 09:30:00+00:00', '2020-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

Conversion from naive to localized is handled by tz_localize:

ts

2020-03-09 09:00:00   -0.384109
2020-03-10 09:00:00   -0.195272
2020-03-11 09:00:00   -0.473277
2020-03-12 09:00:00    1.430223
2020-03-13 09:00:00   -0.222399
2020-03-14 09:00:00   -0.844174
Freq: D, dtype: float64

ts_utc=ts.tz_localize('utc')

ts_utc

2020-03-09 09:00:00+00:00   -0.384109
2020-03-10 09:00:00+00:00   -0.195272
2020-03-11 09:00:00+00:00   -0.473277
2020-03-12 09:00:00+00:00    1.430223
2020-03-13 09:00:00+00:00   -0.222399
2020-03-14 09:00:00+00:00   -0.844174
Freq: D, dtype: float64

ts_utc.index # the dtype is now datetime64[ns, UTC]

DatetimeIndex(['2020-03-09 09:00:00+00:00', '2020-03-10 09:00:00+00:00',
               '2020-03-11 09:00:00+00:00', '2020-03-12 09:00:00+00:00',
               '2020-03-13 09:00:00+00:00', '2020-03-14 09:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')

Operations with time zone-aware Timestamp objects

Similar to time series and date ranges, individual Timestamp objects can be localized from naive to time zone-aware and converted from one time zone to another:

stamp=pd.Timestamp('2011-03-12 04:00')

print(stamp.tz)

None

stamp_utc=stamp.tz_localize('utc')

stamp_utc.tz_convert('America/New_York')

Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')

ts.index.tz_localize('utc').tz_convert('America/New_York') # a naive index must be localized before it can be converted to another time zone

DatetimeIndex(['2020-03-09 05:00:00-04:00', '2020-03-10 05:00:00-04:00',
               '2020-03-11 05:00:00-04:00', '2020-03-12 05:00:00-04:00',
               '2020-03-13 05:00:00-04:00', '2020-03-14 05:00:00-04:00'],
              dtype='datetime64[ns, America/New_York]', freq='D')

ts.index.tz_localize('utc').tz

<UTC>
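The ordering constraint between tz_localize and tz_convert can be seen by attempting the conversion on naive data first. A small sketch with an illustrative series (the names and values are made up for this example):

```python
import pandas as pd

rng = pd.date_range('2020-03-09 09:00', periods=3, freq='D')
s = pd.Series(range(3), index=rng)

# Converting a naive series fails: pandas does not know which zone
# the wall-clock times belong to.
try:
    s.tz_convert('America/New_York')
    naive_convert_failed = False
except TypeError:
    naive_convert_failed = True

# Localize first (declare the zone), then convert (change the zone).
s_ny = s.tz_localize('UTC').tz_convert('America/New_York')

print(naive_convert_failed)
print(s_ny.index)
```

tz_localize attaches a zone without moving the instant in wall-clock terms, while tz_convert keeps the instant and changes how it is displayed.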

You can also pass a time zone when creating the Timestamp:

stamp_mscow=pd.Timestamp('2011-03-12 04:00',tz='Europe/Moscow')

stamp_mscow.tz

<DstTzInfo 'Europe/Moscow' MSK+3:00:00 STD>

Time zone-aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the Unix epoch (January 1, 1970); this UTC value is invariant between time zone conversions:

stamp_mscow.value

1299891600000000000

stamp_mscow.tz_convert('America/New_York')

Timestamp('2011-03-11 20:00:00-0500', tz='America/New_York')
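The invariance can be checked by comparing the stored value before and after conversion. A minimal sketch with illustrative names:

```python
import pandas as pd

moscow = pd.Timestamp('2011-03-12 04:00', tz='Europe/Moscow')
new_york = moscow.tz_convert('America/New_York')

# Both objects describe the same instant, so they compare equal
# and share the same underlying UTC value.
same_instant = moscow == new_york
same_value = moscow.value == new_york.value

print(same_instant, same_value)
print(moscow.hour, new_york.hour)  # the wall-clock hours differ
```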

When performing time arithmetic using pandas's DateOffset objects, pandas respects daylight saving time transitions where possible. Here we construct timestamps that occur around DST transitions. First, near the spring transition to DST (in 2012, US clocks moved forward on March 11, so the timestamp below already carries the -04:00 DST offset):

from pandas.tseries.offsets import Hour

stamp=pd.Timestamp('2012-03-12 01:30',tz='US/Eastern')

stamp

Timestamp('2012-03-12 01:30:00-0400', tz='US/Eastern')

stamp+Hour()

Timestamp('2012-03-12 02:30:00-0400', tz='US/Eastern')
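To see the spring transition itself, the timestamp has to fall on the day the clocks change, which in the US in 2012 was March 11. The sketch below is an addition to the original example, with illustrative names:

```python
from datetime import timedelta

import pandas as pd
from pandas.tseries.offsets import Hour

# 30 minutes before clocks jump from 02:00 to 03:00 on 2012-03-11.
before = pd.Timestamp('2012-03-11 01:30', tz='US/Eastern')
after = before + Hour()

# One hour of elapsed time lands at 03:30 wall-clock time,
# and the UTC offset changes from -05:00 to -04:00.
print(before, before.utcoffset())
print(after, after.utcoffset())
```

The wall clock skips the 02:00 hour entirely, while the elapsed time is still exactly one hour.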

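Then, 90 minutes before transitioning out of DST. An ordinary date (October 4) is shown first for comparison; the transition itself occurred on November 4, 2012: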

stamp=pd.Timestamp('2012-10-04 00:30',tz='US/Eastern')

stamp

Timestamp('2012-10-04 00:30:00-0400', tz='US/Eastern')

stamp+2*Hour()  # 02:30; there is no DST transition on this date

Timestamp('2012-10-04 02:30:00-0400', tz='US/Eastern')

stamp=pd.Timestamp('2012-11-04 00:30',tz='US/Eastern')

stamp

Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')

stamp+2*Hour()  # 01:30 on the wall clock, because clocks fall back one hour

Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')

Operations between different time zones

If two time series with different time zones are combined, the result will be UTC. Since the timestamps are stored under the hood in UTC, this is a straightforward operation and requires no conversion to happen:

rng=pd.date_range('3/7/2012 9:30',periods=10,freq='B')

ts=pd.Series(np.random.randn(len(rng)),index=rng)

ts1=ts[:7].tz_localize('Europe/London')

ts2=ts1[2:].tz_convert('Europe/Moscow')

result=ts1+ts2

result.index

DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='B')
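The result index above hides one detail: rows present in only one operand become NaN after alignment. A small sketch with known values makes this visible (the series and names are illustrative, not the random ts above):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2012-03-07 09:30', periods=5, freq='B')
s = pd.Series(np.arange(5.0), index=rng)

london = s.tz_localize('Europe/London')
moscow = london[2:].tz_convert('Europe/Moscow')  # same instants, other zone

# Alignment happens on the underlying UTC instants.
total = london + moscow

print(total.index.tz)
print(total)
```

The first two rows exist only in london, so they come out as NaN; the remaining rows are matched by instant even though the two indexes display different wall-clock times.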

Periods and period arithmetic

Periods represent timespans, like days, months, quarters, or years. The Period class represents this data type, requiring a string or integer.

p=pd.Period(2007,freq='A-DEC') # annual period anchored on the last calendar day of the given month (here December, i.e. year end)

Period('2007', 'A-DEC')

In this case, the Period object represents the full timespan from January 1, 2007, to December 31, 2007, inclusive. Conveniently, adding and subtracting integers from periods has the effect of shifting by their frequency:

p+5

Period('2012', 'A-DEC')

p-2

Period('2005', 'A-DEC')

If two periods have the same frequency, their difference is the number of units between them:

pd.Period('2014','A-DEC')-p

<7 * YearEnds: month=12>
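As the output above shows, the difference comes back as an offset object and not a plain integer. A minimal sketch of pulling out the count, using monthly periods and illustrative names; the getattr fallback is an assumption added here for older pandas versions that returned an integer directly:

```python
import pandas as pd

start = pd.Period('2007-01', freq='M')
end = pd.Period('2007-08', freq='M')

diff = end - start

# The multiple is stored on the offset's n attribute; fall back to the
# value itself if an integer was returned.
n_months = getattr(diff, 'n', diff)

print(diff, n_months)
print(start + n_months == end)
```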

Regular ranges of periods can be constructed with the period_range function:

rng=pd.period_range('2000-01-01','2000-06-30',freq='M')

rng

PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]', freq='M')

The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure:

pd.Series(np.random.randn(6),index=rng)

2000-01    1.337147
2000-02   -0.201512
2000-03   -0.261829
2000-04    0.124229
2000-05   -0.723703
2000-06   -2.130917
Freq: M, dtype: float64

If you have an array of strings, you can also use the PeriodIndex class:

values=['2001Q3','2002Q2','2003Q1']

index=pd.PeriodIndex(values,freq='Q-DEC') # Quarterly dates anchored on last calendar day of each month,for year ending in indicated month.

index

PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]', freq='Q-DEC')

Period frequency conversion

Periods and PeriodIndex objects can be converted to another frequency with their asfreq method. As an example, suppose we had an annual period and wanted to convert it into a monthly period either at the start or end of the year. This is fairly straightforward.

p=pd.Period('2007',freq='A-DEC') # Annual period,ending with the last day of 2007-12

Period('2007', 'A-DEC')

p.asfreq('M',how='start')

Period('2007-01', 'M')

p.asfreq('M',how='end')

Period('2007-12', 'M')

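You can think of Period('2007','A-DEC') as being a sort of cursor pointing to a span of time, subdivided by monthly periods. For a fiscal year ending in a month other than December, the corresponding monthly subperiods are different: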

p=pd.Period('2007','A-JUN') #Annual period,ending with the last day of 2007-06

p.asfreq('M','start')

Period('2006-07', 'M')

p.asfreq('M','end')

Period('2007-06', 'M')

Whole PeriodIndex objects or time series can be similarly converted with the same semantics:

rng=pd.period_range('2006','2009',freq='A-DEC')

ts=pd.Series(np.random.randn(len(rng)),index=rng)

ts

2006   -0.297145
2007   -0.304496
2008    0.705818
2009   -1.829369
Freq: A-DEC, dtype: float64

ts.asfreq('M',how='start')

2006-01   -0.297145
2007-01   -0.304496
2008-01    0.705818
2009-01   -1.829369
Freq: M, dtype: float64

Here, the annual periods are replaced with monthly periods corresponding to the first month falling within each annual period. If we instead wanted the last business day of each year, we can use the 'B' frequency and indicate that we want the end of the period:

ts.asfreq('B',how='end')

2006-12-29   -0.297145
2007-12-31   -0.304496
2008-12-31    0.705818
2009-12-31   -1.829369
Freq: B, dtype: float64

Quarterly period frequencies

Quarterly data is standard in accounting, finance, and other fields. Much quarterly data is reported relative to a fiscal year end, typically the last calendar or business day of one of the 12 months of the year. Thus, the period 2012Q4 has a different meaning depending on the fiscal year end. pandas supports all 12 possible quarterly frequencies as Q-JAN through Q-DEC.

p=pd.Period('2012Q4',freq='Q-JAN')

Period('2012Q4', 'Q-JAN')

In the case of a fiscal year ending in January, 2012Q4 runs from November through January, which you can check by converting to daily frequency.

p.asfreq('D','start') #check p by converting to daily frequency.

Period('2011-11-01', 'D')

p.asfreq('D','end')

Period('2012-01-31', 'D')
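The same check can be written out with the period's own attributes. This sketch uses qyear and quarter alongside asfreq; the variable names are illustrative:

```python
import pandas as pd

p = pd.Period('2012Q4', freq='Q-JAN')

# Fiscal labels carried by the period.
fiscal_year, fiscal_quarter = p.qyear, p.quarter

# Calendar boundaries of the same span.
first_day = p.asfreq('D', how='start')
last_day = p.asfreq('D', how='end')

print(fiscal_year, fiscal_quarter)
print(first_day, last_day)
```

The fiscal label says 2012Q4, yet most of the span lies in calendar year 2011, which is why the fiscal year end has to be part of the frequency.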

Thus, it's possible to do easy period arithmetic. For example, to get the timestamp at 4 PM on the second-to-last business day of the quarter:

p4pm=(p.asfreq('B','e')-1).asfreq('T','s')+16*60  #'T' means 'Minute'

p4pm

Period('2012-01-30 16:00', 'T')

p4pm.to_timestamp()

Timestamp('2012-01-30 16:00:00')

You can generate quarterly ranges using period_range.

rng=pd.period_range('2011Q3','2012Q4',freq='Q-JAN')

ts=pd.Series(np.arange(len(rng)),index=rng)

ts

2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int32

new_rng=(rng.asfreq('B','e')-1).asfreq('T','s')+16*60

ts.index=new_rng.to_timestamp()

ts

2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int32

Converting Timestamps to Periods (and Back)

Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period method:

rng=pd.date_range('2000-01-01',periods=3,freq='M')

ts=pd.Series(np.random.randn(3),index=rng);ts

2000-01-31   -0.115457
2000-02-29   -0.318769
2000-03-31    0.166398
Freq: M, dtype: float64

pts=ts.to_period();pts

2000-01   -0.115457
2000-02   -0.318769
2000-03    0.166398
Freq: M, dtype: float64

type(pts)

pandas.core.series.Series

Since periods refer to non-overlapping timespans, a timestamp can only belong to a single period for a given frequency. While the frequency of the new PeriodIndex is inferred from the timestamps by default, you can specify any frequency you want. There is also no problem with having duplicate periods in the result:

rng=pd.date_range('1/29/2000',periods=6,freq='D')

ts2=pd.Series(np.random.randn(6),index=rng)

ts2

2000-01-29    0.511537
2000-01-30    2.661260
2000-01-31    0.954388
2000-02-01   -0.903825
2000-02-02   -0.399345
2000-02-03    1.160727
Freq: D, dtype: float64

ts2.to_period('M')

2000-01    0.511537
2000-01    2.661260
2000-01    0.954388
2000-02   -0.903825
2000-02   -0.399345
2000-02    1.160727
Freq: M, dtype: float64

To convert back to timestamps, use to_timestamp:

pts=ts2.to_period()

pts.to_timestamp(how='end')

2000-01-29 23:59:59.999999999    0.511537
2000-01-30 23:59:59.999999999    2.661260
2000-01-31 23:59:59.999999999    0.954388
2000-02-01 23:59:59.999999999   -0.903825
2000-02-02 23:59:59.999999999   -0.399345
2000-02-03 23:59:59.999999999    1.160727
Freq: D, dtype: float64
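Duplicate periods produced by to_period can be collapsed with an ordinary groupby before converting back. A minimal sketch with known values (the series and names are illustrative):

```python
import pandas as pd

idx = pd.date_range('2000-01-29', periods=6, freq='D')
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Daily timestamps become monthly periods; each month label repeats.
monthly = s.to_period('M')

# Aggregate the duplicates so each period appears once.
totals = monthly.groupby(level=0).sum()

# Convert back to timestamps at the start of each period.
back = totals.to_timestamp(how='start')

print(monthly.index.is_unique)
print(totals)
print(back)
```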

Creating a PeriodIndex from Arrays

data=pd.read_csv('./pydata-book-2nd-edition/examples/macrodata.csv')

data.head(5)

	year	quarter	realgdp	realcons	realinv	realgovt	realdpi	cpi	m1	tbilrate	unemp	pop	infl	realint
0	1959.0	1.0	2710.349	1707.4	286.898	470.045	1886.9	28.98	139.7	2.82	5.8	177.146	0.00	0.00
1	1959.0	2.0	2778.801	1733.7	310.859	481.301	1919.7	29.15	141.7	3.08	5.1	177.830	2.34	0.74
2	1959.0	3.0	2775.488	1751.8	289.226	491.260	1916.4	29.35	140.5	3.82	5.3	178.657	2.74	1.09
3	1959.0	4.0	2785.204	1753.7	299.356	484.052	1931.3	29.37	140.0	4.33	5.6	179.386	0.27	4.06
4	1960.0	1.0	2847.699	1770.5	331.722	462.199	1955.5	29.54	139.6	3.50	5.2	180.007	2.31	1.19

data.year

0      1959.0
1      1959.0
2      1959.0
3      1959.0
4      1960.0
        ...  
198    2008.0
199    2008.0
200    2009.0
201    2009.0
202    2009.0
Name: year, Length: 203, dtype: float64

data.quarter

0      1.0
1      2.0
2      3.0
3      4.0
4      1.0
      ... 
198    3.0
199    4.0
200    1.0
201    2.0
202    3.0
Name: quarter, Length: 203, dtype: float64

By passing these arrays to PeriodIndex with a frequency, you can combine them to form an index for the DataFrame:

index=pd.PeriodIndex(year=data.year,quarter=data.quarter,freq='Q-DEC')

index

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203, freq='Q-DEC')

data.index=index

data.infl

1959Q1    0.00
1959Q2    2.34
1959Q3    2.74
1959Q4    0.27
1960Q1    2.31
          ... 
2008Q3   -3.16
2008Q4   -8.79
2009Q1    0.94
2009Q2    3.37
2009Q3    3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64
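Newer pandas releases may warn about passing year and quarter as keywords. An alternative that relies only on the string parsing already shown for '2001Q3' is to build the quarter labels first. The sketch below uses a few made-up rows in place of the CSV columns:

```python
import pandas as pd

# Stand-ins for data.year and data.quarter (floats, as in the CSV).
years = pd.Series([1959.0, 1959.0, 1960.0])
quarters = pd.Series([3.0, 4.0, 1.0])

# Build labels such as '1959Q3' and let PeriodIndex parse them.
labels = [f'{int(y)}Q{int(q)}' for y, q in zip(years, quarters)]
index = pd.PeriodIndex(labels, freq='Q-DEC')

print(index)
```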

Resampling and frequency conversion

Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling. Not all resampling falls into either of these categories; for example, converting W-WED (weekly on Wednesday) to W-FRI is neither upsampling nor downsampling.

pandas objects are equipped with a resample method, which is the workhorse function for all frequency conversion. resample has a similar API to groupby; you call resample to group the data, then call an aggregation function:

rng=pd.date_range('2000-01-01',periods=5,freq='D')

ts=pd.Series(np.arange(5),index=rng)

ts.head()

2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
Freq: D, dtype: int32

ts.resample('M').mean()

2000-01-31    2
Freq: M, dtype: int32

ts.resample('M',kind='period').mean()

2000-01    2
Freq: M, dtype: int32
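The groupby analogy can be made concrete: for a simple left-closed downsample, resampling gives the same result as grouping on the index floored to the bin width. A sketch under that assumption, with an illustrative minute-level series:

```python
import pandas as pd

idx = pd.date_range('2000-01-01', periods=12, freq='min')
s = pd.Series(range(12), index=idx)

# Resample into five-minute bins and sum each bin.
resampled = s.resample('5min').sum()

# The same grouping done by hand: floor each timestamp to its bin start.
grouped = s.groupby(s.index.floor('5min')).sum()

print(resampled)
print(grouped)
```

resample additionally handles bin edges, labels, and empty bins, which the hand-written groupby does not.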

help(pd.Series.resample)

Help on function resample in module pandas.core.generic:

resample(self, rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)
    Resample time-series data.
    
    Convenience method for frequency conversion and resampling of time
    series. Object must have a datetime-like index (`DatetimeIndex`,
    `PeriodIndex`, or `TimedeltaIndex`), or pass datetime-like values
    to the `on` or `level` keyword.
    
    Parameters
    ----------
    rule : DateOffset, Timedelta or str
        The offset string or object representing target conversion.
    how : str
        Method for down/re-sampling, default to 'mean' for downsampling.
    
        .. deprecated:: 0.18.0
           The new syntax is ``.resample(...).mean()``, or
           ``.resample(...).apply(<func>)``
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Which axis to use for up- or down-sampling. For `Series` this
        will default to 0, i.e. along the rows. Must be
        `DatetimeIndex`, `TimedeltaIndex` or `PeriodIndex`.
    fill_method : str, default None
        Filling method for upsampling.
    
        .. deprecated:: 0.18.0
           The new syntax is ``.resample(...).<func>()``,
           e.g. ``.resample(...).pad()``
    closed : {'right', 'left'}, default None
        Which side of bin interval is closed. The default is 'left'
        for all frequency offsets except for 'M', 'A', 'Q', 'BM',
        'BA', 'BQ', and 'W' which all have a default of 'right'.
    label : {'right', 'left'}, default None
        Which bin edge label to label bucket with. The default is 'left'
        for all frequency offsets except for 'M', 'A', 'Q', 'BM',
        'BA', 'BQ', and 'W' which all have a default of 'right'.
    convention : {'start', 'end', 's', 'e'}, default 'start'
        For `PeriodIndex` only, controls whether to use the start or
        end of `rule`.
    kind : {'timestamp', 'period'}, optional, default None
        Pass 'timestamp' to convert the resulting index to a
        `DateTimeIndex` or 'period' to convert it to a `PeriodIndex`.
        By default the input representation is retained.
    loffset : timedelta, default None
        Adjust the resampled time labels.
    limit : int, default None
        Maximum size gap when reindexing with `fill_method`.
    
        .. deprecated:: 0.18.0
    base : int, default 0
        For frequencies that evenly subdivide 1 day, the "origin" of the
        aggregated intervals. For example, for '5min' frequency, base could
        range from 0 through 4. Defaults to 0.
    on : str, optional
        For a DataFrame, column to use instead of index for resampling.
        Column must be datetime-like.
    
        .. versionadded:: 0.19.0
    
    level : str or int, optional
        For a MultiIndex, level (name or number) to use for
        resampling. `level` must be datetime-like.
    
        .. versionadded:: 0.19.0
    
    Returns
    -------
    Resampler object
    
    See Also
    --------
    groupby : Group by mapping, function, label, or list of labels.
    Series.resample : Resample a Series.
    DataFrame.resample: Resample a DataFrame.
    
    Notes
    -----
    See the `user guide
    <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling>`_
    for more.
    
    To learn more about the offset strings, please see `this link
    <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects>`__.
    
    Examples
    --------
    
    Start by creating a series with 9 one minute timestamps.
    
    >>> index = pd.date_range('1/1/2000', periods=9, freq='T')
    >>> series = pd.Series(range(9), index=index)
    >>> series
    2000-01-01 00:00:00    0
    2000-01-01 00:01:00    1
    2000-01-01 00:02:00    2
    2000-01-01 00:03:00    3
    2000-01-01 00:04:00    4
    2000-01-01 00:05:00    5
    2000-01-01 00:06:00    6
    2000-01-01 00:07:00    7
    2000-01-01 00:08:00    8
    Freq: T, dtype: int64
    
    Downsample the series into 3 minute bins and sum the values
    of the timestamps falling into a bin.
    
    >>> series.resample('3T').sum()
    2000-01-01 00:00:00     3
    2000-01-01 00:03:00    12
    2000-01-01 00:06:00    21
    Freq: 3T, dtype: int64
    
    Downsample the series into 3 minute bins as above, but label each
    bin using the right edge instead of the left. Please note that the
    value in the bucket used as the label is not included in the bucket,
    which it labels. For example, in the original series the
    bucket ``2000-01-01 00:03:00`` contains the value 3, but the summed
    value in the resampled bucket with the label ``2000-01-01 00:03:00``
    does not include 3 (if it did, the summed value would be 6, not 3).
    To include this value close the right side of the bin interval as
    illustrated in the example below this one.
    
    >>> series.resample('3T', label='right').sum()
    2000-01-01 00:03:00     3
    2000-01-01 00:06:00    12
    2000-01-01 00:09:00    21
    Freq: 3T, dtype: int64
    
    Downsample the series into 3 minute bins as above, but close the right
    side of the bin interval.
    
    >>> series.resample('3T', label='right', closed='right').sum()
    2000-01-01 00:00:00     0
    2000-01-01 00:03:00     6
    2000-01-01 00:06:00    15
    2000-01-01 00:09:00    15
    Freq: 3T, dtype: int64
    
    Upsample the series into 30 second bins.
    
    >>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows
    2000-01-01 00:00:00   0.0
    2000-01-01 00:00:30   NaN
    2000-01-01 00:01:00   1.0
    2000-01-01 00:01:30   NaN
    2000-01-01 00:02:00   2.0
    Freq: 30S, dtype: float64
    
    Upsample the series into 30 second bins and fill the ``NaN``
    values using the ``pad`` method.
    
    >>> series.resample('30S').pad()[0:5]
    2000-01-01 00:00:00    0
    2000-01-01 00:00:30    0
    2000-01-01 00:01:00    1
    2000-01-01 00:01:30    1
    2000-01-01 00:02:00    2
    Freq: 30S, dtype: int64
    
    Upsample the series into 30 second bins and fill the
    ``NaN`` values using the ``bfill`` method.
    
    >>> series.resample('30S').bfill()[0:5]
    2000-01-01 00:00:00    0
    2000-01-01 00:00:30    1
    2000-01-01 00:01:00    1
    2000-01-01 00:01:30    2
    2000-01-01 00:02:00    2
    Freq: 30S, dtype: int64
    
    Pass a custom function via ``apply``
    
    >>> def custom_resampler(array_like):
    ...     return np.sum(array_like) + 5
    ...
    >>> series.resample('3T').apply(custom_resampler)
    2000-01-01 00:00:00     8
    2000-01-01 00:03:00    17
    2000-01-01 00:06:00    26
    Freq: 3T, dtype: int64
    
    For a Series with a PeriodIndex, the keyword `convention` can be
    used to control whether to use the start or end of `rule`.
    
    Resample a year by quarter using 'start' `convention`. Values are
    assigned to the first quarter of the period.
    
    >>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
    ...                                             freq='A',
    ...                                             periods=2))
    >>> s
    2012    1
    2013    2
    Freq: A-DEC, dtype: int64
    >>> s.resample('Q', convention='start').asfreq()
    2012Q1    1.0
    2012Q2    NaN
    2012Q3    NaN
    2012Q4    NaN
    2013Q1    2.0
    2013Q2    NaN
    2013Q3    NaN
    2013Q4    NaN
    Freq: Q-DEC, dtype: float64
    
    Resample quarters by month using 'end' `convention`. Values are
    assigned to the last month of the period.
    
    >>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',
    ...                                                   freq='Q',
    ...                                                   periods=4))
    >>> q
    2018Q1    1
    2018Q2    2
    2018Q3    3
    2018Q4    4
    Freq: Q-DEC, dtype: int64
    >>> q.resample('M', convention='end').asfreq()
    2018-03    1.0
    2018-04    NaN
    2018-05    NaN
    2018-06    2.0
    2018-07    NaN
    2018-08    NaN
    2018-09    3.0
    2018-10    NaN
    2018-11    NaN
    2018-12    4.0
    Freq: M, dtype: float64
    
    For DataFrame objects, the keyword `on` can be used to specify the
    column instead of the index for resampling.
    
    >>> d = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
    ...           'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
    >>> df = pd.DataFrame(d)
    >>> df['week_starting'] = pd.date_range('01/01/2018',
    ...                                     periods=8,
    ...                                     freq='W')
    >>> df
       price  volume week_starting
    0     10      50    2018-01-07
    1     11      60    2018-01-14
    2      9      40    2018-01-21
    3     13     100    2018-01-28
    4     14      50    2018-02-04
    5     18     100    2018-02-11
    6     17      40    2018-02-18
    7     19      50    2018-02-25
    >>> df.resample('M', on='week_starting').mean()
                   price  volume
    week_starting
    2018-01-31     10.75    62.5
    2018-02-28     17.00    60.0
    
    For a DataFrame with MultiIndex, the keyword `level` can be used to
    specify on which level the resampling needs to take place.
    
    >>> days = pd.date_range('1/1/2000', periods=4, freq='D')
    >>> d2 = dict({'price': [10, 11, 9, 13, 14, 18, 17, 19],
    ...            'volume': [50, 60, 40, 100, 50, 100, 40, 50]})
    >>> df2 = pd.DataFrame(d2,
    ...                    index=pd.MultiIndex.from_product([days,
    ...                                                     ['morning',
    ...                                                      'afternoon']]
    ...                                                     ))
    >>> df2
                          price  volume
    2000-01-01 morning       10      50
               afternoon     11      60
    2000-01-02 morning        9      40
               afternoon     13     100
    2000-01-03 morning       14      50
               afternoon     18     100
    2000-01-04 morning       17      40
               afternoon     19      50
    >>> df2.resample('D', level=0).sum()
                price  volume
    2000-01-01     21     110
    2000-01-02     22     140
    2000-01-03     32     150
    2000-01-04     36      90

help(pd.MultiIndex.from_product)

Help on method from_product in module pandas.core.indexes.multi:

from_product(iterables, sortorder=None, names=None) method of builtins.type instance
    Make a MultiIndex from the cartesian product of multiple iterables.
    
    Parameters
    ----------
    iterables : list / sequence of iterables
        Each iterable has unique labels for each level of the index.
    sortorder : int or None
        Level of sortedness (must be lexicographically sorted by that
        level).
    names : list / sequence of str, optional
        Names for the levels in the index.
    
    Returns
    -------
    index : MultiIndex
    
    See Also
    --------
    MultiIndex.from_arrays : Convert list of arrays to MultiIndex.
    MultiIndex.from_tuples : Convert list of tuples to MultiIndex.
    MultiIndex.from_frame : Make a MultiIndex from a DataFrame.
    
    Examples
    --------
    >>> numbers = [0, 1, 2]
    >>> colors = ['green', 'purple']
    >>> pd.MultiIndex.from_product([numbers, colors],
    ...                            names=['number', 'color'])
    MultiIndex([(0,  'green'),
                (0, 'purple'),
                (1,  'green'),
                (1, 'purple'),
                (2,  'green'),
                (2, 'purple')],
               names=['number', 'color'])

tngt=pd.date_range('1/1/2020',periods=9,freq='T')

tngt

DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 00:01:00',
               '2020-01-01 00:02:00', '2020-01-01 00:03:00',
               '2020-01-01 00:04:00', '2020-01-01 00:05:00',
               '2020-01-01 00:06:00', '2020-01-01 00:07:00',
               '2020-01-01 00:08:00'],
              dtype='datetime64[ns]', freq='T')

sert=pd.Series(np.arange(9),index=tngt)

sert

2020-01-01 00:00:00    0
2020-01-01 00:01:00    1
2020-01-01 00:02:00    2
2020-01-01 00:03:00    3
2020-01-01 00:04:00    4
2020-01-01 00:05:00    5
2020-01-01 00:06:00    6
2020-01-01 00:07:00    7
2020-01-01 00:08:00    8
Freq: T, dtype: int32

sert.resample('30S')

<pandas.core.resample.DatetimeIndexResampler object at 0x0000014F20423860>

help(pd.core.resample.DatetimeIndexResampler.asfreq)

Help on function asfreq in module pandas.core.resample:

asfreq(self, fill_value=None)
    Return the values at the new freq, essentially a reindex.
    
    Parameters
    ----------
    fill_value : scalar, optional
        Value to use for missing values, applied during upsampling (note
        this does not fill NaNs that already were present).
    
        .. versionadded:: 0.20.0
    
    Returns
    -------
    DataFrame or Series
        Values at the specified freq.
    
    See Also
    --------
    Series.asfreq
    DataFrame.asfreq

sert.resample('30S').asfreq()

2020-01-01 00:00:00    0.0
2020-01-01 00:00:30    NaN
2020-01-01 00:01:00    1.0
2020-01-01 00:01:30    NaN
2020-01-01 00:02:00    2.0
2020-01-01 00:02:30    NaN
2020-01-01 00:03:00    3.0
2020-01-01 00:03:30    NaN
2020-01-01 00:04:00    4.0
2020-01-01 00:04:30    NaN
2020-01-01 00:05:00    5.0
2020-01-01 00:05:30    NaN
2020-01-01 00:06:00    6.0
2020-01-01 00:06:30    NaN
2020-01-01 00:07:00    7.0
2020-01-01 00:07:30    NaN
2020-01-01 00:08:00    8.0
Freq: 30S, dtype: float64
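The NaN rows introduced by upsampling can be filled in several ways. A minimal sketch on a three-point series (illustrative names and values), comparing asfreq, forward fill, and a constant fill value:

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=3, freq='min')
s = pd.Series([0, 1, 2], index=idx)

up = s.resample('30s')

gaps = up.asfreq()               # new rows are left as NaN
filled = up.ffill()              # carry the previous value forward
zeros = up.asfreq(fill_value=0)  # fill new rows with a constant

print(gaps)
print(filled)
print(zeros)
```

Forward filling keeps the integer dtype, whereas plain asfreq has to switch to float to hold NaN.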

ts.resample('M',kind='period').mean()

2000-01    2
Freq: M, dtype: int32

ts

2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
Freq: D, dtype: int32

ts.resample('M',kind='timestamp').mean()

2000-01-31    2
Freq: M, dtype: int32

ts.resample('M').mean()

2000-01-31    2
Freq: M, dtype: int32

resample is a flexible and high-performance method that can be used to process very large time series.

The resample arguments and their meanings:

rule (sometimes written freq): string or DateOffset indicating the desired resampled frequency (e.g., 'M', '5min', or Second(15))
axis: axis to resample on; default axis=0
fill_method: how to interpolate when upsampling, as in 'ffill' or 'bfill'; by default does no interpolation (marked deprecated in the help output above)
closed: in downsampling, which end of each interval is closed (inclusive), 'right' or 'left'
label: in downsampling, how to label the aggregated result, with the 'right' or 'left' bin edge (e.g., the 9:30 to 9:35 five-minute interval could be labeled 9:30 or 9:35)
loffset: time adjustment to the bin labels, such as '-1s'/Second(-1) to shift the aggregate labels one second earlier
limit: when forward or backward filling, the maximum number of periods to fill (marked deprecated in the help output above)
kind: aggregate to periods ('period') or timestamps ('timestamp'); defaults to the type of index the time series has
convention: when resampling periods, the convention ('start' or 'end') for converting the low-frequency period to high frequency; defaults to 'start', as the signature in the help output above shows

Downsampling

Aggregating data to a regular, lower frequency is a pretty normal time series task. The data you are aggregating does not need to have a fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, 'M' or 'BM', you need to chop up the data into one-month intervals. Each interval is half-open, so a data point can belong to only one interval, and the union of the intervals must make up the whole time frame. There are a couple of things to think about when using resample to downsample data:

Which side of each interval is closed
How to label each aggregated bin,either with the start of the interval or the end

rng=pd.date_range('2000-01-01',periods=12,freq='T')

ts=pd.Series(np.arange(12),index=rng)

ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32

ts.resample('5min',closed='right').sum() # with closed='right' each bin has the form (a,b]; by default the result is labeled with the left bin edge
# bins: (23:55,00:00],(00:00,00:05],(00:05,00:10],(00:10,00:15]

1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32

ts.resample('5min',closed='right',label='right').sum() # same bins as above, but labeled with the right bin edge
# bins: (23:55,00:00],(00:00,00:05],(00:05,00:10],(00:10,00:15]

2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32

ts.resample('5min',closed='left').sum() # with closed='left' each bin has the form [a,b): [00:00,00:05),[00:05,00:10),[00:10,00:15)
# the result is labeled with the left bin edge

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32

ts.resample('5min',closed='left',label='right').sum() # same bins as above, but labeled with the right bin edge

2000-01-01 00:05:00    10
2000-01-01 00:10:00    35
2000-01-01 00:15:00    21
Freq: 5T, dtype: int32
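As a cross-check on closed and label, the sketch below rebuilds two of the results above with a plain groupby on rounded timestamps. It recreates the same ts as above; the variable names (left, by_floor, right, by_ceil) are illustrative only, and the comparison is meant as an aid to intuition rather than a description of how resample is implemented.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# closed='left', label='left': each timestamp falls into the bin that starts
# at its time rounded down to a multiple of 5 minutes.
left = ts.resample('5min', closed='left', label='left').sum()
by_floor = ts.groupby(ts.index.floor('5min')).sum()

# closed='right', label='right': each timestamp falls into the bin that ends
# at its time rounded up to a multiple of 5 minutes.
right = ts.resample('5min', closed='right', label='right').sum()
by_ceil = ts.groupby(ts.index.ceil('5min')).sum()

print(left.tolist(), by_floor.tolist())    # both [10, 35, 21]
print(right.tolist(), by_ceil.tolist())    # both [0, 15, 40, 11]
```

Note how 00:00 sits alone in its own bin when the right edge is closed, which is why that variant produces four rows instead of three.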

Lastly, you might want to shift the result index by some amount, say subtracting one second from the right edge, to make it clearer which interval the timestamp refers to. To do this, pass a string or date offset to loffset:

ts.resample('5min',closed='right',label='right',loffset='-1s').sum()

1999-12-31 23:59:59     0
2000-01-01 00:04:59    15
2000-01-01 00:09:59    40
2000-01-01 00:14:59    11
Freq: 5T, dtype: int32

You could also have achieved the effect of loffset by calling the shift method on the result, without passing loffset. Keep label='right' so that the labels match the loffset example:

ts.resample('5min',closed='right',label='right').sum().shift(1,freq='-1s')

1999-12-31 23:59:59     0
2000-01-01 00:04:59    15
2000-01-01 00:09:59    40
2000-01-01 00:14:59    11
Freq: 5T, dtype: int32
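Recent pandas releases no longer accept loffset in resample (to the best of my knowledge it was deprecated in 1.1 and removed in 2.0). A minimal sketch of a version-independent alternative is to subtract a Timedelta from the index of the result; the name res is just an illustrative choice.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Aggregate first, labeling each bin with its right edge ...
res = ts.resample('5min', closed='right', label='right').sum()

# ... then move every label one second earlier.
res.index = res.index - pd.Timedelta('1s')

print(res)
```

The labels become 23:59:59, 00:04:59, 00:09:59 and 00:14:59, the same as in the loffset example.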

Open-High-Low-Close(OHLC) resampling

In finance, a popular way to aggregate a time series is to compute four values for each bucket: the first (open), last (close), maximum (high), and minimum (low) values. By using the ohlc aggregate function you obtain a DataFrame whose columns contain these four aggregates, which are efficiently computed in a single sweep of the data:

ts.resample('5min').ohlc()

	open	high	low	close
2000-01-01 00:00:00	0	4	0	4
2000-01-01 00:05:00	5	9	5	9
2000-01-01 00:10:00	10	11	10	11
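To make the four columns concrete, the following sketch compares ohlc with the four individual aggregations it bundles. It recreates the ts from earlier; manual is an illustrative name, and the column order is rearranged by hand to match ohlc.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

ohlc = ts.resample('5min').ohlc()

# The same four statistics requested one by one, in open/high/low/close order.
manual = ts.resample('5min').agg(['first', 'max', 'min', 'last'])
manual.columns = ['open', 'high', 'low', 'close']

print((ohlc.values == manual.values).all())  # True
```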

Upsampling and interpolation

When converting from a low frequency to a higher frequency,no aggregation is needed.

frame=pd.DataFrame(np.random.randn(2,4),index=pd.date_range('1/1/2000',periods=2,freq='W-WED'),columns=['Colorado','Texa','New York','Ohio'])

frame

	Colorado	Texa	New York	Ohio
2000-01-05	-0.739109	0.781223	0.570884	-1.616556
2000-01-12	-0.179902	2.531560	-2.658008	0.946870

When you use an aggregation function with this data, there is only one value per group, and missing values result in the gaps. We use the asfreq method to convert to the higher frequency without any aggregation:

df_daily=frame.resample('D').asfreq()

df_daily

	Colorado	Texa	New York	Ohio
2000-01-05	-0.739109	0.781223	0.570884	-1.616556
2000-01-06	NaN	NaN	NaN	NaN
2000-01-07	NaN	NaN	NaN	NaN
2000-01-08	NaN	NaN	NaN	NaN
2000-01-09	NaN	NaN	NaN	NaN
2000-01-10	NaN	NaN	NaN	NaN
2000-01-11	NaN	NaN	NaN	NaN
2000-01-12	-0.179902	2.531560	-2.658008	0.946870

Suppose you wanted to fill forward each weekly value on the non-Wednesdays. The same filling or interpolation methods available in the fillna and reindex methods are available for resampling:

frame.resample('D').ffill()

	Colorado	Texa	New York	Ohio
2000-01-05	-0.739109	0.781223	0.570884	-1.616556
2000-01-06	-0.739109	0.781223	0.570884	-1.616556
2000-01-07	-0.739109	0.781223	0.570884	-1.616556
2000-01-08	-0.739109	0.781223	0.570884	-1.616556
2000-01-09	-0.739109	0.781223	0.570884	-1.616556
2000-01-10	-0.739109	0.781223	0.570884	-1.616556
2000-01-11	-0.739109	0.781223	0.570884	-1.616556
2000-01-12	-0.179902	2.531560	-2.658008	0.946870

You can similarly choose to only fill a certain number of periods forward to limit how far to continue using an observed value:

frame.resample('D').ffill(limit=2)

	Colorado	Texa	New York	Ohio
2000-01-05	-0.739109	0.781223	0.570884	-1.616556
2000-01-06	-0.739109	0.781223	0.570884	-1.616556
2000-01-07	-0.739109	0.781223	0.570884	-1.616556
2000-01-08	NaN	NaN	NaN	NaN
2000-01-09	NaN	NaN	NaN	NaN
2000-01-10	NaN	NaN	NaN	NaN
2000-01-11	NaN	NaN	NaN	NaN
2000-01-12	-0.179902	2.531560	-2.658008	0.946870

Notably,the new date index need not overlap with the old one at all:

frame.resample('W-THU').ffill()

	Colorado	Texa	New York	Ohio
2000-01-06	-0.739109	0.781223	0.570884	-1.616556
2000-01-13	-0.179902	2.531560	-2.658008	0.946870
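The examples in this section only fill forward. As a sketch of actual interpolation, the gaps left by asfreq can be filled linearly with interpolate. The tiny weekly series below is made up for illustration so that the result is easy to verify by eye.

```python
import pandas as pd

# Two weekly observations, on Wednesday 2000-01-05 and Wednesday 2000-01-12.
idx = pd.date_range('2000-01-05', periods=2, freq='W-WED')
weekly = pd.Series([0.0, 7.0], index=idx)

# Upsample to daily frequency (NaN in the gaps), then interpolate linearly.
daily = weekly.resample('D').asfreq()
linear = daily.interpolate()

print(linear.tolist())  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```

Unlike ffill, which repeats the last observation, linear interpolation uses both neighbouring observations, so it should only be applied when values between observations can reasonably be assumed to change gradually.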

Resample with periods

Resampling data indexed by periods is similar to timestamps:

frame=pd.DataFrame(np.random.randn(24,4),index=pd.period_range('1-2000','12-2001',freq='M'),columns=['Colorado','Texas','New York','Ohio'])

frame[:5]

	Colorado	Texas	New York	Ohio
2000-01	-0.829993	-1.129430	1.320036	0.275144
2000-02	2.511115	-0.306541	0.472983	0.220395
2000-03	-0.037656	0.776638	0.428096	-0.274698
2000-04	-1.116895	-0.353303	-0.642274	1.469136
2000-05	0.975105	-1.160983	0.459956	-0.834690

annual_frame=frame.resample('A-DEC').mean()

annual_frame

	Colorado	Texas	New York	Ohio
2000	0.002307	-0.277951	-0.030337	-0.273060
2001	0.126581	0.275503	-0.209550	0.241073

Upsampling is more nuanced, as you must make a decision about which end of the timespan in the new frequency to place the values before resampling, just like the asfreq method. The convention argument defaults to 'start' but can also be 'end'.

annual_frame.resample('Q-DEC').ffill()

	Colorado	Texas	New York	Ohio
2000Q1	0.002307	-0.277951	-0.030337	-0.273060
2000Q2	0.002307	-0.277951	-0.030337	-0.273060
2000Q3	0.002307	-0.277951	-0.030337	-0.273060
2000Q4	0.002307	-0.277951	-0.030337	-0.273060
2001Q1	0.126581	0.275503	-0.209550	0.241073
2001Q2	0.126581	0.275503	-0.209550	0.241073
2001Q3	0.126581	0.275503	-0.209550	0.241073
2001Q4	0.126581	0.275503	-0.209550	0.241073

annual_frame.resample('Q-DEC',convention='end').ffill()

	Colorado	Texas	New York	Ohio
2000Q4	0.002307	-0.277951	-0.030337	-0.273060
2001Q1	0.002307	-0.277951	-0.030337	-0.273060
2001Q2	0.002307	-0.277951	-0.030337	-0.273060
2001Q3	0.002307	-0.277951	-0.030337	-0.273060
2001Q4	0.126581	0.275503	-0.209550	0.241073
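The start/end choice can be seen on a single Period, where asfreq exposes it through its how argument. The sketch below uses a quarter converted to months purely as an illustration; it is not taken from the frames above.

```python
import pandas as pd

q = pd.Period('2000Q1', freq='Q-DEC')

# Place the quarter at its first or at its last month.
first_month = q.asfreq('M', how='start')
last_month = q.asfreq('M', how='end')

print(first_month, last_month)  # 2000-01 2000-03
```

With convention='start' the low-frequency value is placed at the first sub-period and filled forward from there; with convention='end' it is placed at the last one, which is why the second table above begins at 2000Q4.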

Moving window functions

An important class of array transformations used for time series operations are statistics and other functions evaluated over a sliding window or with exponentially decaying weights. This can be useful for smoothing noisy or gappy data.

close_px_all=pd.read_csv(r'./pydata-book-2nd-edition/examples/stock_px_2.csv',parse_dates=True,index_col=0)

close_px_all.head()

	AAPL	MSFT	XOM	SPX
2003-01-02	7.40	21.11	29.22	909.03
2003-01-03	7.45	21.14	29.24	908.59
2003-01-06	7.45	21.52	29.96	929.01
2003-01-07	7.43	21.93	28.95	922.93
2003-01-08	7.28	21.31	28.83	909.93

close_px=close_px_all[['AAPL','MSFT','XOM']]

close_px=close_px.resample('B').ffill()

The rolling operator behaves similarly to resample and groupby. It can be called on a Series or DataFrame along with a window (expressed as a number of periods):

close_px.AAPL.plot()

[Figure: AAPL closing price]

close_px.AAPL.rolling(300).mean().plot()

[Figure: 300-period moving average of the AAPL closing price]

close_px.AAPL.rolling(250).mean().plot()

[Figure: 250-period moving average of the AAPL closing price]

help(close_px.AAPL.rolling)

Help on method rolling in module pandas.core.generic:

rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None) method of pandas.core.series.Series instance
    Provide rolling window calculations.
    
    .. versionadded:: 0.18.0
    
    Parameters
    ----------
    window : int, or offset
        Size of the moving window. This is the number of observations used for
        calculating the statistic. Each window will be a fixed size.
    
        If its an offset then this will be the time period of each window. Each
        window will be a variable sized based on the observations included in
        the time-period. This is only valid for datetimelike indexes. This is
        new in 0.19.0
    min_periods : int, default None
        Minimum number of observations in window required to have a value
        (otherwise result is NA). For a window that is specified by an offset,
        `min_periods` will default to 1. Otherwise, `min_periods` will default
        to the size of the window.
    center : bool, default False
        Set the labels at the center of the window.
    win_type : str, default None
        Provide a window type. If ``None``, all points are evenly weighted.
        See the notes below for further information.
    on : str, optional
        For a DataFrame, a datetime-like column on which to calculate the rolling
        window, rather than the DataFrame's index. Provided integer column is
        ignored and excluded from result since an integer index is not used to
        calculate the rolling window.
    axis : int or str, default 0
    closed : str, default None
        Make the interval closed on the 'right', 'left', 'both' or
        'neither' endpoints.
        For offset-based windows, it defaults to 'right'.
        For fixed windows, defaults to 'both'. Remaining cases not implemented
        for fixed windows.
    
        .. versionadded:: 0.20.0
    
    Returns
    -------
    a Window or Rolling sub-classed for the particular operation
    
    See Also
    --------
    expanding : Provides expanding transformations.
    ewm : Provides exponential weighted functions.
    
    Notes
    -----
    By default, the result is set to the right edge of the window. This can be
    changed to the center of the window by setting ``center=True``.
    
    To learn more about the offsets & frequency strings, please see `this link
    <http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases>`__.
    
    The recognized win_types are:
    
    * ``boxcar``
    * ``triang``
    * ``blackman``
    * ``hamming``
    * ``bartlett``
    * ``parzen``
    * ``bohman``
    * ``blackmanharris``
    * ``nuttall``
    * ``barthann``
    * ``kaiser`` (needs beta)
    * ``gaussian`` (needs std)
    * ``general_gaussian`` (needs power, width)
    * ``slepian`` (needs width)
    * ``exponential`` (needs tau), center is set to None.
    
    If ``win_type=None`` all points are evenly weighted. To learn more about
    different window types see `scipy.signal window functions
    <https://docs.scipy.org/doc/scipy/reference/signal.html#window-functions>`__.
    
    Examples
    --------
    
    >>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
    >>> df
         B
    0  0.0
    1  1.0
    2  2.0
    3  NaN
    4  4.0
    
    Rolling sum with a window length of 2, using the 'triang'
    window type.
    
    >>> df.rolling(2, win_type='triang').sum()
         B
    0  NaN
    1  0.5
    2  1.5
    3  NaN
    4  NaN
    
    Rolling sum with a window length of 2, min_periods defaults
    to the window length.
    
    >>> df.rolling(2).sum()
         B
    0  NaN
    1  1.0
    2  3.0
    3  NaN
    4  NaN
    
    Same as above, but explicitly set the min_periods
    
    >>> df.rolling(2, min_periods=1).sum()
         B
    0  0.0
    1  1.0
    2  3.0
    3  2.0
    4  4.0
    
    A ragged (meaning not-a-regular frequency), time-indexed DataFrame
    
    >>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
    ...                   index = [pd.Timestamp('20130101 09:00:00'),
    ...                            pd.Timestamp('20130101 09:00:02'),
    ...                            pd.Timestamp('20130101 09:00:03'),
    ...                            pd.Timestamp('20130101 09:00:05'),
    ...                            pd.Timestamp('20130101 09:00:06')])
    
    >>> df
                           B
    2013-01-01 09:00:00  0.0
    2013-01-01 09:00:02  1.0
    2013-01-01 09:00:03  2.0
    2013-01-01 09:00:05  NaN
    2013-01-01 09:00:06  4.0
    
    Contrasting to an integer rolling window, this will roll a variable
    length window corresponding to the time period.
    The default for min_periods is 1.
    
    >>> df.rolling('2s').sum()
                           B
    2013-01-01 09:00:00  0.0
    2013-01-01 09:00:02  1.0
    2013-01-01 09:00:03  3.0
    2013-01-01 09:00:05  NaN
    2013-01-01 09:00:06  4.0

To illustrate the meaning of rolling,look at the following example:

index=pd.date_range('2020-5-5',periods=20)

df=pd.DataFrame(np.arange(20),index=index,columns=['test']);df

	test
2020-05-05	0
2020-05-06	1
2020-05-07	2
2020-05-08	3
2020-05-09	4
2020-05-10	5
2020-05-11	6
2020-05-12	7
2020-05-13	8
2020-05-14	9
2020-05-15	10
2020-05-16	11
2020-05-17	12
2020-05-18	13
2020-05-19	14
2020-05-20	15
2020-05-21	16
2020-05-22	17
2020-05-23	18
2020-05-24	19

df['sum']=df.test.rolling(3).sum()

df['mean']=df.test.rolling(3).mean()


df['mean1'] = df.test.rolling(3,min_periods=2).mean()

df['expanding']=df.test.expanding().mean()

df

	test	sum	mean	mean1	expanding
2020-05-05	0	NaN	NaN	NaN	0.0
2020-05-06	1	NaN	NaN	0.5	0.5
2020-05-07	2	3.0	1.0	1.0	1.0
2020-05-08	3	6.0	2.0	2.0	1.5
2020-05-09	4	9.0	3.0	3.0	2.0
2020-05-10	5	12.0	4.0	4.0	2.5
2020-05-11	6	15.0	5.0	5.0	3.0
2020-05-12	7	18.0	6.0	6.0	3.5
2020-05-13	8	21.0	7.0	7.0	4.0
2020-05-14	9	24.0	8.0	8.0	4.5
2020-05-15	10	27.0	9.0	9.0	5.0
2020-05-16	11	30.0	10.0	10.0	5.5
2020-05-17	12	33.0	11.0	11.0	6.0
2020-05-18	13	36.0	12.0	12.0	6.5
2020-05-19	14	39.0	13.0	13.0	7.0
2020-05-20	15	42.0	14.0	14.0	7.5
2020-05-21	16	45.0	15.0	15.0	8.0
2020-05-22	17	48.0	16.0	16.0	8.5
2020-05-23	18	51.0	17.0	17.0	9.0
2020-05-24	19	54.0	18.0	18.0	9.5

df['test'].rolling(2).agg([np.sum,np.mean]) #Using agg function to return several results.

	sum	mean
2020-05-05	NaN	NaN
2020-05-06	1.0	0.5
2020-05-07	3.0	1.5
2020-05-08	5.0	2.5
2020-05-09	7.0	3.5
2020-05-10	9.0	4.5
2020-05-11	11.0	5.5
2020-05-12	13.0	6.5
2020-05-13	15.0	7.5
2020-05-14	17.0	8.5
2020-05-15	19.0	9.5
2020-05-16	21.0	10.5
2020-05-17	23.0	11.5
2020-05-18	25.0	12.5
2020-05-19	27.0	13.5
2020-05-20	29.0	14.5
2020-05-21	31.0	15.5
2020-05-22	33.0	16.5
2020-05-23	35.0	17.5
2020-05-24	37.0	18.5

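Look at the df table above (the one with the sum, mean, mean1, and expanding columns): since window=3 and min_periods defaults to the window size, the first two values of sum and mean are NaN. For sum, 3=0+1+2, 6=1+2+3, 9=2+3+4, and so on. With min_periods=2, mean1 already has a value in the second row (0.5, the mean of 0 and 1).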
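The same arithmetic can be spelled out with explicit slices. This sketch assumes an integer-indexed copy of the test column (called s here) and rebuilds rolling(3).sum() by hand; manual is an illustrative name.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(20))

rolled = s.rolling(3).sum()

# For position i the window covers positions i-2, i-1 and i.
# The first two positions do not have three observations yet, hence NaN.
manual = pd.Series(
    [np.nan if i < 2 else s.iloc[i - 2:i + 1].sum() for i in range(len(s))]
)

print(rolled.head(5).tolist())  # [nan, nan, 3.0, 6.0, 9.0]
```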

The expression rolling(250) is similar in behavior to groupby, but instead of grouping, it creates an object that enables grouping over a 250-day sliding window. By default, rolling functions require all of the values in the window to be non-NA. This behavior can be changed to account for missing data and, in particular, the fact that you will have fewer than window periods of data at the beginning of the time series:

appl_std250=close_px.AAPL.rolling(250,min_periods=10).std()

appl_std250[5:12]

2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15    0.077496
2003-01-16    0.074760
2003-01-17    0.112368
Freq: B, Name: AAPL, dtype: float64

appl_std250.plot()

[Figure: 250-period rolling standard deviation of the AAPL closing price]

To compute an expanding window mean, use the expanding operator instead of rolling. The expanding mean starts the time window from the beginning of the time series and increases the size of the window until it encompasses the whole series.

To illustrate expanding,look at the following example:

df.test

2020-05-05     0
2020-05-06     1
2020-05-07     2
2020-05-08     3
2020-05-09     4
2020-05-10     5
2020-05-11     6
2020-05-12     7
2020-05-13     8
2020-05-14     9
2020-05-15    10
2020-05-16    11
2020-05-17    12
2020-05-18    13
2020-05-19    14
2020-05-20    15
2020-05-21    16
2020-05-22    17
2020-05-23    18
2020-05-24    19
Freq: D, Name: test, dtype: int32

df.test.expanding().mean()

2020-05-05    0.0
2020-05-06    0.5
2020-05-07    1.0
2020-05-08    1.5
2020-05-09    2.0
2020-05-10    2.5
2020-05-11    3.0
2020-05-12    3.5
2020-05-13    4.0
2020-05-14    4.5
2020-05-15    5.0
2020-05-16    5.5
2020-05-17    6.0
2020-05-18    6.5
2020-05-19    7.0
2020-05-20    7.5
2020-05-21    8.0
2020-05-22    8.5
2020-05-23    9.0
2020-05-24    9.5
Freq: D, Name: test, dtype: float64

Compared with rolling, the window of expanding is variable: it grows with every observation, much like cumsum, whereas the window of rolling has a fixed size.
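The cumsum analogy can be made exact: the expanding mean at position i is the cumulative sum up to i divided by the number of observations seen so far. A small sketch on an integer-indexed series (s and running_mean are illustrative names):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(20))

expanding_mean = s.expanding().mean()

# Cumulative sum divided by the running count 1, 2, 3, ...
running_mean = s.cumsum() / np.arange(1, len(s) + 1)

print(np.allclose(expanding_mean, running_mean))  # True
```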

expanding_mean=appl_std250.expanding().mean()

expanding_mean

2003-01-02          NaN
2003-01-03          NaN
2003-01-06          NaN
2003-01-07          NaN
2003-01-08          NaN
                ...    
2011-10-10    18.521201
2011-10-11    18.524272
2011-10-12    18.527385
2011-10-13    18.530554
2011-10-14    18.533823
Freq: B, Name: AAPL, Length: 2292, dtype: float64

close_px.rolling(60).mean().plot(logy=True)

[Figure: 60-period moving averages of AAPL, MSFT, and XOM on a log y-axis]

The rolling function also accepts a string indicating a fixed-size time offset rather than a set number of periods. Using this notation can be useful for irregular time series. These are the same strings that can be passed to resample.

ser=pd.Series(np.arange(6),pd.date_range('2020-05-05',periods=6));ser

2020-05-05    0
2020-05-06    1
2020-05-07    2
2020-05-08    3
2020-05-09    4
2020-05-10    5
Freq: D, dtype: int32

ser.rolling('2D').mean()

2020-05-05    0.0
2020-05-06    0.5
2020-05-07    1.5
2020-05-08    2.5
2020-05-09    3.5
2020-05-10    4.5
Freq: D, dtype: float64

ser.rolling('D').mean()

2020-05-05    0.0
2020-05-06    1.0
2020-05-07    2.0
2020-05-08    3.0
2020-05-09    4.0
2020-05-10    5.0
Freq: D, dtype: float64

close_px.rolling('20D').mean()

	AAPL	MSFT	XOM
2003-01-02	7.400000	21.110000	29.220000
2003-01-03	7.425000	21.125000	29.230000
2003-01-06	7.433333	21.256667	29.473333
2003-01-07	7.432500	21.425000	29.342500
2003-01-08	7.402000	21.402000	29.240000
...	...	...	...
2011-10-10	389.351429	25.602143	72.527857
2011-10-11	388.505000	25.674286	72.835000
2011-10-12	388.531429	25.810000	73.400714
2011-10-13	388.826429	25.961429	73.905000
2011-10-14	391.038000	26.048667	74.185333

2292 rows × 3 columns
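The difference between a count-based and a time-based window only shows up when the index has gaps. The sketch below uses a made-up series with two missing days to contrast rolling(2) with rolling('2D'); by_count and by_time are illustrative names.

```python
import pandas as pd

# Irregular index: 2020-05-07 and 2020-05-08 are missing.
idx = pd.to_datetime(['2020-05-05', '2020-05-06', '2020-05-09', '2020-05-10'])
ser = pd.Series([1, 2, 3, 4], index=idx)

# Always the last two rows, regardless of how far apart they are in time.
by_count = ser.rolling(2).sum()

# Only rows whose timestamp lies within the last two days (min_periods is 1).
by_time = ser.rolling('2D').sum()

print(by_count.tolist())  # [nan, 3.0, 5.0, 7.0]
print(by_time.tolist())   # [1.0, 3.0, 3.0, 7.0]
```

On 2020-05-09 the count-based window still reaches back to 2020-05-06, while the time-based window contains only the current observation.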

Exponentially weighted functions

An alternative to using a static window size with equally weighted observations is to specify a constant decay factor to give more weight to more recent observations. There are a couple of ways to specify the decay factor. A popular one is using a span, which makes the result comparable to a simple moving window function with window size equal to the span.

pandas has the ewm operator to go along with rolling and expanding. Here is an example comparing a 30-day moving average of Apple's stock price with an EW moving average with span=30:

appl_px=close_px.AAPL['2006':'2007']

ma30=appl_px.rolling(30,min_periods=20).mean()

ewma30=appl_px.ewm(span=30).mean()

ma30.plot(label='simple MA')

[Figure: 30-day simple moving average of the AAPL price, 2006-2007]

ewma30.plot(label='EW MA')

[Figure: exponentially weighted moving average (span=30) of the AAPL price, 2006-2007]
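To show what span means numerically, the sketch below converts a span into the smoothing factor alpha = 2 / (span + 1) and reproduces ewm by hand. It uses adjust=False, which gives the plain recursive form; the default adjust=True used above weights the first observations differently, so treat this as an illustration of the decay rather than a replica of the plot.

```python
import numpy as np
import pandas as pd

span = 3
alpha = 2 / (span + 1)  # smoothing factor implied by the span

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

ew = s.ewm(span=span, adjust=False).mean()

# Recursive definition: y[0] = x[0], y[t] = (1 - alpha) * y[t-1] + alpha * x[t]
manual = [s.iloc[0]]
for x in s.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * x)

print(np.allclose(ew.values, manual))  # True
```

A larger span gives a smaller alpha, so older observations decay more slowly and the curve is smoother.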

User-defined moving window functions

The apply method on rolling and related methods provides a means to apply an array function of your own devising over a moving window. The only requirement is that the function produce a single value (a reduction) from each piece of the array. For example, while we can compute sample quantiles using rolling(...).quantile(q), we might be interested in the percentile rank of a particular value over the sample.

from scipy.stats import percentileofscore

score_at_2percent=lambda x:percentileofscore(x,0.02)

results=close_px.pct_change()

result=results.AAPL.rolling(250).apply(score_at_2percent,raw=True) # raw=True passes each window as an ndarray and avoids the FutureWarning that older pandas versions emit when raw is not specified

result.plot()

[Figure: percentile rank of a 2% daily return for AAPL over a 250-period rolling window]
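If scipy is not available, a comparable reduction can be written with NumPy alone. The sketch below runs on synthetic returns (made up with a fixed seed, not the stock data) and uses a hypothetical helper, pct_rank_of_2percent, that returns the percentage of values in the window that are less than or equal to 0.02. That matches the 'weak' definition of percentileofscore, not its default 'rank' definition, so the numbers can differ slightly from the example above.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns, for illustration only.
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 0.02, size=200))

def pct_rank_of_2percent(window):
    """Percentage of values in the window that are <= 0.02."""
    return (window <= 0.02).mean() * 100

# raw=True hands each window to the function as a NumPy array.
rank = returns.rolling(50).apply(pct_rank_of_2percent, raw=True)

print(rank.dropna().head())
```

The function reduces each 50-value window to a single number, which is the only requirement apply places on it.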


Original article: https://www.cnblogs.com/johnyang/p/12830402.html