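6. Resampling and Frequency Conversion
Resampling is the process of converting a time series from one frequency to another. Aggregating higher-frequency data to a lower frequency is called downsampling; converting lower-frequency data to a higher frequency is called upsampling.
pandas objects have a resample method that performs this frequency conversion.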
In [5]: rng = pd.date_range('1/1/2000', periods=100, freq='D')
In [6]: ts = Series(np.random.randn(len(rng)), index=rng)
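# Choose how the values in each bin are aggregated. mean() is used here (older pandas versions applied mean by default); sum(), min(), etc. also work.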
In [8]: ts.resample('M').mean()
Out[8]:
2000-01-31 -0.128802
2000-02-29 0.179255
2000-03-31 0.055778
2000-04-30 -0.736071
Freq: M, dtype: float64
In [9]: ts.resample('M', kind='period').mean()
Out[9]:
2000-01 -0.128802
2000-02 0.179255
2000-03 0.055778
2000-04 -0.736071
Freq: M, dtype: float64
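As a small sketch of choosing the aggregation explicitly, the example below downsamples a deterministic daily series into 7-day bins and applies several aggregations at once. The data values and the '7D' frequency are chosen for illustration only and do not come from the session above.

```python
import numpy as np
import pandas as pd

# 14 daily observations with known values 0..13 (illustrative data).
rng = pd.date_range('2000-01-01', periods=14, freq='D')
ts = pd.Series(np.arange(14), index=rng)

# resample() defines the bins; the aggregation is chosen afterwards.
# agg() with a list returns one column per aggregation.
weekly = ts.resample('7D').agg(['mean', 'sum', 'min'])
print(weekly)
```

Using fixed values instead of random numbers makes it easy to check each bin by hand.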
6.1 Downsampling
# 12 samples at one-minute intervals
In [10]: rng = pd.date_range('1/1/2017', periods=12, freq='T')
In [11]: ts = Series(np.arange(12), index=rng)
In [12]: ts
Out[12]:
2017-01-01 00:00:00 0
2017-01-01 00:01:00 1
2017-01-01 00:02:00 2
...
2017-01-01 00:08:00 8
2017-01-01 00:09:00 9
2017-01-01 00:10:00 10
2017-01-01 00:11:00 11
Freq: T, dtype: int32
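# Resample into 5-minute bins and sum the values in each bin; the result is a new Series.
# By default each bin is left-closed and right-open: the first bin is [00:00, 00:05), i.e. minutes 0 through 4.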
In [14]: ts.resample('5min').sum()
Out[14]:
2017-01-01 00:00:00 10
2017-01-01 00:05:00 35
2017-01-01 00:10:00 21
Freq: 5T, dtype: int32
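# closed='left' is the default for this frequency (month-end, quarter-end, year-end and weekly frequencies default to 'right')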
In [15]: ts.resample('5min', closed='left').sum()
Out[15]:
2017-01-01 00:00:00 10
2017-01-01 00:05:00 35
2017-01-01 00:10:00 21
Freq: 5T, dtype: int32
# With closed='right' the bins become left-open and right-closed, but each bin is still labeled with its left edge
In [16]: ts.resample('5min', closed='right').sum()
Out[16]:
2016-12-31 23:55:00 0
2017-01-01 00:00:00 15
2017-01-01 00:05:00 40
2017-01-01 00:10:00 11
Freq: 5T, dtype: int32
# label='left' is also the default, so the result is the same as In [15]
In [17]: ts.resample('5min', closed='left', label='left').sum()
Out[17]:
2017-01-01 00:00:00 10
2017-01-01 00:05:00 35
2017-01-01 00:10:00 21
Freq: 5T, dtype: int32
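The effect of closed and label can be compared side by side. This sketch rebuilds the same 12-point minute series and adds one combination not shown above, closed='right' with label='right', where each bin is labeled with its right edge.

```python
import numpy as np
import pandas as pd

# Same data as above: values 0..11 at one-minute intervals.
rng = pd.date_range('2017-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Bins [00:00, 00:05), [00:05, 00:10), ... labeled by their left edge.
left = ts.resample('5min', closed='left', label='left').sum()

# Bins (23:55, 00:00], (00:00, 00:05], ... still labeled by their left edge.
right = ts.resample('5min', closed='right', label='left').sum()

# Same bins as `right`, but labeled by their right edge.
right_labeled = ts.resample('5min', closed='right', label='right').sum()

print(left, right, right_labeled, sep='\n\n')
```

Only the labels differ between the last two results; the sums are identical because the bin boundaries are the same.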
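The result labels can also be shifted by an offset.
# Shift the labels 1 second later (note: loffset is deprecated in later pandas releases)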
In [18]: ts.resample('5T', loffset='1s').sum()
Out[18]:
2017-01-01 00:00:01 10
2017-01-01 00:05:01 35
2017-01-01 00:10:01 21
Freq: 5T, dtype: int32
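Since loffset is not available in recent pandas releases, one way to get the same effect is to resample first and then shift the resulting index. This is a minimal sketch of that approach, not the only option.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2017-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

result = ts.resample('5min').sum()
# Move every label 1 second later, mimicking loffset='1s'.
result.index = result.index + pd.Timedelta(seconds=1)
print(result)
```

The values are unchanged; only the timestamps used as labels move.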
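OHLC resampling
In finance, a common way to aggregate a time series is OHLC: the open (first), high (maximum), low (minimum), and close (last) value of each bin.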
In [19]: ts.resample('5min').ohlc()
Out[19]:
open high low close
2017-01-01 00:00:00 0 4 0 4
2017-01-01 00:05:00 5 9 5 9
2017-01-01 00:10:00 10 11 10 11
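To make the meaning of the four columns concrete, this sketch computes the same result by hand with first, max, min and last, and compares it with ohlc(). The column renaming is only there to line the two results up.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2017-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

ohlc = ts.resample('5min').ohlc()

# The same four statistics, spelled out as separate aggregations.
manual = ts.resample('5min').agg(['first', 'max', 'min', 'last'])
manual.columns = ['open', 'high', 'low', 'close']

print((ohlc.values == manual.values).all())
```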
Resampling with groupby
In [20]: rng = pd.date_range('1/1/2017', periods=100, freq='D')
In [21]: ts = Series(np.arange(100), index=rng)
In [22]: ts.groupby(lambda x: x.month).mean()
Out[22]:
1 15.0
2 44.5
3 74.0
4 94.5
dtype: float64
In [23]: rng[0]
Out[23]: Timestamp('2017-01-01 00:00:00', offset='D')
In [24]: rng[0].month
Out[24]: 1
In [25]: ts.groupby(lambda x: x.weekday()).mean()
Out[25]:
0 50.0
1 47.5
2 48.5
3 49.5
4 50.5
5 51.5
6 49.0
dtype: float64
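The same groupings can be written without a lambda by passing the index attributes directly, which avoids calling a Python function per timestamp. Note that grouping by month number pools the same calendar month across different years, whereas resample keeps each month of each year separate; with only 100 days of data the difference does not show here.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2017-01-01', periods=100, freq='D')
ts = pd.Series(np.arange(100), index=rng)

# Group by calendar month (1..12) and by day of week (Monday=0 .. Sunday=6).
by_month = ts.groupby(ts.index.month).mean()
by_weekday = ts.groupby(ts.index.weekday).mean()

print(by_month)
print(by_weekday)
```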
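6.2 Upsampling and Interpolation
Converting from a lower frequency to a higher one introduces missing values, so they need to be filled or interpolated.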
In [26]: frame = DataFrame(np.random.randn(2,4), index=pd.date_range('1/1/2017',
...: periods=2, freq='W-WED'), columns=['Colorda','Texas','NewYork','Ohio'])
In [27]: frame
Out[27]:
Colorda Texas NewYork Ohio
2017-01-04 1.666793 -0.478740 -0.544072 1.934226
2017-01-11 -0.407898 1.072648 1.079074 -2.922704
In [28]: df_daily = frame.resample('D')
In [30]: df_daily = frame.resample('D').mean()
In [31]: df_daily
Out[31]:
Colorda Texas NewYork Ohio
2017-01-04 1.666793 -0.478740 -0.544072 1.934226
2017-01-05 NaN NaN NaN NaN
2017-01-06 NaN NaN NaN NaN
2017-01-07 NaN NaN NaN NaN
2017-01-08 NaN NaN NaN NaN
2017-01-09 NaN NaN NaN NaN
2017-01-10 NaN NaN NaN NaN
2017-01-11 -0.407898 1.072648 1.079074 -2.922704
In [33]: frame.resample('D', fill_method='ffill')
C:\Users\yangfl\Anaconda3\Scripts\ipython-script.py:1: FutureWarning: fill_method is deprecated to .resample()
the new syntax is .resample(...).ffill()
if __name__ == '__main__':
Out[33]:
Colorda Texas NewYork Ohio
2017-01-04 1.666793 -0.478740 -0.544072 1.934226
2017-01-05 1.666793 -0.478740 -0.544072 1.934226
2017-01-06 1.666793 -0.478740 -0.544072 1.934226
2017-01-07 1.666793 -0.478740 -0.544072 1.934226
2017-01-08 1.666793 -0.478740 -0.544072 1.934226
2017-01-09 1.666793 -0.478740 -0.544072 1.934226
2017-01-10 1.666793 -0.478740 -0.544072 1.934226
2017-01-11 -0.407898 1.072648 1.079074 -2.922704
In [34]: frame.resample('D', fill_method='ffill', limit=2)
C:\Users\yangfl\Anaconda3\Scripts\ipython-script.py:1: FutureWarning: fill_method is deprecated to .resample()
the new syntax is .resample(...).ffill(limit=2)
if __name__ == '__main__':
Out[34]:
Colorda Texas NewYork Ohio
2017-01-04 1.666793 -0.478740 -0.544072 1.934226
2017-01-05 1.666793 -0.478740 -0.544072 1.934226
2017-01-06 1.666793 -0.478740 -0.544072 1.934226
2017-01-07 NaN NaN NaN NaN
2017-01-08 NaN NaN NaN NaN
2017-01-09 NaN NaN NaN NaN
2017-01-10 NaN NaN NaN NaN
2017-01-11 -0.407898 1.072648 1.079074 -2.922704
In [35]: frame.resample('W-THU', fill_method='ffill')
C:\Users\yangfl\Anaconda3\Scripts\ipython-script.py:1: FutureWarning: fill_method is deprecated to .resample()
the new syntax is .resample(...).ffill()
if __name__ == '__main__':
Out[35]:
Colorda Texas NewYork Ohio
2017-01-05 1.666793 -0.478740 -0.544072 1.934226
2017-01-12 -0.407898 1.072648 1.079074 -2.922704
In [38]: frame.resample('W-THU').ffill()
Out[38]:
Colorda Texas NewYork Ohio
2017-01-05 1.666793 -0.478740 -0.544072 1.934226
2017-01-12 -0.407898 1.072648 1.079074 -2.922704
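The three upsampling options, leaving gaps, forward-filling, and interpolating, can be summarized with the method-chaining syntax the warning above recommends. This is a sketch with a single made-up column x whose values (0.0 and 7.0) are chosen so the interpolated result is easy to verify.

```python
import numpy as np
import pandas as pd

# Two weekly observations on Wednesdays (illustrative values).
idx = pd.date_range('2017-01-04', periods=2, freq='W-WED')
frame = pd.DataFrame({'x': [0.0, 7.0]}, index=idx)

daily = frame.resample('D').asfreq()          # new rows are left as NaN
filled = frame.resample('D').ffill(limit=2)   # carry a value forward at most 2 rows
linear = frame.resample('D').interpolate()    # linear interpolation between the two points

print(daily, filled, linear, sep='\n\n')
```

Forward-filling repeats the last known value, while interpolation estimates intermediate values; which one is appropriate depends on what the data represents.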
6.3 Resampling with Periods
# Create two years of random monthly data
In [41]: frame = DataFrame(np.random.randn(24,4), index=pd.date_range('1-2017',
...: '1-2019', freq='M'), columns=['Colorda','Texas','NewYork','Ohio'])
# Downsample to annual frequency, taking the mean of each year
In [42]: a_frame = frame.resample('A-DEC').mean()
In [43]: a_frame
Out[43]:
Colorda Texas NewYork Ohio
2017-12-31 -0.441948 -0.040711 0.036633 -0.328769
2018-12-31 -0.121778 0.181043 -0.004376 0.085500
# Upsample the annual data to quarterly frequency, forward-filling the values
In [45]: a_frame.resample('Q-DEC').ffill()
Out[45]:
Colorda Texas NewYork Ohio
2017-12-31 -0.441948 -0.040711 0.036633 -0.328769
2018-03-31 -0.441948 -0.040711 0.036633 -0.328769
2018-06-30 -0.441948 -0.040711 0.036633 -0.328769
2018-09-30 -0.441948 -0.040711 0.036633 -0.328769
2018-12-31 -0.121778 0.181043 -0.004376 0.085500
In [49]: frame.resample('Q-DEC').mean()
Out[49]:
Colorda Texas NewYork Ohio
2017-03-31 -0.445315 0.488191 -0.543567 -0.459284
2017-06-30 -0.157438 -0.680145 0.295301 -0.118013
2017-09-30 -0.151736 0.092512 0.684201 -0.035097
2017-12-31 -1.013302 -0.063404 -0.289404 -0.702681
2018-03-31 0.157538 -0.175134 -0.548305 0.609768
2018-06-30 -0.231697 -0.094108 0.224245 -0.151958
2018-09-30 -0.614219 0.308801 -0.205952 0.154302
2018-12-31 0.201266 0.684613 0.512506 -0.270111
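The session above uses a timestamp index. For data that is actually indexed by periods, one possible sketch is to map each monthly period to its quarter with PeriodIndex.asfreq and aggregate with groupby. This deliberately sidesteps calling resample on a PeriodIndex, which may behave differently across pandas versions. The column name value and the data are illustrative.

```python
import numpy as np
import pandas as pd

# 24 monthly periods with known values 0.0..23.0 (illustrative data).
idx = pd.period_range('2017-01', periods=24, freq='M')
frame = pd.DataFrame({'value': np.arange(24.0)}, index=idx)

# Map each month to the quarter that contains it, then average per quarter.
quarters = frame.index.asfreq('Q-DEC')
q_mean = frame.groupby(quarters).mean()

print(q_mean)
```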
7. Time Series Plotting
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series,DataFrame
frame = DataFrame(np.random.randn(20,3),
index = pd.date_range('1/1/2017', periods=20, freq='M'),
columns=['randn1','randn2','randn3']
)
frame.plot()
8. Moving Window Functions
To be continued...
9. Performance and Memory Usage Notes
In [50]: rng = pd.date_range('1/1/2017', periods=10000000, freq='1s')
In [51]: ts = Series(np.random.randn(len(rng)), index=rng)
In [52]: %timeit ts.resample('15s').ohlc()
1 loop, best of 3: 222 ms per loop
In [53]: %timeit ts.resample('15min').ohlc()
10 loops, best of 3: 152 ms per loop
In this run, the coarser 15-minute resample actually came out somewhat faster.
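Outside IPython, where %timeit is not available, a similar comparison can be sketched with the standard timeit module. The series here is much shorter (100,000 points, an arbitrary choice to keep the run quick), so the absolute timings are not comparable with the numbers above and will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # far fewer points than the 10,000,000 used above
rng = pd.date_range('2017-01-01', periods=n, freq='s')
ts = pd.Series(np.random.randn(n), index=rng)

bars = {}
for freq in ['15s', '15min']:
    # Average wall-clock time over 5 calls.
    seconds = timeit.timeit(lambda: ts.resample(freq).ohlc(), number=5) / 5
    bars[freq] = len(ts.resample(freq).ohlc())
    print(f'{freq}: {bars[freq]} bars, {seconds * 1000:.2f} ms per call')
```

Printing the number of output bars alongside the timing makes it clear that the coarser frequency produces far fewer rows from the same input.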