zoukankan      html  css  js  c++  java
  • 时间序列学习笔记4

    6. 重采样及频率转换

    重采样(resample)表示将时间序列的频率进行转换的过程。可以分为降采样和升采样等。

    pandas对象都有一个resample方法,可以进行频率转换。

    In [5]: rng = pd.date_range('1/1/2000', periods=100, freq='D')
    
    In [6]: ts = Series(np.random.randn(len(rng)), index=rng)
    # 聚合后的值如何处理,使用mean(),默认即为mean,也可以使用sum,min等。
    In [8]: ts.resample('M').mean()
    Out[8]:
    2000-01-31   -0.128802
    2000-02-29    0.179255
    2000-03-31    0.055778
    2000-04-30   -0.736071
    Freq: M, dtype: float64
    
    In [9]: ts.resample('M', kind='period').mean()
    Out[9]:
    2000-01   -0.128802
    2000-02    0.179255
    2000-03    0.055778
    2000-04   -0.736071
    Freq: M, dtype: float64
    

    6.1 降采样

    # 12个每分钟 的采样
    In [10]: rng = pd.date_range('1/1/2017', periods=12, freq='T')
    
    In [11]: ts = Series(np.arange(12), index=rng)
    
    In [12]: ts
    Out[12]:
    2017-01-01 00:00:00     0
    2017-01-01 00:01:00     1
    2017-01-01 00:02:00     2
    ...
    2017-01-01 00:08:00     8
    2017-01-01 00:09:00     9
    2017-01-01 00:10:00    10
    2017-01-01 00:11:00    11
    Freq: T, dtype: int32
    
    # 每隔五分钟采用,并将五分钟内的值求和,赋值到新的Series中。
    # 默认 [0,4),前闭后开
    In [14]: ts.resample('5min').sum()  
    Out[14]:
    2017-01-01 00:00:00    10
    2017-01-01 00:05:00    35
    2017-01-01 00:10:00    21
    Freq: 5T, dtype: int32
    
    # 默认 closed就是left,
    In [15]: ts.resample('5min', closed='left').sum()
    Out[15]:
    2017-01-01 00:00:00    10
    2017-01-01 00:05:00    35
    2017-01-01 00:10:00    21
    Freq: 5T, dtype: int32
    
    # 调整到右闭左开后,但是时间取值还是left
    In [16]: ts.resample('5min', closed='right').sum()
    Out[16]:
    2016-12-31 23:55:00     0
    2017-01-01 00:00:00    15
    2017-01-01 00:05:00    40
    2017-01-01 00:10:00    11
    Freq: 5T, dtype: int32
    
    # 时间取值也为left,默认
    In [17]: ts.resample('5min', closed='left', label='left').sum()
    Out[17]:
    2017-01-01 00:00:00    10
    2017-01-01 00:05:00    35
    2017-01-01 00:10:00    21
    Freq: 5T, dtype: int32
    

    还可以调整offset

    # 向前调整1秒
    In [18]: ts.resample('5T', loffset='1s').sum()
    Out[18]:
    2017-01-01 00:00:01    10
    2017-01-01 00:05:01    35
    2017-01-01 00:10:01    21
    Freq: 5T, dtype: int32
    

    OHLC重采样

    金融领域有一种ohlc重采样方式,即开盘、收盘、最大值和最小值。

    In [19]: ts.resample('5min').ohlc()
    Out[19]:
                         open  high  low  close
    2017-01-01 00:00:00     0     4    0      4
    2017-01-01 00:05:00     5     9    5      9
    2017-01-01 00:10:00    10    11   10     11
    

    利用groupby进行重采样

    In [20]: rng = pd.date_range('1/1/2017', periods=100, freq='D')
    
    In [21]: ts = Series(np.arange(100), index=rng)
    
    In [22]: ts.groupby(lambda x: x.month).mean()
    Out[22]:
    1    15.0
    2    44.5
    3    74.0
    4    94.5
    dtype: float64
    
    In [23]: rng[0]
    Out[23]: Timestamp('2017-01-01 00:00:00', offset='D')
    
    In [24]: rng[0].month
    Out[24]: 1
    
    In [25]: ts.groupby(lambda x: x.weekday).mean()
    Out[25]:
    0    50.0
    1    47.5
    2    48.5
    3    49.5
    4    50.5
    5    51.5
    6    49.0
    dtype: float64
    

    6.2 升采样和插值

    低频率到高频率的时候就会有缺失值,因此需要进行插值操作。

    In [26]: frame = DataFrame(np.random.randn(2,4), index=pd.date_range('1/1/2017'
        ...: , periods=2, freq='W-WED'), columns=['Colorda','Texas','NewYork','Ohio
        ...: '])
    
    In [27]: frame
    Out[27]:
                 Colorda     Texas   NewYork      Ohio
    2017-01-04  1.666793 -0.478740 -0.544072  1.934226
    2017-01-11 -0.407898  1.072648  1.079074 -2.922704
    
    In [28]: df_daily = frame.resample('D')
    
    In [30]: df_daily = frame.resample('D').mean()
    
    In [31]: df_daily
    Out[31]:
                 Colorda     Texas   NewYork      Ohio
    2017-01-04  1.666793 -0.478740 -0.544072  1.934226
    2017-01-05       NaN       NaN       NaN       NaN
    2017-01-06       NaN       NaN       NaN       NaN
    2017-01-07       NaN       NaN       NaN       NaN
    2017-01-08       NaN       NaN       NaN       NaN
    2017-01-09       NaN       NaN       NaN       NaN
    2017-01-10       NaN       NaN       NaN       NaN
    2017-01-11 -0.407898  1.072648  1.079074 -2.922704
    
    
    In [33]: frame.resample('D', fill_method='ffill')
    C:UsersyangflAnaconda3Scriptsipython-script.py:1: FutureWarning: fill_metho
    d is deprecated to .resample()
    the new syntax is .resample(...).ffill()
      if __name__ == '__main__':
    Out[33]:
                 Colorda     Texas   NewYork      Ohio
    2017-01-04  1.666793 -0.478740 -0.544072  1.934226
    2017-01-05  1.666793 -0.478740 -0.544072  1.934226
    2017-01-06  1.666793 -0.478740 -0.544072  1.934226
    2017-01-07  1.666793 -0.478740 -0.544072  1.934226
    2017-01-08  1.666793 -0.478740 -0.544072  1.934226
    2017-01-09  1.666793 -0.478740 -0.544072  1.934226
    2017-01-10  1.666793 -0.478740 -0.544072  1.934226
    2017-01-11 -0.407898  1.072648  1.079074 -2.922704
    
    In [34]: frame.resample('D', fill_method='ffill', limit=2)
    C:UsersyangflAnaconda3Scriptsipython-script.py:1: FutureWarning: fill_metho
    d is deprecated to .resample()
    the new syntax is .resample(...).ffill(limit=2)
      if __name__ == '__main__':
    Out[34]:
                 Colorda     Texas   NewYork      Ohio
    2017-01-04  1.666793 -0.478740 -0.544072  1.934226
    2017-01-05  1.666793 -0.478740 -0.544072  1.934226
    2017-01-06  1.666793 -0.478740 -0.544072  1.934226
    2017-01-07       NaN       NaN       NaN       NaN
    2017-01-08       NaN       NaN       NaN       NaN
    2017-01-09       NaN       NaN       NaN       NaN
    2017-01-10       NaN       NaN       NaN       NaN
    2017-01-11 -0.407898  1.072648  1.079074 -2.922704
    
    In [35]: frame.resample('W-THU', fill_method='ffill')
    C:UsersyangflAnaconda3Scriptsipython-script.py:1: FutureWarning: fill_metho
    d is deprecated to .resample()
    the new syntax is .resample(...).ffill()
      if __name__ == '__main__':
    Out[35]:
                 Colorda     Texas   NewYork      Ohio
    2017-01-05  1.666793 -0.478740 -0.544072  1.934226
    2017-01-12 -0.407898  1.072648  1.079074 -2.922704
    
    In [38]: frame.resample('W-THU').ffill()
    Out[38]:
                 Colorda     Texas   NewYork      Ohio
    2017-01-05  1.666793 -0.478740 -0.544072  1.934226
    2017-01-12 -0.407898  1.072648  1.079074 -2.922704
    

    6.3 通过时期(period)进行重采样

    # 创建一个每月随机数据,两年
    In [41]: frame = DataFrame(np.random.randn(24,4), index=pd.date_range('1-2017',
        ...: '1-2019', freq='M'), columns=['Colorda','Texas','NewYork','Ohio'])
    
    # 每年平均值进行重采样
    In [42]: a_frame = frame.resample('A-DEC').mean()
    
    In [43]: a_frame
    Out[43]:
                 Colorda     Texas   NewYork      Ohio
    2017-12-31 -0.441948 -0.040711  0.036633 -0.328769
    2018-12-31 -0.121778  0.181043 -0.004376  0.085500
    
    # 按季度进行采用
    In [45]: a_frame.resample('Q-DEC').ffill()
    Out[45]:
                 Colorda     Texas   NewYork      Ohio
    2017-12-31 -0.441948 -0.040711  0.036633 -0.328769
    2018-03-31 -0.441948 -0.040711  0.036633 -0.328769
    2018-06-30 -0.441948 -0.040711  0.036633 -0.328769
    2018-09-30 -0.441948 -0.040711  0.036633 -0.328769
    2018-12-31 -0.121778  0.181043 -0.004376  0.085500
    
    In [49]: frame.resample('Q-DEC').mean()
    Out[49]:
                 Colorda     Texas   NewYork      Ohio
    2017-03-31 -0.445315  0.488191 -0.543567 -0.459284
    2017-06-30 -0.157438 -0.680145  0.295301 -0.118013
    2017-09-30 -0.151736  0.092512  0.684201 -0.035097
    2017-12-31 -1.013302 -0.063404 -0.289404 -0.702681
    2018-03-31  0.157538 -0.175134 -0.548305  0.609768
    2018-06-30 -0.231697 -0.094108  0.224245 -0.151958
    2018-09-30 -0.614219  0.308801 -0.205952  0.154302
    2018-12-31  0.201266  0.684613  0.512506 -0.270111
    

    7. 时间序列绘图

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from pandas import Series,DataFrame
    
    frame = DataFrame(np.random.randn(20,3),
                      index = pd.date_range('1/1/2017', periods=20, freq='M'),
                    columns=['randn1','randn2','randn3']
                    )
    frame.plot()
    

    8. 移动窗口函数

    待续。。。

    9. 性能和内存使用方面的注意事项

    In [50]: rng = pd.date_range('1/1/2017', periods=10000000, freq='1s')
    
    In [51]: ts = Series(np.random.randn(len(rng)), index=rng)
    
    In [52]: %timeit ts.resample('15s').ohlc()
    1 loop, best of 3: 222 ms per loop
    
    In [53]: %timeit ts.resample('15min').ohlc()
    10 loops, best of 3: 152 ms per loop
    

    貌似现在还有所下降。

  • 相关阅读:
    Apriori算法--关联规则挖掘
    分布式系统阅读笔记(十九)-----移动计算和无处不在的计算
    分布式系统阅读笔记(十九)-----移动计算和无处不在的计算
    分布式系统阅读笔记(十九)-----移动计算和无处不在的计算
    分布式系统阅读笔记(二十)-----分布式多媒体系统
    分布式系统阅读笔记(二十)-----分布式多媒体系统
    分布式系统阅读笔记(二十)-----分布式多媒体系统
    FP-Tree频繁模式树算法
    hdu Load Balancing
    hdu 3348 coins
  • 原文地址:https://www.cnblogs.com/felo/p/6426429.html
Copyright © 2011-2022 走看看