  • Time Series Study Notes 2

    2. Time series basics

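    # Assumes: from datetime import datetime; import numpy as np; import pandas as pd; from pandas import Series
    In [7]: dates = [(2011,1,1),(2011,2,3),(2011,2,4),(2011,4,23),(2011,4,22),(2011,4,1)]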
    
    In [8]: dates = [datetime(*x) for x in dates]
    
    In [14]: ts = Series(np.random.randn(6), index=dates)
    
    # Create a Series whose index is made of timestamps.
    In [15]: ts
    Out[15]:
    2011-01-01    3.627969
    2011-02-03    0.731217
    2011-02-04    1.178071
    2011-04-23   -2.085412
    2011-04-22   -0.093829
    2011-04-01   -0.157532
    dtype: float64
    
    In [16]: type(ts)
    Out[16]: pandas.core.series.Series
    
    In [17]: ts.index
    Out[17]:
    DatetimeIndex(['2011-01-01', '2011-02-03', '2011-02-04', '2011-04-23',
                   '2011-04-22', '2011-04-01'],
                  dtype='datetime64[ns]', freq=None)
    
    # As with any Series, two time series can be added; values are aligned on the timestamps
    In [19]: ts + ts[::2]
    Out[19]:
    2011-01-01    7.255939
    2011-02-03         NaN
    2011-02-04    2.356142
    2011-04-01         NaN
    2011-04-22   -0.187658
    2011-04-23         NaN
    dtype: float64
    
    # The index of a time series has dtype datetime64, with nanosecond resolution
    In [20]: ts.index.dtype
    Out[20]: dtype('<M8[ns]')
    
    In [21]: stamp = ts.index[0]
    
    In [22]: stamp
    Out[22]: Timestamp('2011-01-01 00:00:00')
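The session above can be condensed into a small script. This is a minimal sketch that uses fixed values instead of np.random.randn so the result is reproducible; the variable names index_is_datetime and total are my own, not from the notes.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Fixed values (an assumption for reproducibility; the notes use random data).
dates = [datetime(2011, 1, 1), datetime(2011, 2, 3), datetime(2011, 2, 4),
         datetime(2011, 4, 1), datetime(2011, 4, 22), datetime(2011, 4, 23)]
ts = pd.Series(np.arange(6, dtype=float), index=dates)

# A list of datetime objects is converted to a DatetimeIndex automatically.
index_is_datetime = isinstance(ts.index, pd.DatetimeIndex)

# Arithmetic aligns on the timestamps; a label missing on one side gives NaN.
total = ts + ts[::2]

# Each element taken from the index is a pandas Timestamp.
stamp = ts.index[0]

print(total)
```

Every second timestamp is absent from ts[::2], so three of the six results are NaN, just as in Out[19].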
    
    

    2.1 Indexing, selection, and subsetting

    Indexing

    # A Timestamp (datetime) taken from the index can be used as a label
    In [24]: stamp = ts.index[2]
    
    In [25]: ts[stamp]
    Out[25]: 1.1780707665960897
    
    # Strings in common date formats can also be used as labels.
    In [27]: ts['01/01/2011']
    Out[27]:
    2011-01-01    3.627969
    dtype: float64
    
    In [28]: ts['20110101']
    Out[28]:
    2011-01-01    3.627969
    dtype: float64
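The same lookups can be written with .loc, which makes it explicit that the key is a label. A minimal sketch with made-up values; by_stamp, by_string, and same_day are illustrative names.

```python
import numpy as np
import pandas as pd

index = pd.DatetimeIndex(["2011-01-01", "2011-02-03", "2011-02-04"])
ts = pd.Series([0.5, 1.5, 2.5], index=index)

# Look up by a Timestamp taken from the index itself.
stamp = ts.index[1]
by_stamp = ts.loc[stamp]

# Look up by a date string; pandas parses it into a Timestamp first.
by_string = ts.loc["2011-02-03"]

# Several common spellings parse to the same Timestamp.
same_day = (pd.Timestamp("01/01/2011") == pd.Timestamp("20110101")
            == pd.Timestamp("2011-01-01"))

print(by_stamp, by_string, same_day)
```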
    
    

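    Slicing

    # Slicing directly by date works on a Series; for a DataFrame, use .loc to select rows the same way.
    # pd.date_range creates a DatetimeIndex of regularly spaced dates
    In [29]: longer_ts = Series(np.random.randn(1000), index=pd.date_range('1/1/2017',periods=1000))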
    
    In [30]: longer_ts[:5]
    Out[30]:
    2017-01-01    0.311815
    2017-01-02   -0.424868
    2017-01-03    0.198069
    2017-01-04    1.011494
    2017-01-05   -0.312494
    Freq: D, dtype: float64
    
    In [31]: longer_ts[-5:]
    Out[31]:
    2019-09-23   -0.637869
    2019-09-24    0.721613
    2019-09-25   -0.914481
    2019-09-26    0.036966
    2019-09-27    0.677846
    Freq: D, dtype: float64
    
    # Select all data for February 2017
    In [32]: longer_ts['2017-2']
    Out[32]:
    2017-02-01    1.258390
    2017-02-02    0.606618
    2017-02-03    0.927122
    2017-02-04    0.761009
    ...
    2017-02-23   -1.039703
    2017-02-24    0.478075
    2017-02-25   -0.328411
    2017-02-26   -1.019641
    2017-02-27    0.186212
    2017-02-28   -1.466734
    Freq: D, dtype: float64
    
    # Data for a single day
    In [33]: longer_ts['2017-2-3']
    Out[33]: 0.92712152603736908
    
    # Data for a whole year
    In [34]: longer_ts['2017'][:5]
    Out[34]:
    2017-01-01    0.311815
    2017-01-02   -0.424868
    2017-01-03    0.198069
    2017-01-04    1.011494
    2017-01-05   -0.312494
    Freq: D, dtype: float64
    
    
    A Series can also be sliced with timestamps that do not appear in its index.
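A short sketch of both ideas: partial-string selection on a regular daily series, and range slicing with bounds that are not in the index. The irregular series sparse and the other variable names are assumptions made for illustration. Range slicing by date relies on the index being sorted.

```python
import numpy as np
import pandas as pd

longer_ts = pd.Series(np.arange(1000),
                      index=pd.date_range("2017-01-01", periods=1000))

# Partial strings select a whole month or a whole year.
feb = longer_ts.loc["2017-02"]
year_2018 = longer_ts.loc["2018"]

# The same selection works on the rows of a DataFrame through .loc.
frame = longer_ts.to_frame("value")
feb_rows = frame.loc["2017-02"]

# An irregular (but sorted) index: neither slice bound below exists in it.
sparse = pd.Series(
    np.arange(6),
    index=pd.DatetimeIndex(["2000-01-02", "2000-01-05", "2000-01-07",
                            "2000-01-08", "2000-01-10", "2000-01-12"]),
)
window = sparse.loc["2000-01-03":"2000-01-09"]  # both ends are inclusive

# truncate() drops everything after (or before) a given date.
truncated = sparse.truncate(after="2000-01-06")

print(window)
print(truncated)
```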
    
    

    2.2 Time series with duplicate indices

    In [35]: dates = pd.DatetimeIndex(['1/1/2000','1/2/2000','1/2/2000','1/2/2000','1/3/2000'])
    
    In [36]: dup_ts = Series(np.arange(5), index=dates)
    
    In [37]: dup_ts
    Out[37]:
    2000-01-01    0
    2000-01-02    1
    2000-01-02    2
    2000-01-02    3
    2000-01-03    4
    dtype: int64
    
    # Check whether the index labels are unique
    In [40]: dup_ts.index.is_unique
    Out[40]: False
    
    In [41]: dup_ts['1/2/2000']  # duplicated label: returns a Series
    Out[41]:
    2000-01-02    1
    2000-01-02    2
    2000-01-02    3
    dtype: int64
    
    In [42]: dup_ts['1/3/2000']  # unique label: returns a scalar
    Out[42]: 4
    
    
    
    # Aggregate the duplicated timestamps by grouping on the index (level=0)
    In [43]: grouped = dup_ts.groupby(level=0)
    
    In [44]: grouped.mean()
    Out[44]:
    2000-01-01    0
    2000-01-02    2
    2000-01-03    4
    dtype: int64
    
    In [45]: grouped.count()
    Out[45]:
    2000-01-01    1
    2000-01-02    3
    2000-01-03    1
    dtype: int64
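Two common ways to deal with duplicated timestamps are aggregating them, as above, or keeping only one row per timestamp. The sketch below shows both; keeping the first occurrence is my choice for illustration and is not something the notes prescribe.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])
dup_ts = pd.Series(np.arange(5), index=dates)

has_duplicates = not dup_ts.index.is_unique

# Option 1: aggregate all rows that share a timestamp.
mean_per_day = dup_ts.groupby(level=0).mean()

# Option 2: keep only the first row for each timestamp.
first_per_day = dup_ts[~dup_ts.index.duplicated(keep="first")]

print(mean_per_day)
print(first_per_day)
```

Recent pandas versions return the mean as float64 even for integer input, so the values may print as 0.0, 2.0, 4.0 rather than the integers shown in Out[44].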
    
    

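    3. Date ranges, frequencies, and shifting

    Time series in pandas are generally treated as irregular, with no fixed frequency. Analysis, however, often requires a fixed frequency,
    and pandas provides a set of tools for converting between the two.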

    resample

    In [49]: dates = pd.DatetimeIndex(['2000-01-02','2000-01-05','2000-01-07','2000-01-08','2000-01-10','2000-01-12'])
    
    In [50]: ts = Series(np.random.randn(6), index=dates)
    
    In [51]: ts
    Out[51]:
    2000-01-02    0.124049
    2000-01-05   -0.840846
    2000-01-07   -0.051655
    2000-01-08   -0.603824
    2000-01-10    0.467815
    2000-01-12   -0.201388
    dtype: float64
    
    In [52]: ts.resample('D')
    Out[52]: /Users/yangfeilong/anaconda/lib/python2.7/site-packages/IPython/utils/dir2.py:65: FutureWarning: .resample() is now a deferred operation
    use .resample(...).mean() instead of .resample(...)
      canary = getattr(obj, '_ipython_canary_method_should_not_exist_', None)
    DatetimeIndexResampler [freq=<Day>, axis=0, closed=left, label=left, convention=start, base=0]
    
    In [53]: ts.resample('D').mean()   # the missing dates are filled in with NaN
    Out[53]:
    2000-01-02    0.124049
    2000-01-03         NaN
    2000-01-04         NaN
    2000-01-05   -0.840846
    2000-01-06         NaN
    2000-01-07   -0.051655
    2000-01-08   -0.603824
    2000-01-09         NaN
    2000-01-10    0.467815
    2000-01-11         NaN
    2000-01-12   -0.201388
    Freq: D, dtype: float64
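As the warning in Out[52] says, resample() only sets up the grouping; an aggregation or fill method must follow. A minimal sketch with fixed values (an assumption, the notes use random data) that compares the mean, a forward fill, and asfreq:

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-02", "2000-01-05", "2000-01-07",
                          "2000-01-08", "2000-01-10", "2000-01-12"])
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=dates)

# Daily bins; days without observations become NaN.
daily = ts.resample("D").mean()

# Forward-fill instead: each gap takes the last observed value.
filled = ts.resample("D").ffill()

# asfreq() reindexes to the new frequency without aggregating.
same = ts.asfreq("D")

print(daily)
```

With at most one observation per day, the mean and asfreq give the same values here; they differ once several observations fall into one bin.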
    
    

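    3.1 Generating date ranges

    pandas.date_range generates a DatetimeIndex of a given length at a given frequency.

    In [54]: index = pd.date_range('4/1/2017','6/1/2017') # daily dates from start to end; the time defaults to 00:00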
    
    In [55]: index
    Out[55]:
    DatetimeIndex(['2017-04-01', '2017-04-02', '2017-04-03', '2017-04-04',
                   '2017-04-05', '2017-04-06', '2017-04-07', '2017-04-08',
                   '2017-04-09', '2017-04-10', '2017-04-11', '2017-04-12',
                   '2017-04-13', '2017-04-14', '2017-04-15', '2017-04-16',
                   '2017-04-17', '2017-04-18', '2017-04-19', '2017-04-20',
                   '2017-04-21', '2017-04-22', '2017-04-23', '2017-04-24',
                   '2017-04-25', '2017-04-26', '2017-04-27', '2017-04-28',
                   '2017-04-29', '2017-04-30', '2017-05-01', '2017-05-02',
                   '2017-05-03', '2017-05-04', '2017-05-05', '2017-05-06',
                   '2017-05-07', '2017-05-08', '2017-05-09', '2017-05-10',
                   '2017-05-11', '2017-05-12', '2017-05-13', '2017-05-14',
                   '2017-05-15', '2017-05-16', '2017-05-17', '2017-05-18',
                   '2017-05-19', '2017-05-20', '2017-05-21', '2017-05-22',
                   '2017-05-23', '2017-05-24', '2017-05-25', '2017-05-26',
                   '2017-05-27', '2017-05-28', '2017-05-29', '2017-05-30',
                   '2017-05-31', '2017-06-01'],
                  dtype='datetime64[ns]', freq='D')
    
    In [56]: pd.date_range(start='4/1/2017',periods=20)  # start date plus a number of periods
    Out[56]:
    DatetimeIndex(['2017-04-01', '2017-04-02', '2017-04-03', '2017-04-04',
                   '2017-04-05', '2017-04-06', '2017-04-07', '2017-04-08',
                   '2017-04-09', '2017-04-10', '2017-04-11', '2017-04-12',
                   '2017-04-13', '2017-04-14', '2017-04-15', '2017-04-16',
                   '2017-04-17', '2017-04-18', '2017-04-19', '2017-04-20'],
                  dtype='datetime64[ns]', freq='D')
    
    In [57]: pd.date_range(end='4/1/2017',periods=20)  # end date plus a number of periods
    Out[57]:
    DatetimeIndex(['2017-03-13', '2017-03-14', '2017-03-15', '2017-03-16',
                   '2017-03-17', '2017-03-18', '2017-03-19', '2017-03-20',
                   '2017-03-21', '2017-03-22', '2017-03-23', '2017-03-24',
                   '2017-03-25', '2017-03-26', '2017-03-27', '2017-03-28',
                   '2017-03-29', '2017-03-30', '2017-03-31', '2017-04-01'],
                  dtype='datetime64[ns]', freq='D')
    
    In [58]: pd.date_range('4/1/2017','6/1/2017',freq='BM')  # freq='BM': last business day of each month
    Out[58]: DatetimeIndex(['2017-04-28', '2017-05-31'], dtype='datetime64[ns]', freq='BM')
    
    In [59]: pd.date_range('5/3/2017 12:34:12',periods=5) # the time of day of the start timestamp is preserved
    Out[59]:
    DatetimeIndex(['2017-05-03 12:34:12', '2017-05-04 12:34:12',
                   '2017-05-05 12:34:12', '2017-05-06 12:34:12',
                   '2017-05-07 12:34:12'],
                  dtype='datetime64[ns]', freq='D')
    
    In [60]: pd.date_range('5/3/2017 12:34:12',periods=5, normalize=True)  # normalize=True resets the times to midnight
    Out[60]:
    DatetimeIndex(['2017-05-03', '2017-05-04', '2017-05-05', '2017-05-06',
                   '2017-05-07'],
                  dtype='datetime64[ns]', freq='D')
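The date_range variants above, gathered into one script. One deliberate difference: the business-month-end frequency is passed as a BMonthEnd offset object rather than the 'BM' string, because recent pandas releases (2.2 and later, as far as I know) warn about that alias and prefer 'BME'. The variable names are illustrative.

```python
import pandas as pd

# Start and end: daily by default.
full = pd.date_range("2017-04-01", "2017-06-01")

# Start or end plus a number of periods.
by_start = pd.date_range(start="2017-04-01", periods=20)
by_end = pd.date_range(end="2017-04-01", periods=20)

# Last business day of each month, using an offset object as the frequency.
business_month_ends = pd.date_range("2017-04-01", "2017-06-01",
                                    freq=pd.offsets.BMonthEnd())

# The time of day is kept unless normalize=True is given.
with_time = pd.date_range("2017-05-03 12:34:12", periods=5)
normalized = pd.date_range("2017-05-03 12:34:12", periods=5, normalize=True)

print(business_month_ends)
```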
    
    
    

    3.2 Frequencies and date offsets

    In [61]: # The date offsets used as frequencies can be created explicitly
    
    In [62]: from pandas.tseries.offsets import Hour
    
    In [63]: four_hours = Hour(4)
    
    In [64]: four_hours
    Out[64]: <4 * Hours>
    
    In [65]: # A string alias such as '4H' can be passed directly instead
    
    In [66]: pd.date_range('1/1/2017', '1/3/2017 22:25',freq='4H')
    Out[66]:
    DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 04:00:00',
                   '2017-01-01 08:00:00', '2017-01-01 12:00:00',
                   '2017-01-01 16:00:00', '2017-01-01 20:00:00',
                   '2017-01-02 00:00:00', '2017-01-02 04:00:00',
                   '2017-01-02 08:00:00', '2017-01-02 12:00:00',
                   '2017-01-02 16:00:00', '2017-01-02 20:00:00',
                   '2017-01-03 00:00:00', '2017-01-03 04:00:00',
                   '2017-01-03 08:00:00', '2017-01-03 12:00:00',
                   '2017-01-03 16:00:00', '2017-01-03 20:00:00'],
                  dtype='datetime64[ns]', freq='4H')
    
    In [67]: pd.date_range('1/1/2017', '1/3/2017 22:25',freq=four_hours)
    Out[67]:
    DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 04:00:00',
                   '2017-01-01 08:00:00', '2017-01-01 12:00:00',
                   '2017-01-01 16:00:00', '2017-01-01 20:00:00',
                   '2017-01-02 00:00:00', '2017-01-02 04:00:00',
                   '2017-01-02 08:00:00', '2017-01-02 12:00:00',
                   '2017-01-02 16:00:00', '2017-01-02 20:00:00',
                   '2017-01-03 00:00:00', '2017-01-03 04:00:00',
                   '2017-01-03 08:00:00', '2017-01-03 12:00:00',
                   '2017-01-03 16:00:00', '2017-01-03 20:00:00'],
                  dtype='datetime64[ns]', freq='4H')
    
    
    In [68]: from pandas.tseries.offsets import Hour,Minute
    
    # Offsets can be added together to build an offset of any length
    In [69]: Hour(1) + Minute(30)
    Out[69]: <90 * Minutes>
    
    # The simpler string form works as well
    In [70]: pd.date_range('1/1/2017',periods=3, freq='1h30min')
    Out[70]:
    DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 01:30:00',
                   '2017-01-01 03:00:00'],
                  dtype='datetime64[ns]', freq='90T')
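A compact version of the offset examples. It passes Hour(4) as an object and uses the lowercase '1h30min' string, since the uppercase 'H' alias is, as far as I know, deprecated in recent pandas releases. The variable names are assumptions.

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# An offset object can be used anywhere a frequency is expected.
four_hours = Hour(4)
every_four_hours = pd.date_range("2017-01-01", "2017-01-03 22:25",
                                 freq=four_hours)

# Fixed-length offsets add up to a single offset.
ninety_minutes = Hour(1) + Minute(30)

# The combined string form produces the same spacing.
by_string = pd.date_range("2017-01-01", periods=3, freq="1h30min")
by_offset = pd.date_range("2017-01-01", periods=3, freq=ninety_minutes)

print(every_four_hours[-1], ninety_minutes)
```

The range stops at 20:00 on 3 January because the next step, 00:00 on 4 January, would pass the 22:25 end bound.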
    
    

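    Some offsets are not evenly spaced (month ends, for example), so pandas ships with a set of built-in date offsets for them, as in the following table: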
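A few of the built-in offsets can be tried directly by adding them to a timestamp. The selection below is my own, chosen for illustration; it is not the table from the original notes.

```python
import pandas as pd
from pandas.tseries.offsets import (BDay, BMonthEnd, MonthBegin, MonthEnd,
                                    QuarterEnd, Week, YearEnd)

now = pd.Timestamp("2017-02-18")  # a Saturday

# Adding an anchored offset moves to the next date that matches its rule.
shifted = {
    "BDay": now + BDay(),                    # next business day
    "Week(weekday=0)": now + Week(weekday=0),  # next Monday
    "MonthBegin": now + MonthBegin(),        # first day of next month
    "MonthEnd": now + MonthEnd(),            # last calendar day of the month
    "BMonthEnd": now + BMonthEnd(),          # last business day of the month
    "QuarterEnd": now + QuarterEnd(),        # quarters end in Mar/Jun/Sep/Dec by default
    "YearEnd": now + YearEnd(),              # 31 December
}

for name, value in shifted.items():
    print(f"{name:<16} {value.date()}")
```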

    3.3 Shifting (leading or lagging) data

    shift moves the data forward or backward along the time axis while leaving the index unchanged.

    In [71]: ts = Series(np.random.randn(4), index=pd.date_range('1/1/2017',periods=4, freq='M'))
    
    In [72]: ts
    Out[72]:
    2017-01-31   -0.080326
    2017-02-28    0.432715
    2017-03-31    1.094710
    2017-04-30   -1.024227
    Freq: M, dtype: float64
    
    
    In [73]: ts.shift(2)  # values move 2 periods later (lag); the first 2 become NaN
    Out[73]:
    2017-01-31         NaN
    2017-02-28         NaN
    2017-03-31   -0.080326
    2017-04-30    0.432715
    Freq: M, dtype: float64
    
    In [74]: ts.shift(-2)  # values move 2 periods earlier (lead); the last 2 become NaN
    Out[74]:
    2017-01-31    1.094710
    2017-02-28   -1.024227
    2017-03-31         NaN
    2017-04-30         NaN
    Freq: M, dtype: float64
    
    # Growth rate of each month relative to the previous month
    In [76]: ts/ts.shift(1) - 1
    Out[76]:
    2017-01-31         NaN
    2017-02-28   -6.386994
    2017-03-31    1.529866
    2017-04-30   -1.935615
    Freq: M, dtype: float64
    
    # With freq, the timestamps are shifted instead and the values stay in their rows
    In [78]: ts.shift(2, freq='M')
    Out[78]:
    2017-03-31   -0.080326
    2017-04-30    0.432715
    2017-05-31    1.094710
    2017-06-30   -1.024227
    Freq: M, dtype: float64
    
    
    # Other frequencies can be passed as well, which is more flexible
    
    In [79]: ts.shift(3, freq='D')
    Out[79]:
    2017-02-03   -0.080326
    2017-03-03    0.432715
    2017-04-03    1.094710
    2017-05-03   -1.024227
    dtype: float64
    
    In [80]: ts.shift(1, freq='3D')
    Out[80]:
    2017-02-03   -0.080326
    2017-03-03    0.432715
    2017-04-03    1.094710
    2017-05-03   -1.024227
    dtype: float64
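The difference between shifting the values and shifting the index is easier to see with round numbers. A minimal sketch under the assumption of a series that grows by 10% a month; the month-end index is built with a MonthEnd offset object rather than the 'M' alias, which recent pandas releases warn about as far as I know.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

index = pd.date_range("2017-01-01", periods=4, freq=MonthEnd())
ts = pd.Series([100.0, 110.0, 121.0, 133.1], index=index)

# Without freq: the values move, the index stays, and NaN fills the gap.
lagged = ts.shift(2)
led = ts.shift(-2)

# Month-over-month growth rate.
growth = ts / ts.shift(1) - 1

# With freq: the index moves and no value is lost.
moved = ts.shift(2, freq=MonthEnd())

# Three steps of one day equal one step of three days.
by_days = ts.shift(3, freq="D")
by_three_days = ts.shift(1, freq="3D")

print(growth)
print(moved)
```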
    
    
    

    Shifting dates with offsets

    # Day: shifts by days; a count can be passed
    # MonthEnd: shifts to the end of the month
    In [81]: from pandas.tseries.offsets import Day,MonthEnd
    
    In [82]: now = datetime(2017,2,18)
    
    In [83]: now + 3 * Day() # dates can be computed directly with + and -
    Out[83]: Timestamp('2017-02-21 00:00:00')
    
    In [84]: now + MonthEnd() # move to the end of the month
    Out[84]: Timestamp('2017-02-28 00:00:00')
    
    In [85]: now + MonthEnd(1) # same as MonthEnd(): the end of the current month; MonthEnd(2) gives the end of next month
    Out[85]: Timestamp('2017-02-28 00:00:00')
    
    In [86]: offset = MonthEnd()
    
    In [87]: offset.rollforward(now)  # roll forward to the end of this month
    Out[87]: Timestamp('2017-02-28 00:00:00')
    
    In [88]: offset.rollback(now)  # roll back to the end of the previous month
    Out[88]: Timestamp('2017-01-31 00:00:00')
    
    In [90]: ts = Series(np.random.randn(20),index=pd.date_range('2/18/2017',periods=20, freq='4d'))
    
    In [91]: ts.groupby(offset.rollforward).mean()  # roll every date forward to its month end, group, and take the mean
    Out[91]:
    2017-02-28   -0.536243
    2017-03-31   -0.373386
    2017-04-30    0.131691
    2017-05-31    1.775742
    dtype: float64
    
    In [92]: ts.resample('M',how='mean')  # resample is simpler; how= is deprecated (see the warning), use .resample('M').mean()
    /Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: how in .resample() is deprecated
    the new syntax is .resample(...).mean()
      #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
    Out[92]:
    2017-02-28   -0.536243
    2017-03-31   -0.373386
    2017-04-30    0.131691
    2017-05-31    1.775742
    Freq: M, dtype: float64
    
    In [93]: ts.resample('M').mean()
    Out[93]:
    2017-02-28   -0.536243
    2017-03-31   -0.373386
    2017-04-30    0.131691
    2017-05-31    1.775742
    Freq: M, dtype: float64
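The offset arithmetic and the two ways of computing monthly means can be checked side by side. A sketch with fixed values in place of the random data; the MonthEnd object is passed to resample instead of the 'M' string to stay clear of alias changes in recent pandas releases, and the variable names are my own.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd

now = pd.Timestamp("2017-02-18")
offset = MonthEnd()

this_month_end = offset.rollforward(now)  # end of February
prev_month_end = offset.rollback(now)     # end of January
next_month_end = now + MonthEnd(2)        # end of March

ts = pd.Series(np.arange(20, dtype=float),
               index=pd.date_range("2017-02-18", periods=20, freq="4D"))

# Group by the month end each date rolls forward to ...
by_roll = ts.groupby(offset.rollforward).mean()

# ... or let resample build the monthly bins.
by_resample = ts.resample(offset).mean()

print(by_roll)
print(by_resample)
```

Both results hold the same four monthly means; resample additionally attaches a frequency to the resulting index.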
    
    
    

    To be continued...

  • Original article: https://www.cnblogs.com/felo/p/6412910.html