  • pandas time series frequency handling

    Generating date ranges
    pd.date_range()

    In [15]: rng = pd.date_range('2000-01-01', '2000-06-30', freq='BM')

    In [16]: rng
    Out[16]:
    DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
    '2000-05-31', '2000-06-30'],
    dtype='datetime64[ns]', freq='BM')

    In [17]: pd.Series(np.random.randn(6),index=rng)
    Out[17]:
    2000-01-31 0.586341
    2000-02-29 -0.439679
    2000-03-31 0.853946
    2000-04-28 -0.740858
    2000-05-31 -0.114699
    2000-06-30 -0.529631
    Freq: BM, dtype: float64
    Frequencies and date offsets
    from pandas.tseries.offsets import Hour, Minute
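
A minimal sketch of how these offset objects can be used, assuming only the standard `Hour` and `Minute` classes imported above (the variable names are made up for illustration): offsets take an integer multiple, can be added to each other and to a timestamp, and can be passed as `freq`.

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# An offset accepts an integer multiple; fixed-length offsets can be summed.
four_hours = Hour(4)
ninety_minutes = Hour(1) + Minute(30)

# Adding an offset to a timestamp moves it forward by that amount.
start = pd.Timestamp('2000-01-01 00:00')
shifted = start + ninety_minutes

# An offset object can be used wherever a frequency string is accepted.
rng = pd.date_range('2000-01-01', periods=3, freq=four_hours)
print(shifted)
print(rng)
```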

    Shifting data
    ts.shift()
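
A small sketch of the two ways `shift` can be used, on made-up data: without `freq` the values move and the index stays put, while with `freq` the timestamps move and no data is lost.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=4, freq='D')
ts = pd.Series(np.arange(4, dtype=float), index=rng)

# Shift the values one step forward; the first slot becomes NaN.
lagged = ts.shift(1)

# Shift the index instead: every timestamp moves one day, values are kept.
moved = ts.shift(1, freq='D')

print(lagged)
print(moved)
```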

    Periods and period arithmetic
    The Period and PeriodIndex classes

    pd.period_range(): creates a regular range of periods.

    In [20]: rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')
    ...: rng
    ...:
    Out[20]: PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='int64', freq='M')
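
The arithmetic part can be sketched as follows (illustrative values only): adding an integer to a `Period` or a `PeriodIndex` moves it by that many units of its own frequency, and each period knows the timestamps at which it starts and ends.

```python
import pandas as pd

p = pd.Period('2000-01', freq='M')

# Integer arithmetic moves the period by whole months.
three_later = p + 3
one_earlier = p - 1

# A period spans an interval of time with a start and an end.
first_day = p.start_time
last_day = p.end_time

# The same arithmetic works element-wise on a PeriodIndex.
rng = pd.period_range('2000-01', '2000-06', freq='M')
moved = rng + 1
print(three_later, one_earlier, first_day, last_day)
print(moved)
```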
    Constructor:
    pd.PeriodIndex()
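
One common way to use the constructor, shown here as a sketch with made-up quarter labels, is to parse a list of strings into periods of a given frequency:

```python
import pandas as pd

# Hypothetical quarter labels; 'Q-DEC' means quarters of a year ending in December.
values = ['2001Q3', '2002Q2', '2003Q1']
index = pd.PeriodIndex(values, freq='Q-DEC')

print(index)
print(index.year, index.quarter)
```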

    Period frequency conversion
    ts.asfreq()
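
A brief sketch of `asfreq` on periods, using illustrative data: when converting to a higher frequency, `how` selects whether the result is the first or the last sub-period of the original one.

```python
import pandas as pd

# A single quarter converted to months: pick its first or its last month.
p = pd.Period('2000Q4', freq='Q-DEC')
first_month = p.asfreq('M', how='start')
last_month = p.asfreq('M', how='end')

# The same conversion on a Series indexed by monthly periods.
rng = pd.period_range('2000-01', '2000-03', freq='M')
ts = pd.Series([1.0, 2.0, 3.0], index=rng)
daily_end = ts.asfreq('D', how='end')

print(first_month, last_month)
print(daily_end)
```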

    Converting between Timestamp and Period
    In [21]: rng = pd.date_range('2000-01-01', '2000-06-30', freq='M')

    In [22]: rng
    Out[22]:
    DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-30',
    '2000-05-31', '2000-06-30'],
    dtype='datetime64[ns]', freq='M')

    In [23]: rng = pd.period_range('2000-01-01', '2000-06-30', freq='M')

    In [24]: rng
    Out[24]: PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='int64', freq='M')
    to_period() to_timestamp()

    In [25]: rng = pd.date_range('2000-01-01', periods=3, freq='M')
    ...: ts = pd.Series(np.random.randn(3), index=rng)
    ...: ts
    ...:
    Out[25]:
    2000-01-31 0.455968
    2000-02-29 1.720553
    2000-03-31 1.695834
    Freq: M, dtype: float64

    In [26]: pts = ts.to_period()
    ...: pts
    ...:
    Out[26]:
    2000-01 0.455968
    2000-02 1.720553
    2000-03 1.695834
    Freq: M, dtype: float64
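
The reverse direction, `to_timestamp()`, is not shown above. The following sketch uses its own small data set and builds the month-end index from an offset object rather than a frequency string, since the accepted month-end alias differs between pandas versions:

```python
import pandas as pd

rng = pd.date_range('2000-01-31', periods=3, freq=pd.offsets.MonthEnd())
ts = pd.Series([1.0, 2.0, 3.0], index=rng)

# Timestamps -> monthly periods.
pts = ts.to_period('M')

# Periods -> timestamps, anchored at the start or the end of each period.
back_start = pts.to_timestamp(how='start')
back_end = pts.to_timestamp(how='end')

print(pts)
print(back_start)
print(back_end)
```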
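    Resampling and frequency conversion
    Resampling is the process of converting a time series from one frequency to another.

    Aggregating higher-frequency data to a lower frequency is called downsampling, while converting lower-frequency data to a higher frequency is called upsampling (usually accompanied by interpolation).

    resample(): the workhorse function for frequency conversion

    Parameter          Description
    freq               String or DateOffset giving the target frequency, e.g. 'M', '5min', Second(15)
    how='mean'         Function name or array function that produces the aggregated values; defaults to 'mean'. Deprecated (FutureWarning: how in .resample() is deprecated, the new syntax is .resample(...).mean())
    axis=0             Axis to resample along
    fill_method=None   How to interpolate when upsampling, e.g. 'ffill' or 'bfill'. No interpolation by default.
    closed='right'     When downsampling, which end of each interval is closed (current pandas defaults to 'left' for most frequencies)
    label='right'      When downsampling, which bin edge labels the aggregated result (current pandas defaults to 'left' for most frequencies)
    loffset=None       Time adjustment applied to the bin labels, e.g. '-1s' or Second(-1) to move the labels one second earlier
    limit=None         Maximum number of periods to fill when forward or backward filling
    kind=None          Aggregate to periods (Period) or timestamps (Timestamp); defaults to the type of the time series' index
    convention=None    When resampling periods, the convention for converting low frequency to high frequency; 'end' by default according to these notes (current pandas documents 'start' as the default)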
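    Downsampling
    Using resample
    As the examples below show, there are two things to consider when downsampling with resample:

    Which side of each interval is closed.
    How to label each aggregated bin: with the start of the interval or with its end.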
    In [27]: rng = pd.date_range('2000-01-01', periods=100, freq='D')
    ...: ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ...: ts
    ...:
    Out[27]:
    2000-01-01 -0.189731
    ...
    2000-04-09 0.283110
    Freq: D, dtype: float64

    In [28]: ts.resample('M').mean()
    Out[28]:
    2000-01-31 -0.019276
    2000-02-29 -0.041192
    2000-03-31 -0.214551
    2000-04-30 0.411190
    Freq: M, dtype: float64

    In [29]: ts.resample('M', kind='period').mean()
    Out[29]:
    2000-01 -0.019276
    2000-02 -0.041192
    2000-03 -0.214551
    2000-04 0.411190
    Freq: M, dtype: float64
    In [31]: rng = pd.date_range('2000-01-01', periods=12, freq='T')
    ...: ts = pd.Series(np.arange(12), index=rng)
    ...: ts
    ...:
    Out[31]:
    2000-01-01 00:00:00 0
    2000-01-01 00:01:00 1
    2000-01-01 00:02:00 2
    2000-01-01 00:03:00 3
    2000-01-01 00:04:00 4
    2000-01-01 00:05:00 5
    2000-01-01 00:06:00 6
    2000-01-01 00:07:00 7
    2000-01-01 00:08:00 8
    2000-01-01 00:09:00 9
    2000-01-01 00:10:00 10
    2000-01-01 00:11:00 11
    Freq: T, dtype: int32

    In [32]: ts.resample('5min', closed='right', label='right').sum()
    Out[32]:
    2000-01-01 00:00:00 0
    2000-01-01 00:05:00 15
    2000-01-01 00:10:00 40
    2000-01-01 00:15:00 11
    Freq: 5T, dtype: int32

    In [33]: ts.resample('5min', closed='right',
    ...: label='right', loffset='-1s').sum()
    Out[33]:
    1999-12-31 23:59:59 0
    2000-01-01 00:04:59 15
    2000-01-01 00:09:59 40
    2000-01-01 00:14:59 11
    Freq: 5T, dtype: int32
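
To make the effect of `closed` and `label` easier to see, here is a sketch that resamples the same twelve-minute series both ways. It also shows one way to get the effect of `loffset` on recent pandas, where that argument is no longer accepted as far as I know: shift the result's index after aggregating. The `'min'` alias is used instead of `'T'` because newer versions warn about `'T'`.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=rng)

# Bins [00:00, 00:05), [00:05, 00:10), ... labelled by their left edge.
left = ts.resample('5min', closed='left', label='left').sum()

# Bins (23:55, 00:00], (00:00, 00:05], ... labelled by their right edge.
right = ts.resample('5min', closed='right', label='right').sum()

# Replacement for loffset='-1s': move the labels one second earlier afterwards.
shifted = right.copy()
shifted.index = shifted.index + pd.Timedelta(seconds=-1)

print(left)
print(right)
print(shifted)
```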
    Downsampling with groupby
    To group by month or by weekday, pass a function that accesses those fields on the time series' index.

    In [35]: rng = pd.date_range('2000-01-01', periods=100, freq='D')
    ...: ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ...: ts

    In [36]: ts.groupby(lambda x : x.month).mean()
    Out[36]:
    1 -0.126008
    2 0.079132
    3 0.026093
    4 0.321457
    dtype: float64

    In [37]: ts.groupby(lambda x : x.weekday).mean()
    Out[37]:
    0 0.280289
    1 0.174452
    2 0.166102
    3 -0.779489
    4 -0.036195
    5 0.086394
    6 0.234831
    dtype: float64
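
One thing worth noting about this approach: grouping on `x.month` pools the same calendar month across different years. The sketch below uses two years of constant made-up data to show the difference from grouping on a year-month key, here obtained with `to_period('M')`:

```python
import numpy as np
import pandas as pd

# Two full years of daily data, every value equal to 1.
rng = pd.date_range('2000-01-01', '2001-12-31', freq='D')
ts = pd.Series(np.ones(len(rng)), index=rng)

# 12 groups: January 2000 and January 2001 land in the same group.
by_month = ts.groupby(lambda x: x.month).sum()

# 24 groups: one per calendar month of each year.
by_year_month = ts.groupby(ts.index.to_period('M')).sum()

print(by_month)
print(by_year_month.head())
```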
    Upsampling
    In [38]: import pandas as pd
    ...: import numpy as np
    ...: frame = pd.DataFrame(np.random.randn(2, 4),
    ...: index=pd.date_range('1/1/2000', periods=2,
    ...: freq='W-WED'),
    ...: columns=['Colorado', 'Texas', 'New York', 'Ohio'])
    ...: frame
    ...:
    Out[38]:
    Colorado Texas New York Ohio
    2000-01-05 -0.925525 -0.434350 1.037349 -1.532790
    2000-01-12 1.075744 0.237922 -0.907699 0.592211

    In [39]: df_daily = frame.resample('D').asfreq()
    ...: df_daily
    ...:
    Out[39]:
    Colorado Texas New York Ohio
    2000-01-05 -0.925525 -0.434350 1.037349 -1.532790
    2000-01-06 NaN NaN NaN NaN
    2000-01-07 NaN NaN NaN NaN
    2000-01-08 NaN NaN NaN NaN
    2000-01-09 NaN NaN NaN NaN
    2000-01-10 NaN NaN NaN NaN
    2000-01-11 NaN NaN NaN NaN
    2000-01-12 1.075744 0.237922 -0.907699 0.592211

    In [40]: frame.resample('D').ffill()
    Out[40]:
    Colorado Texas New York Ohio
    2000-01-05 -0.925525 -0.434350 1.037349 -1.532790
    2000-01-06 -0.925525 -0.434350 1.037349 -1.532790
    2000-01-07 -0.925525 -0.434350 1.037349 -1.532790
    2000-01-08 -0.925525 -0.434350 1.037349 -1.532790
    2000-01-09 -0.925525 -0.434350 1.037349 -1.532790
    2000-01-10 -0.925525 -0.434350 1.037349 -1.532790
    2000-01-11 -0.925525 -0.434350 1.037349 -1.532790
    2000-01-12 1.075744 0.237922 -0.907699 0.592211

    # Previously: frame.resample('D', how='mean')

    In [41]: df_daily = frame.resample('D').mean()
    ...: df_daily
    ...:
    Out[41]:
    Colorado Texas New York Ohio
    2000-01-05 -0.925525 -0.434350 1.037349 -1.532790
    2000-01-06 NaN NaN NaN NaN
    2000-01-07 NaN NaN NaN NaN
    2000-01-08 NaN NaN NaN NaN
    2000-01-09 NaN NaN NaN NaN
    2000-01-10 NaN NaN NaN NaN
    2000-01-11 NaN NaN NaN NaN
    2000-01-12 1.075744 0.237922 -0.907699 0.592211
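
Two further options when upsampling, sketched on a small made-up frame: limiting how far a forward fill may reach (the `limit` setting from the parameter table), and interpolating linearly between the known observations.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {'Colorado': [1.0, 8.0], 'Texas': [10.0, 80.0]},
    index=pd.date_range('2000-01-05', periods=2, freq='W-WED'),
)

# Forward fill at most two days past each observation; the rest stays NaN.
limited = frame.resample('D').ffill(limit=2)

# Linear interpolation between the two weekly observations.
interp = frame.resample('D').interpolate()

print(limited)
print(interp)
```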
    Resampling with periods
    In [42]: frame = pd.DataFrame(np.random.randn(24, 4),
    ...: index=pd.period_range('1-2000', '12-2001',
    ...: freq='M'),
    ...: columns=['Colorado', 'Texas', 'New York', 'Ohio'])
    ...: frame[:5]
    ...: annual_frame = frame.resample('A-DEC').mean()
    ...: annual_frame
    ...:
    Out[42]:
    Colorado Texas New York Ohio
    2000 0.442672 0.104870 -0.067043 -0.128942
    2001 -0.263757 -0.399865 -0.423485 0.026256

    In [43]: annual_frame.resample('Q-DEC', convention='end').ffill()
    Out[43]:
    Colorado Texas New York Ohio
    2000Q4 0.442672 0.104870 -0.067043 -0.128942
    2001Q1 0.442672 0.104870 -0.067043 -0.128942
    2001Q2 0.442672 0.104870 -0.067043 -0.128942
    2001Q3 0.442672 0.104870 -0.067043 -0.128942
    2001Q4 -0.263757 -0.399865 -0.423485 0.026256

  • Original source: https://www.cnblogs.com/zuichuyouren/p/11277411.html