  • pandas: Time Series Indexing

    import numpy as np 
    import pandas as pd 
    

    Introduction

    A basic kind of time series object in pandas is a Series indexed by timestamps, which are often represented outside pandas as Python strings or datetime objects:

    from datetime import datetime
    
    dates = [
        datetime(2011, 1, 2),
        datetime(2011, 1, 5),
        datetime(2011, 1, 7),
        datetime(2011, 1, 8),
        datetime(2011, 1, 10),
        datetime(2011, 1, 12)
    ]
    
    ts = pd.Series(np.random.randn(6), index=dates)
    
    ts
    
    2011-01-02    0.825502
    2011-01-05    0.453766
    2011-01-07    0.077024
    2011-01-08   -1.320742
    2011-01-10   -1.109912
    2011-01-12   -0.469907
    dtype: float64
    

    Under the hood, these datetime objects have been put in a DatetimeIndex:

    ts.index
    
    DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
                   '2011-01-10', '2011-01-12'],
                  dtype='datetime64[ns]', freq=None)
    

    Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:

    ts + ts[::2]
    
    2011-01-02    1.651004
    2011-01-05         NaN
    2011-01-07    0.154049
    2011-01-08         NaN
    2011-01-10   -2.219823
    2011-01-12         NaN
    dtype: float64
    

    Recall that ts[::2] selects every second element of ts, so the dates missing from that selection produce NaN in the sum.
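
    The alignment behavior can be checked with fixed values instead of random ones. The following is a minimal sketch; the four dates and the values 1.0 to 4.0 are made up for illustration and are not the data used above.

```python
from datetime import datetime

import pandas as pd

# Fixed, made-up values so the result is reproducible.
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8)]
ts = pd.Series([1.0, 2.0, 3.0, 4.0], index=dates)

every_other = ts[::2]        # keeps 2011-01-02 and 2011-01-07
total = ts + every_other     # aligned on the dates, not on position

# Dates present in both operands are summed; the others become NaN.
print(total)
```

    If NaN is not wanted for the unmatched dates, ts.add(every_other, fill_value=0) treats the missing side as zero.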

    pandas stores timestamps using NumPy's datetime64 data type at nanosecond resolution:

    ts.index.dtype
    
    dtype('<M8[ns]')
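
    To see the NumPy storage directly, the sketch below pulls the underlying array out of a small DatetimeIndex. The two dates are arbitrary, and the unit printed in the dtype may depend on the installed pandas version, so treat the printed unit as an assumption rather than a guarantee.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2011-01-02', '2011-01-05'])

# The index is backed by a NumPy datetime64 array.
values = idx.values
print(type(values), values.dtype)

# Each entry is stored as an integer count of time units since the Unix epoch.
as_int = values.astype('int64')
print(as_int)

# Converting a raw datetime64 value back gives the same Timestamp.
round_trip = pd.Timestamp(values[0])
print(round_trip)
```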
    

    Scalar values from a DatetimeIndex are pandas Timestamp objects:

    stamp = ts.index[0]
    
    stamp
    
    Timestamp('2011-01-02 00:00:00')
    

    A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it understands how to do time zone conversions and other kinds of time series manipulations, and in older pandas versions it could also store frequency information, if any (the Timestamp.freq attribute was removed in pandas 2.0). More on these later.
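
    A short sketch of what "substituted anywhere" means in practice: Timestamp is a subclass of datetime, so comparisons and timedelta arithmetic behave as expected. The date is arbitrary, and UTC is used only because it needs no time zone database.

```python
from datetime import datetime, timedelta

import pandas as pd

stamp = pd.Timestamp('2011-01-02')

# Timestamp subclasses datetime, so it passes isinstance checks.
is_datetime = isinstance(stamp, datetime)
print(is_datetime)

# It compares equal to the matching datetime object.
same_moment = stamp == datetime(2011, 1, 2)
print(same_moment)

# Standard timedelta arithmetic works and returns a Timestamp.
later = stamp + timedelta(days=3)
print(later)

# Attaching a time zone is a method call on the Timestamp itself.
utc_stamp = stamp.tz_localize('UTC')
print(utc_stamp)
```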

    Indexing and Slicing

    A time series behaves like any other pandas.Series when you index and select data based on labels:

    stamp = ts.index[2]
    
    ts[stamp]
    
    0.0770243257021936
    

    As a convenience, you can also pass a string that is interpretable as a date:

    ts['1/10/2011']
    
    -1.109911691867437
    
    ts['20110110']
    
    -1.109911691867437
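
    The two string spellings above resolve to the same label. A minimal sketch with made-up values (10.0, 20.0, 30.0) shows that a month/day/year string, a compact YYYYMMDD string, and a datetime object all select the same element:

```python
from datetime import datetime

import pandas as pd

index = pd.to_datetime(['2011-01-08', '2011-01-10', '2011-01-12'])
ts = pd.Series([10.0, 20.0, 30.0], index=index)

by_slash = ts['1/10/2011']             # month/day/year string
by_compact = ts['20110110']            # compact YYYYMMDD string
by_datetime = ts[datetime(2011, 1, 10)]

print(by_slash, by_compact, by_datetime)
```

    Ambiguous strings such as '1/10/2011' are read month first here; the ISO form '2011-01-10' avoids the ambiguity.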
    

    For longer time series, a year or only a year and month can be passed to easily select slices of data:

    longer_ts = pd.Series(np.random.randn(1000),
                         index=pd.date_range('1/1/2000', periods=1000))
    
    longer_ts[:5]
    
    2000-01-01    0.401394
    2000-01-02    0.720214
    2000-01-03    0.488505
    2000-01-04    0.446179
    2000-01-05   -2.129299
    Freq: D, dtype: float64
    
    longer_ts['2001'][:5]
    
    2001-01-01    0.315472
    2001-01-02    0.796386
    2001-01-03    0.611503
    2001-01-04    0.980799
    2001-01-05    0.184401
    Freq: D, dtype: float64
    

    Here, the string '2001' is interpreted as a year and selects that time period. This also works if you specify the month:

    longer_ts['2001-05'][:5]
    
    2001-05-01    0.439009
    2001-05-02   -0.304236
    2001-05-03    0.603268
    2001-05-04   -0.726460
    2001-05-05   -0.521669
    Freq: D, dtype: float64
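
    One way to confirm that a partial string selects the whole period is to count the rows it returns. This sketch rebuilds a 1000-day series with sequential values instead of random ones (an assumption made so the result is reproducible) and uses .loc for the selection:

```python
import numpy as np
import pandas as pd

longer_ts = pd.Series(np.arange(1000.0),
                      index=pd.date_range('2000-01-01', periods=1000))

year_2001 = longer_ts.loc['2001']      # every day in 2001
may_2001 = longer_ts.loc['2001-05']    # every day in May 2001

print(len(year_2001))                  # 365
print(len(may_2001))                   # 31
print(may_2001.index[0], may_2001.index[-1])
```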
    

    Slicing with datetime objects works as well:

    ts[datetime(2011, 1, 7):]
    
    2011-01-07    0.077024
    2011-01-08   -1.320742
    2011-01-10   -1.109912
    2011-01-12   -0.469907
    dtype: float64
    

    Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:

    ts
    
    2011-01-02    0.825502
    2011-01-05    0.453766
    2011-01-07    0.077024
    2011-01-08   -1.320742
    2011-01-10   -1.109912
    2011-01-12   -0.469907
    dtype: float64
    
    ts['1/6/2011': '1/11/2011']
    
    2011-01-07    0.077024
    2011-01-08   -1.320742
    2011-01-10   -1.109912
    dtype: float64
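
    The range query can be reproduced with integer values so the selected rows are easy to read off. This is a sketch under the assumption that the index is sorted, which label slicing with missing endpoints relies on:

```python
import pandas as pd

index = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
ts = pd.Series(range(6), index=index)

# Neither endpoint exists in the index; the slice returns what falls between.
window = ts.loc['2011-01-06':'2011-01-11']
print(window.tolist())      # [2, 3, 4]

# Endpoints that do exist are both included (label slices are closed).
closed = ts.loc['2011-01-05':'2011-01-10']
print(closed.tolist())      # [1, 2, 3, 4]
```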
    

    As before, you can pass a string date, a datetime, or a Timestamp. Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays: no data is copied, and modifications to the slice are reflected in the original data. (With Copy-on-Write enabled, which is the default from pandas 3.0, modifying the slice no longer changes the original.)

    There is an equivalent instance method, truncate, that slices a Series between two dates:

    ts.truncate(after='1/9/2011')
    
    2011-01-02    0.825502
    2011-01-05    0.453766
    2011-01-07    0.077024
    2011-01-08   -1.320742
    dtype: float64
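
    truncate also accepts a before argument, so both ends can be clipped in one call. The sketch below uses integer values for readability and compares the result with the equivalent label slice:

```python
import pandas as pd

index = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                        '2011-01-08', '2011-01-10', '2011-01-12'])
ts = pd.Series(range(6), index=index)

# Drop everything before Jan 6 and everything after Jan 9.
clipped = ts.truncate(before='2011-01-06', after='2011-01-09')
print(clipped.tolist())     # [2, 3]

# The same rows come back from a label slice.
sliced = ts.loc['2011-01-06':'2011-01-09']
print(clipped.equals(sliced))
```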
    

    All of this holds true for DataFrame as well, indexing on its rows:

    # periods: number of dates to generate; freq: spacing between them
    dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')
    
    long_df = pd.DataFrame(np.random.randn(100, 4), 
                          index=dates, 
                          columns=['Colorado', 'Texas', 'New York', 'Ohio'])
    
    long_df.loc['5-2001']
    
    
                  Colorado     Texas  New York      Ohio
    2001-05-02  0.972317  0.407519  0.628906  1.995901
    2001-05-09  0.299961 -1.208505  1.019247  2.244728
    2001-05-16  0.628163 -0.716498  0.621912  1.257635
    2001-05-23  0.508852  0.753517 -0.793127  0.273496
    2001-05-30 -1.443141 -0.878143 -0.680227  0.455401
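
    With a DataFrame, .loc can combine a partial date string for the rows with a column label. The following sketch replaces the random data with sequential integers (an assumption for reproducibility) and keeps the same weekly Wednesday index:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', periods=100, freq='W-WED')
long_df = pd.DataFrame(np.arange(400).reshape(100, 4),
                       index=dates,
                       columns=['Colorado', 'Texas', 'New York', 'Ohio'])

# All rows whose date falls in May 2001.
may = long_df.loc['2001-05']
print(may.index.day.tolist())       # [2, 9, 16, 23, 30]

# Rows for May 2001, restricted to a single column.
texas_may = long_df.loc['2001-05', 'Texas']
print(texas_may)
```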

    Duplicate Indices

    • ts.index.is_unique
    • ts.groupby(level=0)

    In some applications, there may be multiple observations falling on a particular timestamp. Here is an example:

    dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', 
                             '1/2/2000', '1/2/2000', '1/3/2000'
                             ])
    
    dup_ts = pd.Series(np.arange(5), index=dates)
    
    dup_ts
    
    2000-01-01    0
    2000-01-02    1
    2000-01-02    2
    2000-01-02    3
    2000-01-03    4
    dtype: int32
    

    We can tell that the index is not unique by checking its is_unique property:

    dup_ts.index.is_unique
    
    False
    

    Indexing into this time series will now produce either scalar values or slices, depending on whether a timestamp is duplicated:

    dup_ts['1/3/2000']  # not duplicated
    
    4
    
    dup_ts['1/2/2000']  # duplicated
    
    2000-01-02    1
    2000-01-02    2
    2000-01-02    3
    dtype: int32
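
    Because the return type depends on the data, code that must always receive a Series can use a label slice instead of a single label. This is a sketch of that workaround, not something the original text prescribes:

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_ts = pd.Series(np.arange(5), index=dates)

single = dup_ts.loc['2000-01-03']     # unique label: a scalar
multi = dup_ts.loc['2000-01-02']      # duplicated label: a Series

# A label slice returns a Series in both cases.
always_series = dup_ts.loc['2000-01-03':'2000-01-03']

print(type(single).__name__, type(multi).__name__,
      type(always_series).__name__)
```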
    

    Suppose you wanted to aggregate the data having non-unique timestamps. One way to do this is to use groupby and pass level=0:

    grouped = dup_ts.groupby(level=0)  # level=0 groups by the index; calling groupby() with no arguments raises an error
    
    grouped.mean()  
    
    2000-01-01    0
    2000-01-02    2
    2000-01-03    4
    dtype: int32
    
    grouped.count()
    
    2000-01-01    1
    2000-01-02    3
    2000-01-03    1
    dtype: int64
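
    Aggregating is one option; another is to keep a single observation per timestamp. The sketch below shows both on the same data, using Index.duplicated to keep the first row of each group. Which strategy fits depends on the data, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup_ts = pd.Series(np.arange(5), index=dates)

grouped = dup_ts.groupby(level=0)     # group rows that share a timestamp
means = grouped.mean()
counts = grouped.count()

# Alternative: drop repeats, keeping the first row for each timestamp.
first_only = dup_ts[~dup_ts.index.duplicated(keep='first')]

print(means.tolist())        # [0.0, 2.0, 4.0]
print(counts.tolist())       # [1, 3, 1]
print(first_only.tolist())   # [0, 1, 4]
```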
  • Original article: https://www.cnblogs.com/chenjieyouge/p/12046377.html