  • Python Time Series Analysis

    Time Series and Time Series Analysis

      In production and scientific research, observing and measuring one variable or a group of variables at a sequence of time points yields a collection of discrete values called a time series.
      Time series analysis is the theory and methodology of building mathematical models from observed time series data via curve fitting and parameter estimation. It is widely used in macroeconomic control, market-potential forecasting, weather forecasting, crop pest and disaster prediction, and many other areas.

    Generating time series with Pandas:

    import pandas as pd
    import numpy as np  

    Time series

    • Timestamp (timestamp)
    • Fixed period (period)
    • Time interval (interval) — a minimal sketch of all three follows
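
      A minimal sketch of the three representations (pd.Interval is an addition here; the original only demonstrates Timestamp and Period further below):

    import pandas as pd

    ts = pd.Timestamp('2016-07-10 10:15')          # timestamp: a single point in time
    per = pd.Period('2016-07', freq='M')           # period: the whole month of July 2016
    iv = pd.Interval(pd.Timestamp('2016-07-01'),
                     pd.Timestamp('2016-07-10'))   # interval: the span between two points
    print(ts, per, iv, sep='\n')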

    date_range

    • You can specify a start time and a number of periods
    • H: hour
    • D: day
    • M: month
    # Several ways to write TIMES: 2016 Jul 1; 7/1/2016; 1/7/2016; 2016-07-01; 2016/07/01
    rng = pd.date_range('2016-07-01', periods = 10, freq = '3D')  # freq defaults to 'D' if omitted
    rng
    

      Result:

    DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-07', '2016-07-10',
                   '2016-07-13', '2016-07-16', '2016-07-19', '2016-07-22',
                   '2016-07-25', '2016-07-28'],
                  dtype='datetime64[ns]', freq='3D')
    import datetime as dt  # needed for dt.datetime below
    time = pd.Series(np.random.randn(20),
               index=pd.date_range(dt.datetime(2016,1,1), periods=20))
    print(time)
    # Result:
    2016-01-01   -0.129379
    2016-01-02    0.164480
    2016-01-03   -0.639117
    2016-01-04   -0.427224
    2016-01-05    2.055133
    2016-01-06    1.116075
    2016-01-07    0.357426
    2016-01-08    0.274249
    2016-01-09    0.834405
    2016-01-10   -0.005444
    2016-01-11   -0.134409
    2016-01-12    0.249318
    2016-01-13   -0.297842
    2016-01-14   -0.128514
    2016-01-15    0.063690
    2016-01-16   -2.246031
    2016-01-17    0.359552
    2016-01-18    0.383030
    2016-01-19    0.402717
    2016-01-20   -0.694068
    Freq: D, dtype: float64
    

    Filtering with truncate

    time.truncate(before='2016-1-10')  # everything before Jan 10 is dropped
    

      Result:

    2016-01-10   -0.005444
    2016-01-11   -0.134409
    2016-01-12    0.249318
    2016-01-13   -0.297842
    2016-01-14   -0.128514
    2016-01-15    0.063690
    2016-01-16   -2.246031
    2016-01-17    0.359552
    2016-01-18    0.383030
    2016-01-19    0.402717
    2016-01-20   -0.694068
    Freq: D, dtype: float64
    time.truncate(after='2016-1-10')  # everything after Jan 10 is dropped
    # Result:
    2016-01-01   -0.129379
    2016-01-02    0.164480
    2016-01-03   -0.639117
    2016-01-04   -0.427224
    2016-01-05    2.055133
    2016-01-06    1.116075
    2016-01-07    0.357426
    2016-01-08    0.274249
    2016-01-09    0.834405
    2016-01-10   -0.005444
    Freq: D, dtype: float64
    

      

    print(time['2016-01-15'])#0.063690487247
    print(time['2016-01-15':'2016-01-20'])
    Result:
    2016-01-15    0.063690
    2016-01-16   -2.246031
    2016-01-17    0.359552
    2016-01-18    0.383030
    2016-01-19    0.402717
    2016-01-20   -0.694068
    Freq: D, dtype: float64
    
    
    data=pd.date_range('2010-01-01','2011-01-01',freq='M')
    print(data)
    # Result:
    DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',
                   '2010-05-31', '2010-06-30', '2010-07-31', '2010-08-31',
                   '2010-09-30', '2010-10-31', '2010-11-30', '2010-12-31'],
                  dtype='datetime64[ns]', freq='M')
    

      

    # Timestamps
    pd.Timestamp('2016-07-10')#Timestamp('2016-07-10 00:00:00')
    # More detail can be specified
    pd.Timestamp('2016-07-10 10')#Timestamp('2016-07-10 10:00:00')
    pd.Timestamp('2016-07-10 10:15')#Timestamp('2016-07-10 10:15:00')
    
    # How much detail can you add?
    t = pd.Timestamp('2016-07-10 10:15')
    
    # Time periods
    pd.Period('2016-01')#Period('2016-01', 'M')
    pd.Period('2016-01-01')#Period('2016-01-01', 'D')
    
    # TIME OFFSETS
    pd.Timedelta('1 day')#Timedelta('1 days 00:00:00')
    pd.Period('2016-01-01 10:10') + pd.Timedelta('1 day')#Period('2016-01-02 10:10', 'T')
    pd.Timestamp('2016-01-01 10:10') + pd.Timedelta('1 day')#Timestamp('2016-01-02 10:10:00')
    pd.Timestamp('2016-01-01 10:10') + pd.Timedelta('15 ns')#Timestamp('2016-01-01 10:10:00.000000015')
    
    p1 = pd.period_range('2016-01-01 10:10', freq = '25H', periods = 10)
    p2 = pd.period_range('2016-01-01 10:10', freq = '1D1H', periods = 10)
    p1
    p2
    Result:
    PeriodIndex(['2016-01-01 10:00', '2016-01-02 11:00', '2016-01-03 12:00',
                 '2016-01-04 13:00', '2016-01-05 14:00', '2016-01-06 15:00',
                 '2016-01-07 16:00', '2016-01-08 17:00', '2016-01-09 18:00',
                 '2016-01-10 19:00'],
                dtype='period[25H]', freq='25H')
    PeriodIndex(['2016-01-01 10:00', '2016-01-02 11:00', '2016-01-03 12:00',
                 '2016-01-04 13:00', '2016-01-05 14:00', '2016-01-06 15:00',
                 '2016-01-07 16:00', '2016-01-08 17:00', '2016-01-09 18:00',
                 '2016-01-10 19:00'],
                dtype='period[25H]', freq='25H')
    
    # Specifying an index
    rng = pd.date_range('2016 Jul 1', periods = 10, freq = 'D')
    rng
    pd.Series(range(len(rng)), index = rng)
    Result:
    2016-07-01    0
    2016-07-02    1
    2016-07-03    2
    2016-07-04    3
    2016-07-05    4
    2016-07-06    5
    2016-07-07    6
    2016-07-08    7
    2016-07-09    8
    2016-07-10    9
    Freq: D, dtype: int32
    
    periods = [pd.Period('2016-01'), pd.Period('2016-02'), pd.Period('2016-03')]
    ts = pd.Series(np.random.randn(len(periods)), index = periods)
    ts
    Result:
    2016-01   -0.015837
    2016-02   -0.923463
    2016-03   -0.485212
    Freq: M, dtype: float64
    
    type(ts.index)#pandas.core.indexes.period.PeriodIndex
    
    # Timestamps and periods can be converted into each other
    ts = pd.Series(range(10), pd.date_range('07-10-16 8:00', periods = 10, freq = 'H'))
    ts
    Result:
    2016-07-10 08:00:00    0
    2016-07-10 09:00:00    1
    2016-07-10 10:00:00    2
    2016-07-10 11:00:00    3
    2016-07-10 12:00:00    4
    2016-07-10 13:00:00    5
    2016-07-10 14:00:00    6
    2016-07-10 15:00:00    7
    2016-07-10 16:00:00    8
    2016-07-10 17:00:00    9
    Freq: H, dtype: int32
    
    ts_period = ts.to_period()
    ts_period
    Result:
    2016-07-10 08:00    0
    2016-07-10 09:00    1
    2016-07-10 10:00    2
    2016-07-10 11:00    3
    2016-07-10 12:00    4
    2016-07-10 13:00    5
    2016-07-10 14:00    6
    2016-07-10 15:00    7
    2016-07-10 16:00    8
    2016-07-10 17:00    9
    Freq: H, dtype: int32
    
    Difference between time periods and timestamps
    
    ts_period['2016-07-10 08:30':'2016-07-10 11:45']  # period slicing includes the 08:00 period, since that period contains 08:30
    Result:
    2016-07-10 08:00    0
    2016-07-10 09:00    1
    2016-07-10 10:00    2
    2016-07-10 11:00    3
    Freq: H, dtype: int32
    
    ts['2016-07-10 08:30':'2016-07-10 11:45']  # timestamp slicing excludes 08:00, which lies before 08:30
    # Result:
    2016-07-10 09:00:00    1
    2016-07-10 10:00:00    2
    2016-07-10 11:00:00    3
    Freq: H, dtype: int32
    

    Resampling:

    • Converting time series data from one frequency to another
    • Downsampling
    • Upsampling
    import pandas as pd
    import numpy as np
    rng = pd.date_range('1/1/2011', periods=90, freq='D')  # daily data
    ts = pd.Series(np.random.randn(len(rng)), index=rng)
    ts.head()
    Result:
    2011-01-01   -1.025562
    2011-01-02    0.410895
    2011-01-03    0.660311
    2011-01-04    0.710293
    2011-01-05    0.444985
    Freq: D, dtype: float64
    
    ts.resample('M').sum()  # downsample to monthly; sum is used here, but mean or any other aggregate can be specified
    Result:
    2011-01-31    2.510102
    2011-02-28    0.583209
    2011-03-31    2.749411
    Freq: M, dtype: float64
    
    ts.resample('3D').sum()  # downsample to 3-day bins
    Result:
    2011-01-01    0.045643
    2011-01-04   -2.255206
    2011-01-07    0.571142
    2011-01-10    0.835032
    2011-01-13   -0.396766
    2011-01-16   -1.156253
    2011-01-19   -1.286884
    2011-01-22    2.883952
    2011-01-25    1.566908
    2011-01-28    1.435563
    2011-01-31    0.311565
    2011-02-03   -2.541235
    2011-02-06    0.317075
    2011-02-09    1.598877
    2011-02-12   -1.950509
    2011-02-15    2.928312
    2011-02-18   -0.733715
    2011-02-21    1.674817
    2011-02-24   -2.078872
    2011-02-27    2.172320
    2011-03-02   -2.022104
    2011-03-05   -0.070356
    2011-03-08    1.276671
    2011-03-11   -2.835132
    2011-03-14   -1.384113
    2011-03-17    1.517565
    2011-03-20   -0.550406
    2011-03-23    0.773430
    2011-03-26    2.244319
    2011-03-29    2.951082
    Freq: 3D, dtype: float64
    
    day3Ts = ts.resample('3D').mean()
    day3Ts
    Result:
    2011-01-01    0.015214
    2011-01-04   -0.751735
    2011-01-07    0.190381
    2011-01-10    0.278344
    2011-01-13   -0.132255
    2011-01-16   -0.385418
    2011-01-19   -0.428961
    2011-01-22    0.961317
    2011-01-25    0.522303
    2011-01-28    0.478521
    2011-01-31    0.103855
    2011-02-03   -0.847078
    2011-02-06    0.105692
    2011-02-09    0.532959
    2011-02-12   -0.650170
    2011-02-15    0.976104
    2011-02-18   -0.244572
    2011-02-21    0.558272
    2011-02-24   -0.692957
    2011-02-27    0.724107
    2011-03-02   -0.674035
    2011-03-05   -0.023452
    2011-03-08    0.425557
    2011-03-11   -0.945044
    2011-03-14   -0.461371
    2011-03-17    0.505855
    2011-03-20   -0.183469
    2011-03-23    0.257810
    2011-03-26    0.748106
    2011-03-29    0.983694
    Freq: 3D, dtype: float64
    
    print(day3Ts.resample('D').asfreq())  # upsampling introduces gaps that must be filled by interpolation
    Result:
    2011-01-01    0.015214
    2011-01-02         NaN
    2011-01-03         NaN
    2011-01-04   -0.751735
    2011-01-05         NaN
    2011-01-06         NaN
    2011-01-07    0.190381
    2011-01-08         NaN
    2011-01-09         NaN
    2011-01-10    0.278344
    2011-01-11         NaN
    2011-01-12         NaN
    2011-01-13   -0.132255
    2011-01-14         NaN
    2011-01-15         NaN
    2011-01-16   -0.385418
    2011-01-17         NaN
    2011-01-18         NaN
    2011-01-19   -0.428961
    2011-01-20         NaN
    2011-01-21         NaN
    2011-01-22    0.961317
    2011-01-23         NaN
    2011-01-24         NaN
    2011-01-25    0.522303
    2011-01-26         NaN
    2011-01-27         NaN
    2011-01-28    0.478521
    2011-01-29         NaN
    2011-01-30         NaN
                    ...   
    2011-02-28         NaN
    2011-03-01         NaN
    2011-03-02   -0.674035
    2011-03-03         NaN
    2011-03-04         NaN
    2011-03-05   -0.023452
    2011-03-06         NaN
    2011-03-07         NaN
    2011-03-08    0.425557
    2011-03-09         NaN
    2011-03-10         NaN
    2011-03-11   -0.945044
    2011-03-12         NaN
    2011-03-13         NaN
    2011-03-14   -0.461371
    2011-03-15         NaN
    2011-03-16         NaN
    2011-03-17    0.505855
    2011-03-18         NaN
    2011-03-19         NaN
    2011-03-20   -0.183469
    2011-03-21         NaN
    2011-03-22         NaN
    2011-03-23    0.257810
    2011-03-24         NaN
    2011-03-25         NaN
    2011-03-26    0.748106
    2011-03-27         NaN
    2011-03-28         NaN
    2011-03-29    0.983694
    Freq: D, Length: 88, dtype: float64

    Interpolation methods:

    • ffill: fill gaps with the previous value
    • bfill: fill gaps with the next value
    • interpolate: linear interpolation
    day3Ts.resample('D').ffill(1)  # forward-fill, at most 1 step
    Result:
    2011-01-01    0.015214
    2011-01-02    0.015214
    2011-01-03         NaN
    2011-01-04   -0.751735
    2011-01-05   -0.751735
    2011-01-06         NaN
    2011-01-07    0.190381
    2011-01-08    0.190381
    2011-01-09         NaN
    2011-01-10    0.278344
    2011-01-11    0.278344
    2011-01-12         NaN
    2011-01-13   -0.132255
    2011-01-14   -0.132255
    2011-01-15         NaN
    2011-01-16   -0.385418
    2011-01-17   -0.385418
    2011-01-18         NaN
    2011-01-19   -0.428961
    2011-01-20   -0.428961
    2011-01-21         NaN
    2011-01-22    0.961317
    2011-01-23    0.961317
    2011-01-24         NaN
    2011-01-25    0.522303
    2011-01-26    0.522303
    2011-01-27         NaN
    2011-01-28    0.478521
    2011-01-29    0.478521
    2011-01-30         NaN
                    ...   
    2011-02-28    0.724107
    2011-03-01         NaN
    2011-03-02   -0.674035
    2011-03-03   -0.674035
    2011-03-04         NaN
    2011-03-05   -0.023452
    2011-03-06   -0.023452
    2011-03-07         NaN
    2011-03-08    0.425557
    2011-03-09    0.425557
    2011-03-10         NaN
    2011-03-11   -0.945044
    2011-03-12   -0.945044
    2011-03-13         NaN
    2011-03-14   -0.461371
    2011-03-15   -0.461371
    2011-03-16         NaN
    2011-03-17    0.505855
    2011-03-18    0.505855
    2011-03-19         NaN
    2011-03-20   -0.183469
    2011-03-21   -0.183469
    2011-03-22         NaN
    2011-03-23    0.257810
    2011-03-24    0.257810
    2011-03-25         NaN
    2011-03-26    0.748106
    2011-03-27    0.748106
    2011-03-28         NaN
    2011-03-29    0.983694
    Freq: D, Length: 88, dtype: float64
    
    day3Ts.resample('D').bfill(1)  # back-fill, at most 1 step
    Result:
    2011-01-01    0.015214
    2011-01-02         NaN
    2011-01-03   -0.751735
    2011-01-04   -0.751735
    2011-01-05         NaN
    2011-01-06    0.190381
    2011-01-07    0.190381
    2011-01-08         NaN
    2011-01-09    0.278344
    2011-01-10    0.278344
    2011-01-11         NaN
    2011-01-12   -0.132255
    2011-01-13   -0.132255
    2011-01-14         NaN
    2011-01-15   -0.385418
    2011-01-16   -0.385418
    2011-01-17         NaN
    2011-01-18   -0.428961
    2011-01-19   -0.428961
    2011-01-20         NaN
    2011-01-21    0.961317
    2011-01-22    0.961317
    2011-01-23         NaN
    2011-01-24    0.522303
    2011-01-25    0.522303
    2011-01-26         NaN
    2011-01-27    0.478521
    2011-01-28    0.478521
    2011-01-29         NaN
    2011-01-30    0.103855
                    ...   
    2011-02-28         NaN
    2011-03-01   -0.674035
    2011-03-02   -0.674035
    2011-03-03         NaN
    2011-03-04   -0.023452
    2011-03-05   -0.023452
    2011-03-06         NaN
    2011-03-07    0.425557
    2011-03-08    0.425557
    2011-03-09         NaN
    2011-03-10   -0.945044
    2011-03-11   -0.945044
    2011-03-12         NaN
    2011-03-13   -0.461371
    2011-03-14   -0.461371
    2011-03-15         NaN
    2011-03-16    0.505855
    2011-03-17    0.505855
    2011-03-18         NaN
    2011-03-19   -0.183469
    2011-03-20   -0.183469
    2011-03-21         NaN
    2011-03-22    0.257810
    2011-03-23    0.257810
    2011-03-24         NaN
    2011-03-25    0.748106
    2011-03-26    0.748106
    2011-03-27         NaN
    2011-03-28    0.983694
    2011-03-29    0.983694
    Freq: D, Length: 88, dtype: float64
    
    day3Ts.resample('D').interpolate('linear')  # linear interpolation between known points
    Result:
    2011-01-01    0.015214
    2011-01-02   -0.240435
    2011-01-03   -0.496085
    2011-01-04   -0.751735
    2011-01-05   -0.437697
    2011-01-06   -0.123658
    2011-01-07    0.190381
    2011-01-08    0.219702
    2011-01-09    0.249023
    2011-01-10    0.278344
    2011-01-11    0.141478
    2011-01-12    0.004611
    2011-01-13   -0.132255
    2011-01-14   -0.216643
    2011-01-15   -0.301030
    2011-01-16   -0.385418
    2011-01-17   -0.399932
    2011-01-18   -0.414447
    2011-01-19   -0.428961
    2011-01-20    0.034465
    2011-01-21    0.497891
    2011-01-22    0.961317
    2011-01-23    0.814979
    2011-01-24    0.668641
    2011-01-25    0.522303
    2011-01-26    0.507709
    2011-01-27    0.493115
    2011-01-28    0.478521
    2011-01-29    0.353632
    2011-01-30    0.228744
                    ...   
    2011-02-28    0.258060
    2011-03-01   -0.207988
    2011-03-02   -0.674035
    2011-03-03   -0.457174
    2011-03-04   -0.240313
    2011-03-05   -0.023452
    2011-03-06    0.126218
    2011-03-07    0.275887
    2011-03-08    0.425557
    2011-03-09   -0.031310
    2011-03-10   -0.488177
    2011-03-11   -0.945044
    2011-03-12   -0.783820
    2011-03-13   -0.622595
    2011-03-14   -0.461371
    2011-03-15   -0.138962
    2011-03-16    0.183446
    2011-03-17    0.505855
    2011-03-18    0.276080
    2011-03-19    0.046306
    2011-03-20   -0.183469
    2011-03-21   -0.036376
    2011-03-22    0.110717
    2011-03-23    0.257810
    2011-03-24    0.421242
    2011-03-25    0.584674
    2011-03-26    0.748106
    2011-03-27    0.826636
    2011-03-28    0.905165
    2011-03-29    0.983694
    Freq: D, Length: 88, dtype: float64
    

    Pandas rolling windows:

      A rolling (sliding) window frames the time series with a window of a specified length and computes statistics over the values inside it. Think of a slider of fixed length moving along a ruler: each time it advances one unit, it reports the statistics of the data it covers.

      Rolling windows make the data smoother: the fluctuation range shrinks and each value becomes more representative. A single observation may be an outlier, deviant or erroneous, whereas a windowed statistic is more regular. A small illustration of the window parameters follows.
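
      A small sketch (an addition to the original): min_periods controls how many observations the window needs before it emits a value, which removes the leading NaNs visible in the rolling-mean output below:

    import pandas as pd
    import numpy as np

    s = pd.Series(np.random.randn(20))
    full = s.rolling(window=10).mean()                    # first 9 entries are NaN
    partial = s.rolling(window=10, min_periods=1).mean()  # defined from the first entry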

    %matplotlib inline 
    import matplotlib.pylab
    import numpy as np
    import pandas as pd
    df = pd.Series(np.random.randn(600), index = pd.date_range('7/1/2016', freq = 'D', periods = 600))
    df.head()
    Result:
    2016-07-01   -0.192140
    2016-07-02    0.357953
    2016-07-03   -0.201847
    2016-07-04   -0.372230
    2016-07-05    1.414753
    Freq: D, dtype: float64
    
    r = df.rolling(window = 10)
    r#Rolling [window=10,center=False,axis=0]
    
    #r.max, r.median, r.std, r.skew (skewness), r.sum, r.var
    print(r.mean())
    Result:
    2016-07-01         NaN
    2016-07-02         NaN
    2016-07-03         NaN
    2016-07-04         NaN
    2016-07-05         NaN
    2016-07-06         NaN
    2016-07-07         NaN
    2016-07-08         NaN
    2016-07-09         NaN
    2016-07-10    0.300133
    2016-07-11    0.284780
    2016-07-12    0.252831
    2016-07-13    0.220699
    2016-07-14    0.167137
    2016-07-15    0.018593
    2016-07-16   -0.061414
    2016-07-17   -0.134593
    2016-07-18   -0.153333
    2016-07-19   -0.218928
    2016-07-20   -0.169426
    2016-07-21   -0.219747
    2016-07-22   -0.181266
    2016-07-23   -0.173674
    2016-07-24   -0.130629
    2016-07-25   -0.166730
    2016-07-26   -0.233044
    2016-07-27   -0.256642
    2016-07-28   -0.280738
    2016-07-29   -0.289893
    2016-07-30   -0.379625
                    ...   
    2018-01-22   -0.211467
    2018-01-23    0.034996
    2018-01-24   -0.105910
    2018-01-25   -0.145774
    2018-01-26   -0.089320
    2018-01-27   -0.164370
    2018-01-28   -0.110892
    2018-01-29   -0.205786
    2018-01-30   -0.101162
    2018-01-31   -0.034760
    2018-02-01    0.229333
    2018-02-02    0.043741
    2018-02-03    0.052837
    2018-02-04    0.057746
    2018-02-05   -0.071401
    2018-02-06   -0.011153
    2018-02-07   -0.045737
    2018-02-08   -0.021983
    2018-02-09   -0.196715
    2018-02-10   -0.063721
    2018-02-11   -0.289452
    2018-02-12   -0.050946
    2018-02-13   -0.047014
    2018-02-14    0.048754
    2018-02-15    0.143949
    2018-02-16    0.424823
    2018-02-17    0.361878
    2018-02-18    0.363235
    2018-02-19    0.517436
    2018-02-20    0.368020
    Freq: D, Length: 600, dtype: float64
    
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    plt.figure(figsize=(15, 5))
    
    df.plot(style='r--')
    df.rolling(window=10).mean().plot(style='b')#<matplotlib.axes._subplots.AxesSubplot at 0x249627fb6d8>
    

      Result: (figure: the original series in red dashes overlaid with its 10-day rolling mean in blue)

    Stationarity and differencing:

      Basic model: the autoregressive moving-average model ARMA(p, q) is one of the most important models in time series analysis. It consists of two parts: AR, an autoregressive process of order p, and MA, a moving-average process of order q.

    Stationarity testing

      We know that stationarity is a precondition for time series analysis, and many people wonder why it must hold. The law of large numbers and the central limit theorem require identically distributed samples (in the time series setting, identical distribution is equivalent to stationarity), and much of our modeling rests on these theorems; if stationarity fails, many of the resulting conclusions are unreliable. Take spurious regression as an example: when the response and input variables are both stationary, we use the t statistic to test the significance of the standardized coefficients. When they are non-stationary, the standardized coefficients no longer follow a t distribution, so applying the t test inflates the probability of rejecting the null hypothesis, i.e. makes a Type I error more likely, and leads to wrong conclusions.

      There are two definitions of a stationary time series: strict stationarity and weak stationarity.

      Strict stationarity, as the name suggests, is an extremely demanding form of stationarity: it requires the statistical properties of the series to remain unchanged as time passes. For any shift τ, the joint distribution must satisfy F(x_{t_1}, ..., x_{t_n}) = F(x_{t_1+τ}, ..., x_{t_n+τ}).

      Strict stationarity exists mostly in theory; in practice the weak condition is used far more often.

    Weak stationarity, also called second-order stationarity (stationary mean and variance), should satisfy (a quick programmatic check is sketched after this list):

    • constant mean
    • constant variance
    • constant autocovariance (depending only on the lag, not on time)
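
      As a supplementary sketch (not part of the original walkthrough), statsmodels also provides the Augmented Dickey-Fuller unit-root test as a numerical complement to the visual checks used below:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import adfuller

    # Null hypothesis: the series has a unit root (is non-stationary).
    # A small p-value (e.g. < 0.05) is evidence of stationarity.
    s = pd.Series(np.random.randn(200)).cumsum()  # a toy random walk
    adf_stat, p_value = adfuller(s.dropna())[:2]
    print('ADF statistic: %.4f, p-value: %.4f' % (adf_stat, p_value))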

     

      An ARIMA model requires the time series to be stationary. So when you are given a non-stationary series, the first step is to difference it until you obtain a stationary series. If d differences are needed to reach stationarity, you can use an ARIMA(p, d, q) model, where d is the number of differences.

    A second-order difference is simply a first-order difference applied on top of a first-order difference, as the sketch below shows.
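
      A minimal sketch of first- and second-order differencing in pandas (the random-walk series is made up for illustration):

    import pandas as pd
    import numpy as np

    s = pd.Series(np.random.randn(100).cumsum(),
                  index=pd.date_range('2016-01-01', periods=100))
    diff1 = s.diff()        # first-order difference: s[t] - s[t-1]
    diff2 = diff1.diff()    # second-order difference: the difference of the differences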

    %load_ext autoreload
    %autoreload 2
    %matplotlib inline
    %config InlineBackend.figure_format='retina'
    
    from __future__ import absolute_import, division, print_function
    # http://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost  prebuilt wheels for many Python libraries; almost everything can be found here
    import sys
    import os
    
    import pandas as pd
    import numpy as np
    
    # # Remote Data Access
    # import pandas_datareader.data as web
    # import datetime
    # # reference: https://pandas-datareader.readthedocs.io/en/latest/remote_data.html
    
    # TSA from Statsmodels
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import statsmodels.tsa.api as smt
    
    # Display and Plotting
    import matplotlib.pylab as plt
    import seaborn as sns
    
    pd.set_option('display.float_format', lambda x: '%.5f' % x) # pandas
    np.set_printoptions(precision=5, suppress=True) # numpy
    
    pd.set_option('display.max_columns', 100)
    pd.set_option('display.max_rows', 100)
    
    # seaborn plotting style
    sns.set(style='ticks', context='poster')
    Result:
    The autoreload extension is already loaded. To reload it, use:
      %reload_ext autoreload
    

      

    #Read the data
    # US consumer sentiment index (UMCSENT)
    Sentiment = 'data/sentiment.csv'
    Sentiment = pd.read_csv(Sentiment, index_col=0, parse_dates=[0])
    

      

    Sentiment.head()
    

      Result:

                UMCSENT
    DATE
    2000-01-01  112.00000
    2000-02-01  111.30000
    2000-03-01  107.10000
    2000-04-01  109.20000
    2000-05-01  110.70000
    # Select the series from 2005 - 2016
    sentiment_short = Sentiment.loc['2005':'2016']
    

      

    sentiment_short.plot(figsize=(12,8))
    plt.legend(bbox_to_anchor=(1.25, 0.5))
    plt.title("Consumer Sentiment")
    sns.despine()
    

      Result: (figure: line plot of UMCSENT for 2005-2016, titled "Consumer Sentiment")

    sentiment_short['diff_1'] = sentiment_short['UMCSENT'].diff(1)  # first-order difference; the 1 is the lag in time steps and can be changed
    
    sentiment_short['diff_2'] = sentiment_short['diff_1'].diff(1)  # difference again: second-order difference
    
    sentiment_short.plot(subplots=True, figsize=(18, 12))
    

      Result:

    array([<matplotlib.axes._subplots.AxesSubplot object at 0x000001D9383BACF8>,
           <matplotlib.axes._subplots.AxesSubplot object at 0x000001D939FAB6A0>,
           <matplotlib.axes._subplots.AxesSubplot object at 0x000001D93A139B70>], dtype=object)

    The ARIMA model:

    Evaluation via correlation functions:

    Choose p and q from the ACF and PACF plots: roughly, a PACF that cuts off after lag p combined with a tailing-off ACF suggests an AR(p) term, and an ACF that cuts off after lag q combined with a tailing-off PACF suggests an MA(q) term.

    Building the ARIMA model:

    del sentiment_short['diff_2']
    del sentiment_short['diff_1']
    sentiment_short.head()
    print (type(sentiment_short))#<class 'pandas.core.frame.DataFrame'>
    

      

    fig = plt.figure(figsize=(12,8))
    #acf
    ax1 = fig.add_subplot(211)
    fig = sm.graphics.tsa.plot_acf(sentiment_short, lags=20,ax=ax1)
    ax1.xaxis.set_ticks_position('bottom')
    fig.tight_layout();
    #pacf
    ax2 = fig.add_subplot(212)
    fig = sm.graphics.tsa.plot_pacf(sentiment_short, lags=20, ax=ax2)
    ax2.xaxis.set_ticks_position('bottom')
    fig.tight_layout();
    # The shaded band in the plots below is the confidence interval; observe how the (partial) autocorrelation changes with the lag in order to pick p and q
    

      Result: (figure: ACF and PACF plots with shaded confidence bands)

     

    # Lag scatter plots are another way to show autocorrelation
    
    lags=9
    
    ncols=3
    nrows=int(np.ceil(lags/ncols))
    
    fig, axes = plt.subplots(ncols=ncols, nrows=nrows, figsize=(4*ncols, 4*nrows))
    
    for ax, lag in zip(axes.flat, np.arange(1,lags+1, 1)):
        lag_str = 't-{}'.format(lag)
        X = (pd.concat([sentiment_short, sentiment_short.shift(-lag)], axis=1,
                       keys=['y'] + [lag_str]).dropna())
    
        X.plot(ax=ax, kind='scatter', y='y', x=lag_str);
        corr = X.corr().values[0][1]  # .as_matrix() was removed from pandas; .values is the replacement
        ax.set_ylabel('Original')
        ax.set_title('Lag: {} (corr={:.2f})'.format(lag_str, corr));
        ax.set_aspect('equal');
        sns.despine();
    
    fig.tight_layout();
    

      Result: (figure: a grid of lag scatter plots for lags 1-9, each annotated with its correlation)

    # A more direct view
    # A reusable template: drop in your own data and use these four plots for evaluation and analysis
    def tsplot(y, lags=None, title='', figsize=(14, 8)):
       
        fig = plt.figure(figsize=figsize)
        layout = (2, 2)
        ts_ax   = plt.subplot2grid(layout, (0, 0))
        hist_ax = plt.subplot2grid(layout, (0, 1))
        acf_ax  = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        
        y.plot(ax=ts_ax)
        ts_ax.set_title(title)
        y.plot(ax=hist_ax, kind='hist', bins=25)
        hist_ax.set_title('Histogram')
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
        [ax.set_xlim(0) for ax in [acf_ax, pacf_ax]]
        sns.despine()
        plt.tight_layout()
        return ts_ax, acf_ax, pacf_ax
    
    tsplot(sentiment_short, title='Consumer Sentiment', lags=36);
    

      Result: (figure: series plot, histogram, ACF and PACF in one panel)

    Parameter selection:

    BIC is affected by the sample; when comparing models fitted on the same sample, BIC can be used (for both AIC and BIC, lower is better).

    %load_ext autoreload
    %autoreload 2
    %matplotlib inline
    %config InlineBackend.figure_format='retina'
    
    from __future__ import absolute_import, division, print_function
    
    import sys
    import os
    
    import pandas as pd
    import numpy as np
    
    # TSA from Statsmodels
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import statsmodels.tsa.api as smt
    
    # Display and Plotting
    import matplotlib.pylab as plt
    import seaborn as sns
    
    pd.set_option('display.float_format', lambda x: '%.5f' % x) # pandas
    np.set_printoptions(precision=5, suppress=True) # numpy
    
    pd.set_option('display.max_columns', 100)
    pd.set_option('display.max_rows', 100)
    
    # seaborn plotting style
    sns.set(style='ticks', context='poster')
    

      

    filename_ts = 'data/series1.csv'
    ts_df = pd.read_csv(filename_ts, index_col=0, parse_dates=[0])
    
    n_sample = ts_df.shape[0]
    

      

    print(ts_df.shape)
    print(ts_df.head())
    Result:
    (120, 1)
                  value
    2006-06-01  0.21507
    2006-07-01  1.14225
    2006-08-01  0.08077
    2006-09-01 -0.73952
    2006-10-01  0.53552
    

      

    # Create a training sample and testing sample before analyzing the series
    
    n_train=int(0.95*n_sample)+1
    n_forecast=n_sample-n_train
    #ts_df
    ts_train = ts_df.iloc[:n_train]['value']
    ts_test = ts_df.iloc[n_train:]['value']
    print(ts_train.shape)
    print(ts_test.shape)
    print("Training Series:", "\n", ts_train.tail(), "\n")
    print("Testing Series:", "\n", ts_test.head())
    

      Result:

    (115,)
    (5,)
    Training Series: 
     2015-08-01    0.60371
    2015-09-01   -1.27372
    2015-10-01   -0.93284
    2015-11-01    0.08552
    2015-12-01    1.20534
    Name: value, dtype: float64 
    
    Testing Series: 
     2016-01-01    2.16411
    2016-02-01    0.95226
    2016-03-01    0.36485
    2016-04-01   -2.26487
    2016-05-01   -2.38168
    Name: value, dtype: float64
    def tsplot(y, lags=None, title='', figsize=(14, 8)):
        
        fig = plt.figure(figsize=figsize)
        layout = (2, 2)
        ts_ax   = plt.subplot2grid(layout, (0, 0))
        hist_ax = plt.subplot2grid(layout, (0, 1))
        acf_ax  = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        
        y.plot(ax=ts_ax)
        ts_ax.set_title(title)
        y.plot(ax=hist_ax, kind='hist', bins=25)
        hist_ax.set_title('Histogram')
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
        [ax.set_xlim(0) for ax in [acf_ax, pacf_ax]]
        sns.despine()
        fig.tight_layout()
        return ts_ax, acf_ax, pacf_ax
    

      

    tsplot(ts_train, title='A Given Training Series', lags=20);
    

      Result: (figure: the training series with its histogram, ACF and PACF)

    #Model Estimation
    
    # Fit the model
    arima200 = sm.tsa.SARIMAX(ts_train, order=(2,0,0))  # order holds the three parameters p, d, q
    model_results = arima200.fit()  # fit the model
    

      

    import itertools
    # when no single guess fits, grid-search over several (p, d, q) combinations and keep the best
    p_min = 0
    d_min = 0
    q_min = 0
    p_max = 4
    d_max = 0
    q_max = 4
    
    # Initialize a DataFrame to store the results
    results_bic = pd.DataFrame(index=['AR{}'.format(i) for i in range(p_min,p_max+1)],
                               columns=['MA{}'.format(i) for i in range(q_min,q_max+1)])
    
    for p,d,q in itertools.product(range(p_min,p_max+1),
                                   range(d_min,d_max+1),
                                   range(q_min,q_max+1)):
        if p==0 and d==0 and q==0:
            results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = np.nan
            continue
        
        try:
            model = sm.tsa.SARIMAX(ts_train, order=(p, d, q),
                                   #enforce_stationarity=False,
                                   #enforce_invertibility=False,
                                  )
            results = model.fit()
            results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = results.bic
        except Exception:
            continue
    results_bic = results_bic[results_bic.columns].astype(float)
    

      

    fig, ax = plt.subplots(figsize=(10, 8))
    ax = sns.heatmap(results_bic,
                     mask=results_bic.isnull(),
                     ax=ax,
                     annot=True,
                     fmt='.2f',
                     );
    ax.set_title('BIC');
    

      Result: (figure: heatmap of BIC values over the AR and MA orders)

    # Alternative model selection method, limited to only searching AR and MA parameters
    
    train_results = sm.tsa.arma_order_select_ic(ts_train, ic=['aic', 'bic'], trend='nc', max_ar=4, max_ma=4)
    
    print('AIC', train_results.aic_min_order)
    print('BIC', train_results.bic_min_order)
    Result (the two criteria disagree, which is awkward; further screening is needed):
    AIC (4, 2)
    BIC (1, 1)
    

    # Residual diagnostics: normality and Q-Q plot linearity
    model_results.plot_diagnostics(figsize=(16, 12));  # from the statsmodels library
    

      Result: (figure: residual diagnostics: standardized residuals, histogram, Q-Q plot and correlogram)

    Q-Q plot: the closer the points lie to a straight line, the closer the distribution is to normal; the farther from a line, the farther from normality. A direct way to draw it is sketched below.
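
      As a standalone sketch (assuming the model_results fitted above), the residual Q-Q plot can also be drawn directly with scipy:

    import scipy.stats as stats
    import matplotlib.pyplot as plt

    resid = model_results.resid
    stats.probplot(resid, dist='norm', plot=plt)  # points near the line => roughly normal residuals
    plt.show()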

    Basic steps of time series modeling:

    1. Obtain the time series data of the observed system;
    2. Plot the data and check whether the series is stationary; a non-stationary series must first be differenced d times until it becomes stationary;
    3. With the stationary series from step 2, compute its autocorrelation function (ACF) and partial autocorrelation function (PACF), and determine the best orders p and q by analyzing the ACF and PACF plots;
    4. With the p, d and q obtained above, build the ARIMA model, then start checking the fitted model (a compact sketch follows this list).
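
      A compact sketch tying the four steps together (it reuses the data/series1.csv example from above; the (1, 1, 1) order is only a placeholder to be chosen from the ACF/PACF plots):

    import pandas as pd
    import statsmodels.api as sm

    y = pd.read_csv('data/series1.csv', index_col=0, parse_dates=[0])['value']
    y_diff = y.diff().dropna()                          # step 2: difference if non-stationary
    sm.graphics.tsa.plot_acf(y_diff, lags=20)           # step 3: inspect the ACF ...
    sm.graphics.tsa.plot_pacf(y_diff, lags=20)          #         ... and PACF to pick p and q
    results = sm.tsa.SARIMAX(y, order=(1, 1, 1)).fit()  # step 4: fit ARIMA(p, d, q)
    print(results.summary())                            # then run model diagnostics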

    Stock prediction (a regression task):

    %matplotlib inline
    import pandas as pd
    import pandas_datareader  # for fetching stock data from Yahoo Finance
    import datetime
    import matplotlib.pylab as plt
    import seaborn as sns
    from matplotlib.pylab import style
    from statsmodels.tsa.arima_model import ARIMA
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
    
    style.use('ggplot')    
    plt.rcParams['font.sans-serif'] = ['SimHei'] 
    plt.rcParams['axes.unicode_minus'] = False  
    

      

    stockFile = 'data/T10yr.csv'
    stock = pd.read_csv(stockFile, index_col=0, parse_dates=[0])  # use the date column as the index; parse_dates normalizes the dates to a standard format
    stock.head(10)
    

      Result:

                 Open   High    Low  Close  Volume  Adj Close
    Date
    2000-01-03  6.498  6.603  6.498  6.548       0      6.548
    2000-01-04  6.530  6.548  6.485  6.485       0      6.485
    2000-01-05  6.521  6.599  6.508  6.599       0      6.599
    2000-01-06  6.558  6.585  6.540  6.549       0      6.549
    2000-01-07  6.545  6.595  6.504  6.504       0      6.504
    2000-01-10  6.540  6.567  6.536  6.558       0      6.558
    2000-01-11  6.600  6.664  6.595  6.664       0      6.664
    2000-01-12  6.659  6.696  6.645  6.696       0      6.696
    2000-01-13  6.664  6.705  6.618  6.618       0      6.618
    2000-01-14  6.623  6.688  6.563  6.674       0      6.674
    stock_week = stock['Close'].resample('W-MON').mean()  # weekly mean of the close, weeks anchored on Monday
    stock_train = stock_week['2000':'2015']
    

      

    stock_train.plot(figsize=(12,8))
    plt.legend(bbox_to_anchor=(1.25, 0.5))
    plt.title("Stock Close")
    sns.despine()
    

      Result: (figure: weekly close series for 2000-2015, titled "Stock Close")

    stock_diff = stock_train.diff()
    stock_diff = stock_diff.dropna()
    
    plt.figure()
    plt.plot(stock_diff)
    plt.title('First-order difference')
    plt.show()
    

      Result: (figure: the first-order differenced series)

    acf = plot_acf(stock_diff, lags=20)
    plt.title("ACF")
    acf.show()
    

      Result: (figure: ACF of the differenced series)

    pacf = plot_pacf(stock_diff, lags=20)
    plt.title("PACF")
    pacf.show()
    

      Result: (figure: PACF of the differenced series)

    model = ARIMA(stock_train, order=(1, 1, 1),freq='W-MON')
    

      

    result = model.fit()
    #print(result.summary())  # summary statistics of the fitted ARIMA model
    

      

    pred = result.predict('20140609', '20160701', dynamic=True, typ='levels')  # forecast between the given start and end; the start must lie within the original data, the end may extend beyond it
    print (pred)
    Result:
    2014-06-09    2.463559
    2014-06-16    2.455539
    2014-06-23    2.449569
    2014-06-30    2.444183
    2014-07-07    2.438962
    2014-07-14    2.433788
    2014-07-21    2.428627
    2014-07-28    2.423470
    2014-08-04    2.418315
    2014-08-11    2.413159
    2014-08-18    2.408004
    2014-08-25    2.402849
    2014-09-01    2.397693
    2014-09-08    2.392538
    2014-09-15    2.387383
    2014-09-22    2.382227
    2014-09-29    2.377072
    2014-10-06    2.371917
    2014-10-13    2.366761
    2014-10-20    2.361606
    2014-10-27    2.356451
    2014-11-03    2.351296
    2014-11-10    2.346140
    2014-11-17    2.340985
    2014-11-24    2.335830
    2014-12-01    2.330674
    2014-12-08    2.325519
    2014-12-15    2.320364
    2014-12-22    2.315208
    2014-12-29    2.310053
                    ...   
    2015-12-07    2.057443
    2015-12-14    2.052288
    2015-12-21    2.047132
    2015-12-28    2.041977
    2016-01-04    2.036822
    2016-01-11    2.031666
    2016-01-18    2.026511
    2016-01-25    2.021356
    2016-02-01    2.016200
    2016-02-08    2.011045
    2016-02-15    2.005890
    2016-02-22    2.000735
    2016-02-29    1.995579
    2016-03-07    1.990424
    2016-03-14    1.985269
    2016-03-21    1.980113
    2016-03-28    1.974958
    2016-04-04    1.969803
    2016-04-11    1.964647
    2016-04-18    1.959492
    2016-04-25    1.954337
    2016-05-02    1.949181
    2016-05-09    1.944026
    2016-05-16    1.938871
    2016-05-23    1.933716
    2016-05-30    1.928560
    2016-06-06    1.923405
    2016-06-13    1.918250
    2016-06-20    1.913094
    2016-06-27    1.907939
    Freq: W-MON, Length: 108, dtype: float64
    

      

    plt.figure(figsize=(6, 6))
    plt.xticks(rotation=45)
    plt.plot(pred)
    plt.plot(stock_train)#[<matplotlib.lines.Line2D at 0x28025665278>]
    

      Result: (figure: the forecast plotted together with the training series)

    Classification with the tsfresh library:

    tsfresh is an open-source Python package for extracting features from time series data. It can extract more than 64 kinds of features and is something of a Swiss Army knife for the task; consult the official documentation when you use it.

    %matplotlib inline
    import matplotlib.pylab as plt
    import seaborn as sns
    from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, load_robot_execution_failures
    from tsfresh import extract_features, extract_relevant_features, select_features
    from tsfresh.utilities.dataframe_functions import impute
    from tsfresh.feature_extraction import ComprehensiveFCParameters
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation has been removed from newer scikit-learn
    from sklearn.metrics import classification_report
    
    
    #http://tsfresh.readthedocs.io/en/latest/text/quick_start.html  # official docs
    

      

    download_robot_execution_failures()
    df, y = load_robot_execution_failures()
    df.head()
    

      Result:

    id    time    a    b    c    d    e    f
    0    1    0    -1    -1    63    -3    -1    0
    1    1    1    0    0    62    -3    -1    0
    2    1    2    -1    -1    61    -3    0    0
    3    1    3    -1    -1    63    -2    -1    0
    4    1    4    -1    -1    63    -3    -1    0
    df[df.id == 3][['time', 'a', 'b', 'c', 'd', 'e', 'f']].plot(x='time', title='Success example (id 3)', figsize=(12, 6));
    df[df.id == 20][['time', 'a', 'b', 'c', 'd', 'e', 'f']].plot(x='time', title='Failure example (id 20)', figsize=(12, 6));
    

      Result: (figures: sensor traces for a success example, id 3, and a failure example, id 20)

    extraction_settings = ComprehensiveFCParameters()  # settings that control which features are extracted
    

      

    #column_id (str) – The name of the id column to group by
    #column_sort (str) – The name of the sort column.
    X = extract_features(df, 
                         column_id='id', column_sort='time',  # group by id, sort by time
                         default_fc_parameters=extraction_settings,
                         impute_function= impute)
    

      

    X.head()  # the extracted features
    

      Result:

    a__mean_abs_change_quantiles__qh_1.0__ql_0.8    a__percentage_of_reoccurring_values_to_all_values    a__mean_abs_change_quantiles__qh_1.0__ql_0.2    a__mean_abs_change_quantiles__qh_1.0__ql_0.0    a__large_standard_deviation__r_0.45    a__absolute_sum_of_changes    a__mean_abs_change_quantiles__qh_1.0__ql_0.4    a__mean_second_derivate_central    a__autocorrelation__lag_4    a__binned_entropy__max_bins_10    ...    f__fft_coefficient__coeff_0    f__fft_coefficient__coeff_1    f__fft_coefficient__coeff_2    f__fft_coefficient__coeff_3    f__fft_coefficient__coeff_4    f__fft_coefficient__coeff_5    f__fft_coefficient__coeff_6    f__fft_coefficient__coeff_7    f__fft_coefficient__coeff_8    f__fft_coefficient__coeff_9
    id                                                                                    
    1    0.142857    0.933333    0.142857    0.142857    0.0    2.0    0.142857    -0.038462    0.17553    0.244930    ...    0.0    0.000000    0.000000    0.000000    0.000000    0.0    0.000000    0.000000    0.0    0.0
    2    0.000000    1.000000    0.400000    1.000000    0.0    14.0    0.400000    -0.038462    0.17553    0.990835    ...    -4.0    0.744415    1.273659    -0.809017    1.373619    0.5    0.309017    -1.391693    0.0    0.0
    3    0.000000    0.933333    0.714286    0.714286    0.0    10.0    0.714286    -0.038462    0.17553    0.729871    ...    -4.0    -0.424716    0.878188    1.000000    1.851767    0.5    1.000000    -2.805239    0.0    0.0
    4    0.000000    1.000000    0.800000    1.214286    0.0    17.0    0.800000    -0.038462    0.17553    1.322950    ...    -5.0    -1.078108    3.678858    -3.618034    -1.466977    -0.5    -1.381966    -0.633773    0.0    0.0
    5    2.000000    0.866667    0.916667    0.928571    0.0    13.0    0.916667    0.038462    0.17553    1.020037    ...    -2.0    -3.743460    3.049653    -0.618034    1.198375    -0.5    1.618034    -0.004568    0.0    0.0
    5 rows × 1332 columns
    X.info()
    # Result:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 88 entries, 1 to 88
    Columns: 1332 entries, a__mean_abs_change_quantiles__qh_1.0__ql_0.8 to f__fft_coefficient__coeff_9
    dtypes: float64(1332)
    memory usage: 916.4 KB
    

      

    X_filtered = extract_relevant_features(df, y, 
                                           column_id='id', column_sort='time', 
                                           default_fc_parameters=extraction_settings)  # feature filtering: keep only the most relevant features; see the official docs for details
    

      

    X_filtered.head()  # the filtered features
    

      Result:

    a__abs_energy    a__range_count__max_1__min_-1    b__abs_energy    e__variance    e__standard_deviation    e__abs_energy    c__standard_deviation    c__variance    a__standard_deviation    a__variance    ...    b__has_duplicate_max    b__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_14__w_5    b__cwt_coefficients__widths_(2, 5, 10, 20)__coeff_13__w_2    e__quantile__q_0.1    a__ar_coefficient__k_10__coeff_1    a__quantile__q_0.2    b__quantile__q_0.7    f__large_standard_deviation__r_0.35    f__quantile__q_0.9    d__spkt_welch_density__coeff_5
    id                                                                                    
    1    14.0    15.0    13.0    0.222222    0.471405    10.0    1.203698    1.448889    0.249444    0.062222    ...    1.0    -0.751682    -0.310265    -1.0    0.125000    -1.0    -1.0    0.0    0.0    0.037795
    2    25.0    13.0    76.0    4.222222    2.054805    90.0    4.333846    18.782222    0.956847    0.915556    ...    1.0    0.057818    -0.202951    -3.6    -0.078829    -1.0    -1.0    1.0    0.0    0.319311
    3    12.0    14.0    40.0    3.128889    1.768867    103.0    4.616877    21.315556    0.596285    0.355556    ...    0.0    0.912474    0.539121    -4.0    0.084836    -1.0    0.0    1.0    0.0    9.102780
    4    16.0    10.0    60.0    7.128889    2.669998    124.0    3.833188    14.693333    0.952190    0.906667    ...    0.0    -0.609735    -2.641390    -4.6    0.003108    -1.0    1.0    0.0    0.0    56.910262
    5    17.0    13.0    46.0    4.160000    2.039608    180.0    4.841487    23.440000    0.879394    0.773333    ...    0.0    0.072771    0.591927    -5.0    0.087906    -1.0    0.8    0.0    0.6    22.841805
    5 rows × 300 columns
    X_filtered.info()
    

      Result:

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 88 entries, 1 to 88
    Columns: 300 entries, a__abs_energy to d__spkt_welch_density__coeff_5
    dtypes: float64(300)
    memory usage: 206.9 KB
    X_train, X_test, X_filtered_train, X_filtered_test, y_train, y_test = train_test_split(X, X_filtered, y, test_size=.4)
    

      

    cl = DecisionTreeClassifier()
    cl.fit(X_train, y_train)
    print(classification_report(y_test, cl.predict(X_test)))  # evaluate the model; the result looks quite good
    

      Result:

    precision    recall  f1-score   support
    
              0       1.00      0.89      0.94         9
              1       0.96      1.00      0.98        27
    
    avg / total       0.97      0.97      0.97        36
    cl.n_features_#1332
    

      

    cl2 = DecisionTreeClassifier()
    cl2.fit(X_filtered_train, y_train)
    print(classification_report(y_test, cl2.predict(X_filtered_test)))
    

      Result:

                 precision    recall  f1-score   support
    
              0       1.00      0.78      0.88         9
              1       0.93      1.00      0.96        27
    
    avg / total       0.95      0.94      0.94        36
    cl2.n_features_#300
    

    Wikipedia page-views EDA

    The goal of exploratory data analysis (EDA) is to maximize intuition about the data, and the only way to do that is to present it in various forms using statistical graphics. EDA lets you:
    1. get an intuitive picture of the data;
    2. discover latent structure;
    3. extract the important variables;
    4. handle outliers;
    5. test statistical hypotheses;
    6. build preliminary models;
    7. decide the optimal factor settings.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import re
    %matplotlib inline
    

      

    train = pd.read_csv('train_1.csv').fillna(0)
    train.head()
    

      Result:

    Page    2015-07-01    2015-07-02    2015-07-03    2015-07-04    2015-07-05    2015-07-06    2015-07-07    2015-07-08    2015-07-09    ...    2016-12-22    2016-12-23    2016-12-24    2016-12-25    2016-12-26    2016-12-27    2016-12-28    2016-12-29    2016-12-30    2016-12-31
    0    2NE1_zh.wikipedia.org_all-access_spider    18.0    11.0    5.0    13.0    14.0    9.0    9.0    22.0    26.0    ...    32.0    63.0    15.0    26.0    14.0    20.0    22.0    19.0    18.0    20.0
    1    2PM_zh.wikipedia.org_all-access_spider    11.0    14.0    15.0    18.0    11.0    13.0    22.0    11.0    10.0    ...    17.0    42.0    28.0    15.0    9.0    30.0    52.0    45.0    26.0    20.0
    2    3C_zh.wikipedia.org_all-access_spider    1.0    0.0    1.0    1.0    0.0    4.0    0.0    3.0    4.0    ...    3.0    1.0    1.0    7.0    4.0    4.0    6.0    3.0    4.0    17.0
    3    4minute_zh.wikipedia.org_all-access_spider    35.0    13.0    10.0    94.0    4.0    26.0    14.0    9.0    11.0    ...    32.0    10.0    26.0    27.0    16.0    11.0    17.0    19.0    10.0    11.0
    4    52_Hz_I_Love_You_zh.wikipedia.org_all-access_s...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    ...    48.0    9.0    25.0    13.0    3.0    11.0    27.0    13.0    36.0    10.0
    5 rows × 551 columns
    train.info()
    Result:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 145063 entries, 0 to 145062
    Columns: 551 entries, Page to 2016-12-31
    dtypes: float64(550), object(1)
    memory usage: 609.8+ MB
    

      

    for col in train.columns[1:]:
        train[col] = pd.to_numeric(train[col], downcast='integer')  # float64 is memory-hungry; the table above shows the values have no fractional part, so downcast to integers
    train.head()
    

      Result:

    Page    2015-07-01    2015-07-02    2015-07-03    2015-07-04    2015-07-05    2015-07-06    2015-07-07    2015-07-08    2015-07-09    ...    2016-12-22    2016-12-23    2016-12-24    2016-12-25    2016-12-26    2016-12-27    2016-12-28    2016-12-29    2016-12-30    2016-12-31
    0    2NE1_zh.wikipedia.org_all-access_spider    18    11    5    13    14    9    9    22    26    ...    32    63    15    26    14    20    22    19    18    20
    1    2PM_zh.wikipedia.org_all-access_spider    11    14    15    18    11    13    22    11    10    ...    17    42    28    15    9    30    52    45    26    20
    2    3C_zh.wikipedia.org_all-access_spider    1    0    1    1    0    4    0    3    4    ...    3    1    1    7    4    4    6    3    4    17
    3    4minute_zh.wikipedia.org_all-access_spider    35    13    10    94    4    26    14    9    11    ...    32    10    26    27    16    11    17    19    10    11
    4    52_Hz_I_Love_You_zh.wikipedia.org_all-access_s...    0    0    0    0    0    0    0    0    0    ...    48    9    25    13    3    11    27    13    36    10
    5 rows × 551 columns
    train.info()
    Result:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 145063 entries, 0 to 145062
    Columns: 551 entries, Page to 2016-12-31
    dtypes: int32(550), object(1)
    memory usage: 305.5+ MB
    

      

    def get_language(page):  # classify pages by language
        res = re.search(r'[a-z][a-z]\.wikipedia\.org', page)
        #print (res.group()[0:2])
        if res:
            return res.group()[0:2]
        return 'na'
    
    train['lang'] = train.Page.map(get_language)
    
    from collections import Counter
    
    print(Counter(train.lang))
    

      Result: Counter({'en': 24108, 'ja': 20431, 'de': 18547, 'na': 17855, 'fr': 17802, 'zh': 17229, 'ru': 15022, 'es': 14069})

    lang_sets = {}
    lang_sets['en'] = train[train.lang=='en'].iloc[:,0:-1]
    lang_sets['ja'] = train[train.lang=='ja'].iloc[:,0:-1]
    lang_sets['de'] = train[train.lang=='de'].iloc[:,0:-1]
    lang_sets['na'] = train[train.lang=='na'].iloc[:,0:-1]
    lang_sets['fr'] = train[train.lang=='fr'].iloc[:,0:-1]
    lang_sets['zh'] = train[train.lang=='zh'].iloc[:,0:-1]
    lang_sets['ru'] = train[train.lang=='ru'].iloc[:,0:-1]
    lang_sets['es'] = train[train.lang=='es'].iloc[:,0:-1]
    
    sums = {}
    for key in lang_sets:
        sums[key] = lang_sets[key].iloc[:,1:].sum(axis=0) / lang_sets[key].shape[0]
    

      

    days = [r for r in range(sums['en'].shape[0])]
    
    fig = plt.figure(1,figsize=[10,10])
    plt.ylabel('Views per Page')
    plt.xlabel('Day')
    plt.title('Pages in Different Languages')
    labels={'en':'English','ja':'Japanese','de':'German',
            'na':'Media','fr':'French','zh':'Chinese',
            'ru':'Russian','es':'Spanish'
           }
    
    for key in sums:
        plt.plot(days,sums[key],label = labels[key] )
        
    plt.legend()
    plt.show()
    

      Result: (figure: average daily views per page for each language)

    def plot_entry(key,idx):
        data = lang_sets[key].iloc[idx,1:]
        fig = plt.figure(1,figsize=(10,5))
        plt.plot(days,data)
        plt.xlabel('day')
        plt.ylabel('views')
        plt.title(train.iloc[lang_sets[key].index[idx],0])
        
        plt.show()
    

      

    idx = [1, 5, 10, 50, 100, 250,500, 750,1000,1500,2000,3000,4000,5000]
    for i in idx:  # plot each selected English entry
        plot_entry('en',i)
    

      Result: (figures: daily-view plots for the selected English pages)

    npages = 5
    top_pages = {}
    for key in lang_sets:
        print(key)
        sum_set = pd.DataFrame(lang_sets[key][['Page']])
        sum_set['total'] = lang_sets[key].sum(axis=1)
        sum_set = sum_set.sort_values('total',ascending=False)
        print(sum_set.head(10))
        top_pages[key] = sum_set.index[0]  # keep the index label of the most-viewed page
    

      Result:

    zh
                                                         Page      total
    28727   Wikipedia:首页_zh.wikipedia.org_all-access_all-a...  123694312
    61350    Wikipedia:首页_zh.wikipedia.org_desktop_all-agents   66435641
    105844  Wikipedia:首页_zh.wikipedia.org_mobile-web_all-a...   50887429
    28728   Special:搜索_zh.wikipedia.org_all-access_all-agents   48678124
    61351      Special:搜索_zh.wikipedia.org_desktop_all-agents   48203843
    28089   Running_Man_zh.wikipedia.org_all-access_all-ag...   11485845
    30960   Special:链接搜索_zh.wikipedia.org_all-access_all-a...   10320403
    63510    Special:链接搜索_zh.wikipedia.org_desktop_all-agents   10320336
    60711     Running_Man_zh.wikipedia.org_desktop_all-agents    7968443
    30446    瑯琊榜_(電視劇)_zh.wikipedia.org_all-access_all-agents    5891589
    
    
    
    fr
                                                         Page      total
    27330   Wikipédia:Accueil_principal_fr.wikipedia.org_a...  868480667
    55104   Wikipédia:Accueil_principal_fr.wikipedia.org_m...  611302821
    7344    Wikipédia:Accueil_principal_fr.wikipedia.org_d...  239589012
    27825   Spécial:Recherche_fr.wikipedia.org_all-access_...   95666374
    8221    Spécial:Recherche_fr.wikipedia.org_desktop_all...   88448938
    26500   Sp?cial:Search_fr.wikipedia.org_all-access_all...   76194568
    6978    Sp?cial:Search_fr.wikipedia.org_desktop_all-ag...   76185450
    131296  Wikipédia:Accueil_principal_fr.wikipedia.org_a...   63860799
    26993   Organisme_de_placement_collectif_en_valeurs_mo...   36647929
    7213    Organisme_de_placement_collectif_en_valeurs_mo...   36624145
    
    
    
    ru
                                                         Page       total
    99322   Заглавная_страница_ru.wikipedia.org_all-access...  1086019452
    103123  Заглавная_страница_ru.wikipedia.org_desktop_al...   742880016
    17670   Заглавная_страница_ru.wikipedia.org_mobile-web...   327930433
    99537   Служебная:Поиск_ru.wikipedia.org_all-access_al...   103764279
    103349  Служебная:Поиск_ru.wikipedia.org_desktop_all-a...    98664171
    100414  Служебная:Ссылки_сюда_ru.wikipedia.org_all-acc...    25102004
    104195  Служебная:Ссылки_сюда_ru.wikipedia.org_desktop...    25058155
    97670   Special:Search_ru.wikipedia.org_all-access_all...    24374572
    101457  Special:Search_ru.wikipedia.org_desktop_all-ag...    21958472
    98301   Служебная:Вход_ru.wikipedia.org_all-access_all...    12162587
    
    
    
    ja
                                                         Page      total
    120336      メインページ_ja.wikipedia.org_all-access_all-agents  210753795
    86431          メインページ_ja.wikipedia.org_desktop_all-agents  134147415
    123025       特別:検索_ja.wikipedia.org_all-access_all-agents   70316929
    89202           特別:検索_ja.wikipedia.org_desktop_all-agents   69215206
    57309       メインページ_ja.wikipedia.org_mobile-web_all-agents   66459122
    119609    特別:最近の更新_ja.wikipedia.org_all-access_all-agents   17662791
    88897        特別:最近の更新_ja.wikipedia.org_desktop_all-agents   17627621
    119625        真田信繁_ja.wikipedia.org_all-access_all-agents   10793039
    123292  特別:外部リンク検索_ja.wikipedia.org_all-access_all-agents   10331191
    89463      特別:外部リンク検索_ja.wikipedia.org_desktop_all-agents   10327917
    
    
    
    es
                                                         Page      total
    92205   Wikipedia:Portada_es.wikipedia.org_all-access_...  751492304
    95855   Wikipedia:Portada_es.wikipedia.org_mobile-web_...  565077372
    90810   Especial:Buscar_es.wikipedia.org_all-access_al...  194491245
    71199   Wikipedia:Portada_es.wikipedia.org_desktop_all...  165439354
    69939   Especial:Buscar_es.wikipedia.org_desktop_all-a...  160431271
    94389   Especial:Buscar_es.wikipedia.org_mobile-web_al...   34059966
    90813   Especial:Entrar_es.wikipedia.org_all-access_al...   33983359
    143440  Wikipedia:Portada_es.wikipedia.org_all-access_...   31615409
    93094   Lali_Espósito_es.wikipedia.org_all-access_all-...   26602688
    69942   Especial:Entrar_es.wikipedia.org_desktop_all-a...   25747141
    
    
    
    en
                                                        Page        total
    38573   Main_Page_en.wikipedia.org_all-access_all-agents  12066181102
    9774       Main_Page_en.wikipedia.org_desktop_all-agents   8774497458
    74114   Main_Page_en.wikipedia.org_mobile-web_all-agents   3153984882
    39180  Special:Search_en.wikipedia.org_all-access_all...   1304079353
    10403  Special:Search_en.wikipedia.org_desktop_all-ag...   1011847748
    74690  Special:Search_en.wikipedia.org_mobile-web_all...    292162839
    39172  Special:Book_en.wikipedia.org_all-access_all-a...    133993144
    10399   Special:Book_en.wikipedia.org_desktop_all-agents    133285908
    33644       Main_Page_en.wikipedia.org_all-access_spider    129020407
    34257  Special:Search_en.wikipedia.org_all-access_spider    124310206
    
    
    
    na
                                                        Page     total
    45071  Special:Search_commons.wikimedia.org_all-acces...  67150638
    81665  Special:Search_commons.wikimedia.org_desktop_a...  63349756
    45056  Special:CreateAccount_commons.wikimedia.org_al...  53795386
    45028  Main_Page_commons.wikimedia.org_all-access_all...  52732292
    81644  Special:CreateAccount_commons.wikimedia.org_de...  48061029
    81610  Main_Page_commons.wikimedia.org_desktop_all-ag...  39160923
    46078  Special:RecentChangesLinked_commons.wikimedia....  28306336
    45078  Special:UploadWizard_commons.wikimedia.org_all...  23733805
    81671  Special:UploadWizard_commons.wikimedia.org_des...  22008544
    82680  Special:RecentChangesLinked_commons.wikimedia....  21915202
    
    
    
    de
                                                         Page       total
    139119  Wikipedia:Hauptseite_de.wikipedia.org_all-acce...  1603934248
    116196  Wikipedia:Hauptseite_de.wikipedia.org_mobile-w...  1112689084
    67049   Wikipedia:Hauptseite_de.wikipedia.org_desktop_...   426992426
    140151  Spezial:Suche_de.wikipedia.org_all-access_all-...   223425944
    66736   Spezial:Suche_de.wikipedia.org_desktop_all-agents   219636761
    140147  Spezial:Anmelden_de.wikipedia.org_all-access_a...    40291806
    138800  Special:Search_de.wikipedia.org_all-access_all...    39881543
    68104   Spezial:Anmelden_de.wikipedia.org_desktop_all-...    35355226
    68511   Special:MyPage/toolserverhelferleinconfig.js_d...    32584955
    137765  Hauptseite_de.wikipedia.org_all-access_all-agents    31732458
    for key in top_pages:
        fig = plt.figure(1,figsize=(10,5))
        cols = train.columns
        cols = cols[1:-1]
        data = train.loc[top_pages[key],cols]
        plt.plot(days,data)
        plt.xlabel('Days')
        plt.ylabel('Views')
        plt.title(train.loc[top_pages[key],'Page'])
        plt.show()
    

      Result: (figures: daily views of the most-viewed page in each language)

      
