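We want to run anomaly detection on the data below. The two points circled in red are clearly the anomalies.
1. Anomaly detection using the raw indicator value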
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# Load the data and forward-fill missing values
df = pd.read_csv(r'indicator.csv', sep=',')
df = df.ffill()
# Replace the timestamp column with an integer index (120 rows in the test data)
df['time'] = range(len(df))
plt.figure()
plt.scatter(df['time'], df['indicator'])
plt.show()

# Detect using the raw value only
# reshape(-1, 1) turns the values into a single column
data = np.array(df['indicator']).reshape(-1, 1)
# Fit a one-class SVM
algorithm = svm.OneClassSVM(nu=0.5, kernel='rbf', gamma=0.1)
model = algorithm.fit(data)
pre_y = model.predict(data)

# Plot the result: predict() returns 1 for normal points and -1 for anomalies
df1 = df.copy()
df1['clazz'] = pre_y
df2 = df1[df1['clazz'] == 1]
df3 = df1[df1['clazz'] == -1]
plt.figure()
plt.scatter(df2['time'], df2['indicator'])
plt.scatter(df3['time'], df3['indicator'])
plt.show()
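With OneClassSVM the result is as follows: the anomalies are not detected, and normal points are flagged as anomalies instead.
The obvious gap is that we are dealing with a time series but have ignored time altogether, so the next step is to bring time into the detection.
2. Anomaly detection using the raw indicator value plus time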
# Detect using the raw value plus time (two features per row)
data = df[['time', 'indicator']].to_numpy()
# Fit a one-class SVM
algorithm = svm.OneClassSVM(nu=0.5, kernel='rbf', gamma=0.1)
model = algorithm.fit(data)
pre_y = model.predict(data)

# Plot the result: predict() returns 1 for normal points and -1 for anomalies
df1 = df.copy()
df1['clazz'] = pre_y
df2 = df1[df1['clazz'] == 1]
df3 = df1[df1['clazz'] == -1]
plt.figure()
plt.scatter(df2['time'], df2['indicator'])
plt.scatter(df3['time'], df3['indicator'])
plt.show()
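With OneClassSVM the result is as follows: the anomalies are now detected, but some normal points are still flagged as anomalies. (The screenshot was cropped badly and has not been redone.)
Take mobile data usage as an example. Usage is low during working hours and high at lunchtime or in the evening. If we look only at the raw value, the high-usage moments at noon or the low-usage moments during work may well be flagged as anomalies, even though these points are in fact normal.
In many cases the indicator changes continuously, as with traffic rate, website visit rate, or CPU usage. We can therefore run anomaly detection on the first difference (the rate of change of the indicator) or on the second difference.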
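As a minimal sketch of what first and second differences look like, the example below uses a short made-up series (not the article's data) that rises by 1 per step and contains one spike. It is written in plain Python so that it has no dependencies; `numpy.diff` computes the same thing. The helper name `diff` and the sample values are assumptions for illustration only.

```python
def diff(values, order=1):
    """Return the order-th difference of a sequence (same idea as numpy.diff)."""
    out = list(values)
    for _ in range(order):
        # Subtract each element from its successor; the result is one item shorter.
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Hypothetical indicator: steady growth of 1 per step, with a spike at index 4.
series = [1, 2, 3, 4, 20, 6, 7]

first = diff(series)             # rate of change
second = diff(series, order=2)   # change of the rate of change

print(first)   # [1, 1, 1, 16, -14, 1]
print(second)  # [0, 0, 15, -30, 15]
```

The steady part of the series collapses to a constant in the first difference, while the spike produces two large values of opposite sign, which is what makes the differenced series easier to separate than the raw one.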
3. Anomaly detection using the first difference of the indicator
# Detect using the first difference of the indicator
data = np.array(df['indicator']).reshape(-1, 1)
# x[t] - x[t-1]; the result has one row fewer than the input
data = np.diff(data, axis=0)
# Fit a one-class SVM
algorithm = svm.OneClassSVM(nu=0.5, kernel='rbf', gamma=0.1)
model = algorithm.fit(data)
pre_y = model.predict(data)

# Plot the result; the first row has no difference, so drop it.
# copy() avoids assigning into a slice of df.
df1 = df[1:].copy()
df1['clazz'] = pre_y
df2 = df1[df1['clazz'] == 1]
df3 = df1[df1['clazz'] == -1]
plt.figure()
plt.scatter(df2['time'], df2['indicator'])
plt.scatter(df3['time'], df3['indicator'])
plt.show()
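The detection result now matches our expectation.
The data above is too symmetric to be realistic, so we modify it slightly. In the modified data below, the two points circled in red are the anomalies.
We run first-difference detection again, with the following result.
The anomalies are detected, but quite a few normal points are flagged as well. In practice, many time series are seasonal: different seasons within one cycle behave differently, and that is normal. Plants, for example, grow quickly in spring and summer and slowly in autumn and winter. Slow growth in autumn and winter is not an anomaly; only slow growth during spring and summer is.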
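So anomaly detection on time series has to take periodicity into account. The indicators we monitor usually follow a daily cycle, but how do we check whether a given series is periodic? One way is to compute its autocorrelation coefficient and use it as a measure of how strong the periodicity is. The autocorrelation coefficient is defined as

$$r_k = \frac{\gamma_k}{\gamma_0}$$

where $k$ is the candidate period (the lag), $\gamma_k$ is the covariance between the series and a copy of itself shifted by $k$ time points, and in particular $\gamma_0$ is the variance. The autocorrelation coefficient lies in [-1, 1]. The closer it is to 1, the stronger the autocorrelation at lag $k$ and the stronger the periodicity.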
4. Computing the autocorrelation coefficient
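# Autocorrelation coefficient: rk = yk / y0
u = df['indicator'].mean()
# Use NumPy arrays so the products below are positional.
# Recent pandas versions align two Series on their index inside np.multiply,
# which would pair each value with itself instead of with its lag-24 partner.
x = df['indicator'].to_numpy()
s1 = x[:-24] - u   # x[t] - mean
s2 = x[24:] - u    # x[t + 24] - mean
y0 = np.sum(s1 * s1)
yk = np.sum(s1 * s2)
rk = yk / y0
print(rk)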
With a period of 24, the computed autocorrelation coefficient is 0.9706564686677905, which indicates that the data has a period of 24.
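When the period is not known in advance, one option is to compute the coefficient for a range of candidate lags and pick the largest. The sketch below does this in plain Python on an anomaly-free copy of the daily shape from the test data, repeated for five days. The helper name `autocorr` and the lag range 2 to 36 are assumptions chosen for illustration; the formula is the same rk = yk / y0 used above.

```python
def autocorr(values, lag):
    """Lag-k autocorrelation r_k = y_k / y_0, following the computation above."""
    mean = sum(values) / len(values)
    head = [v - mean for v in values[:-lag]]   # x[t] - mean
    tail = [v - mean for v in values[lag:]]    # x[t + lag] - mean
    y0 = sum(h * h for h in head)
    yk = sum(h * t for h, t in zip(head, tail))
    return yk / y0


# Daily shape of the test data (1..19, then 15, 11, 7, 3, 1), repeated 5 times.
# The two anomalous readings are deliberately left out of this clean copy.
day = list(range(1, 20)) + [15, 11, 7, 3, 1]
series = day * 5

# Score every candidate lag up to a day and a half and keep the best one.
scores = {lag: autocorr(series, lag) for lag in range(2, 37)}
best_lag = max(scores, key=scores.get)

print(best_lag, scores[best_lag])
```

On this perfectly repeating series the best lag is 24 with a coefficient of 1. On real data the peak is lower and noisier, so it is worth looking at the whole `scores` curve rather than only its maximum.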
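5. Anomaly detection using the period-over-period (seasonal) difference
For a seasonal time series, we compare each point with the data from the corresponding season. The data above has a period of 24, so we take the difference between the current point and the point 24 time steps earlier, and then run anomaly detection on that difference.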
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

# Load the data and forward-fill missing values
df = pd.read_csv(r'indicator.csv', sep=',')
df = df.ffill()
df['time'] = range(len(df))

# Autocorrelation coefficient: rk = yk / y0
u = df['indicator'].mean()
# NumPy arrays keep the products positional rather than index-aligned
x = df['indicator'].to_numpy()
s1 = x[:-24] - u
s2 = x[24:] - u
y0 = np.sum(s1 * s1)
yk = np.sum(s1 * s2)
rk = yk / y0
print(rk)

# reshape(-1, 1) turns the values into a single column
data = np.array(df['indicator']).reshape(-1, 1)
# Seasonal difference: x[t] - x[t - 24]
data = data[24:] - data[:-24]

algorithm = svm.OneClassSVM(nu=0.5, kernel='rbf', gamma=0.1)
model = algorithm.fit(data)
pre_y = model.predict(data)

# The first 24 rows have no reference point, so drop them before plotting
df1 = df[24:].copy()
df1['clazz'] = pre_y
df2 = df1[df1['clazz'] == 1]
df3 = df1[df1['clazz'] == -1]
plt.figure()
plt.scatter(df2['time'], df2['indicator'])
plt.scatter(df3['time'], df3['indicator'])
plt.show()
The detection result matches our expectation.
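To see what the seasonal difference does to this particular data set, the sketch below rebuilds the indicator column of the test data in plain Python and lists every point whose lag-24 difference is non-zero. It uses a simple non-zero check instead of OneClassSVM, which is an assumption made only to keep the example dependency-free; the constant name `PERIOD` and the variable names are likewise illustrative. With pandas, `Series.diff(24)` gives the same differences.

```python
PERIOD = 24

# Rebuild the indicator column of the test data: one daily shape, five days,
# then the two anomalous readings (values copied from the test data).
day = list(range(1, 20)) + [15, 11, 7, 3, 1]
series = day * 5
series[60] = 4    # 2018-11-02-04 12:00:00, normally 13
series[116] = 8   # 2018-11-02-06 20:00:00, normally 11

# Seasonal difference x[t] - x[t - PERIOD], defined from t = PERIOD onward.
seasonal_diff = {
    t: series[t] - series[t - PERIOD] for t in range(PERIOD, len(series))
}

# On this noise-free data every non-zero difference deserves a look.
flagged = {t: d for t, d in seasonal_diff.items() if d != 0}
print(flagged)  # {60: -9, 84: 9, 116: -3}
```

Indices 60 and 116 are the two anomalies. Index 84 is itself a normal reading. It shows up only because its reference point 24 steps earlier was the anomaly, which is a side effect of seasonal differencing worth keeping in mind when interpreting the output of any detector trained on these differences.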
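Periodic series like the one above are treated here as stationary time series (strictly speaking, a seasonal series becomes stationary only once the seasonal component has been removed, for example by the seasonal differencing used above). Stationary and non-stationary series generally call for different anomaly detection approaches.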
Test data
time,indicator
2018-11-02-02 00:00:00,1
2018-11-02-02 01:00:00,2
2018-11-02-02 02:00:00,3
2018-11-02-02 03:00:00,4
2018-11-02-02 04:00:00,5
2018-11-02-02 05:00:00,6
2018-11-02-02 06:00:00,7
2018-11-02-02 07:00:00,8
2018-11-02-02 08:00:00,9
2018-11-02-02 09:00:00,10
2018-11-02-02 10:00:00,11
2018-11-02-02 11:00:00,12
2018-11-02-02 12:00:00,13
2018-11-02-02 13:00:00,14
2018-11-02-02 14:00:00,15
2018-11-02-02 15:00:00,16
2018-11-02-02 16:00:00,17
2018-11-02-02 17:00:00,18
2018-11-02-02 18:00:00,19
2018-11-02-02 19:00:00,15
2018-11-02-02 20:00:00,11
2018-11-02-02 21:00:00,7
2018-11-02-02 22:00:00,3
2018-11-02-02 23:00:00,1
2018-11-02-03 00:00:00,1
2018-11-02-03 01:00:00,2
2018-11-02-03 02:00:00,3
2018-11-02-03 03:00:00,4
2018-11-02-03 04:00:00,5
2018-11-02-03 05:00:00,6
2018-11-02-03 06:00:00,7
2018-11-02-03 07:00:00,8
2018-11-02-03 08:00:00,9
2018-11-02-03 09:00:00,10
2018-11-02-03 10:00:00,11
2018-11-02-03 11:00:00,12
2018-11-02-03 12:00:00,13
2018-11-02-03 13:00:00,14
2018-11-02-03 14:00:00,15
2018-11-02-03 15:00:00,16
2018-11-02-03 16:00:00,17
2018-11-02-03 17:00:00,18
2018-11-02-03 18:00:00,19
2018-11-02-03 19:00:00,15
2018-11-02-03 20:00:00,11
2018-11-02-03 21:00:00,7
2018-11-02-03 22:00:00,3
2018-11-02-03 23:00:00,1
2018-11-02-04 00:00:00,1
2018-11-02-04 01:00:00,2
2018-11-02-04 02:00:00,3
2018-11-02-04 03:00:00,4
2018-11-02-04 04:00:00,5
2018-11-02-04 05:00:00,6
2018-11-02-04 06:00:00,7
2018-11-02-04 07:00:00,8
2018-11-02-04 08:00:00,9
2018-11-02-04 09:00:00,10
2018-11-02-04 10:00:00,11
2018-11-02-04 11:00:00,12
2018-11-02-04 12:00:00,4
2018-11-02-04 13:00:00,14
2018-11-02-04 14:00:00,15
2018-11-02-04 15:00:00,16
2018-11-02-04 16:00:00,17
2018-11-02-04 17:00:00,18
2018-11-02-04 18:00:00,19
2018-11-02-04 19:00:00,15
2018-11-02-04 20:00:00,11
2018-11-02-04 21:00:00,7
2018-11-02-04 22:00:00,3
2018-11-02-04 23:00:00,1
2018-11-02-05 00:00:00,1
2018-11-02-05 01:00:00,2
2018-11-02-05 02:00:00,3
2018-11-02-05 03:00:00,4
2018-11-02-05 04:00:00,5
2018-11-02-05 05:00:00,6
2018-11-02-05 06:00:00,7
2018-11-02-05 07:00:00,8
2018-11-02-05 08:00:00,9
2018-11-02-05 09:00:00,10
2018-11-02-05 10:00:00,11
2018-11-02-05 11:00:00,12
2018-11-02-05 12:00:00,13
2018-11-02-05 13:00:00,14
2018-11-02-05 14:00:00,15
2018-11-02-05 15:00:00,16
2018-11-02-05 16:00:00,17
2018-11-02-05 17:00:00,18
2018-11-02-05 18:00:00,19
2018-11-02-05 19:00:00,15
2018-11-02-05 20:00:00,11
2018-11-02-05 21:00:00,7
2018-11-02-05 22:00:00,3
2018-11-02-05 23:00:00,1
2018-11-02-06 00:00:00,1
2018-11-02-06 01:00:00,2
2018-11-02-06 02:00:00,3
2018-11-02-06 03:00:00,4
2018-11-02-06 04:00:00,5
2018-11-02-06 05:00:00,6
2018-11-02-06 06:00:00,7
2018-11-02-06 07:00:00,8
2018-11-02-06 08:00:00,9
2018-11-02-06 09:00:00,10
2018-11-02-06 10:00:00,11
2018-11-02-06 11:00:00,12
2018-11-02-06 12:00:00,13
2018-11-02-06 13:00:00,14
2018-11-02-06 14:00:00,15
2018-11-02-06 15:00:00,16
2018-11-02-06 16:00:00,17
2018-11-02-06 17:00:00,18
2018-11-02-06 18:00:00,19
2018-11-02-06 19:00:00,15
2018-11-02-06 20:00:00,8
2018-11-02-06 21:00:00,7
2018-11-02-06 22:00:00,3
2018-11-02-06 23:00:00,1