  • Outlier detection with several methods

    When the amount of contamination is known, the following example illustrates two different ways of performing outlier detection:

    • based on a robust estimator of covariance, which assumes the data are Gaussian distributed and, in that case, performs better than the One-Class SVM;
    • using the One-Class SVM, whose ability to capture the shape of the data set lets it perform better on strongly non-Gaussian data, e.g. two well-separated clusters.

    The ground truth about inliers and outliers is given by the colors of the points, while the orange-filled area indicates which points each method reports as outliers.

    Here we assume that we know the fraction of outliers in the datasets. Rather than using the 'predict' method of the objects, we therefore set a threshold on the decision_function to separate out the corresponding fraction, as shown in the sketch below.

    [Figures: plot_outlier_detection_1.png, plot_outlier_detection_2.png, plot_outlier_detection_3.png — learned decision boundaries at each cluster separation]
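    To make that thresholding step concrete, here is a minimal, self-contained sketch; the data shape and the nu/gamma values are illustrative, not tied to the exact figures above. The lowest-scoring fraction of points is cut off at the matching percentile of decision_function:

        import numpy as np
        from sklearn.svm import OneClassSVM

        rng = np.random.RandomState(0)
        # 75 tight inliers plus 25 uniform outliers (illustrative data)
        X = np.r_[0.3 * rng.randn(75, 2), rng.uniform(-6, 6, size=(25, 2))]
        outliers_fraction = 0.25

        clf = OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                          kernel="rbf", gamma=0.1).fit(X)
        scores = clf.decision_function(X).ravel()
        # Cut at the percentile matching the known contamination
        threshold = np.percentile(scores, 100 * outliers_fraction)
        is_inlier = scores > threshold  # ~25% of points fall below the cut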

     1 """
     2 ==========================================
     3 Outlier detection with several methods.
     4 ==========================================
     5 
     6 When the amount of contamination is known, this example illustrates two
     7 different ways of performing :ref:`outlier_detection`:
     8 
     9 - based on a robust estimator of covariance, which is assuming that the
    10   data are Gaussian distributed and performs better than the One-Class SVM
    11   in that case.
    12 
    13 - using the One-Class SVM and its ability to capture the shape of the
    14   data set, hence performing better when the data is strongly
    15   non-Gaussian, i.e. with two well-separated clusters;
    16 
    17 The ground truth about inliers and outliers is given by the points colors
    18 while the orange-filled area indicates which points are reported as outliers
    19 by each method.
    20 
    21 Here, we assume that we know the fraction of outliers in the datasets.
    22 Thus rather than using the 'predict' method of the objects, we set the
    23 threshold on the decision_function to separate out the corresponding
    24 fraction.
    25 """
    26 print(__doc__)
    27 
    28 import numpy as np
    29 import pylab as pl
    30 import matplotlib.font_manager
    31 from scipy import stats
    32 
    33 from sklearn import svm
    34 from sklearn.covariance import EllipticEnvelope
    35 
    36 # Example settings
    37 n_samples = 200
    38 outliers_fraction = 0.25
    39 clusters_separation = [0, 1, 2]
    40 
    41 # define two outlier detection tools to be compared
    42 classifiers = {
    43     "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
    44                                      kernel="rbf", gamma=0.1),
    45     "robust covariance estimator": EllipticEnvelope(contamination=.1)}
    46 
    47 # Compare given classifiers under given settings
    48 xx, yy = np.meshgrid(np.linspace(-7, 7, 500), np.linspace(-7, 7, 500))
    49 n_inliers = int((1. - outliers_fraction) * n_samples)
    50 n_outliers = int(outliers_fraction * n_samples)
    51 ground_truth = np.ones(n_samples, dtype=int)
    52 ground_truth[-n_outliers:] = 0
    53 
    54 # Fit the problem with varying cluster separation
    55 for i, offset in enumerate(clusters_separation):
    56     np.random.seed(42)
    57     # Data generation
    58     X1 = 0.3 * np.random.randn(0.5 * n_inliers, 2) - offset
    59     X2 = 0.3 * np.random.randn(0.5 * n_inliers, 2) + offset
    60     X = np.r_[X1, X2]
    61     # Add outliers
    62     X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]
    63 
    64     # Fit the model with the One-Class SVM
    65     pl.figure(figsize=(10, 5))
    66     for i, (clf_name, clf) in enumerate(classifiers.iteritems()):
    67         # fit the data and tag outliers
    68         clf.fit(X)
    69         y_pred = clf.decision_function(X).ravel()
    70         threshold = stats.scoreatpercentile(y_pred,
    71                                             100 * outliers_fraction)
    72         y_pred = y_pred > threshold
    73         n_errors = (y_pred != ground_truth).sum()
    74         # plot the levels lines and the points
    75         Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    76         Z = Z.reshape(xx.shape)
    77         subplot = pl.subplot(1, 2, i + 1)
    78         subplot.set_title("Outlier detection")
    79         subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
    80                          cmap=pl.cm.Blues_r)
    81         a = subplot.contour(xx, yy, Z, levels=[threshold],
    82                             linewidths=2, colors='red')
    83         subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
    84                          colors='orange')
    85         b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1], c='white')
    86         c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1], c='black')
    87         subplot.axis('tight')
    88         subplot.legend(
    89             [a.collections[0], b, c],
    90             ['learned decision function', 'true inliers', 'true outliers'],
    91             prop=matplotlib.font_manager.FontProperties(size=11))
    92         subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
    93         subplot.set_xlim((-7, 7))
    94         subplot.set_ylim((-7, 7))
    95     pl.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
    96 
    97 pl.show()
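    Note that this script dates from an early scikit-learn release. In current releases the same contamination-aware labeling can be obtained directly from fit_predict, without thresholding decision_function by hand. A minimal sketch, assuming a recent scikit-learn in which fit_predict returns +1 for inliers and -1 for outliers:

        import numpy as np
        from sklearn.svm import OneClassSVM
        from sklearn.covariance import EllipticEnvelope

        rng = np.random.RandomState(42)
        X = np.r_[0.3 * rng.randn(150, 2), rng.uniform(-6, 6, size=(50, 2))]
        outliers_fraction = 0.25

        # contamination sets the expected outlier fraction directly;
        # for OneClassSVM, nu is only an upper bound on that fraction
        for clf in (EllipticEnvelope(contamination=outliers_fraction),
                    OneClassSVM(nu=outliers_fraction, gamma=0.1)):
            labels = clf.fit_predict(X)  # +1 = inlier, -1 = outlier
            print(type(clf).__name__, (labels == -1).sum(), "points flagged")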

    Total running time of the example: 2.13 seconds

  • Original source: https://www.cnblogs.com/Gihub/p/3828940.html