  • Three ways to detect outliers

    Z-score

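    import numpy as np

    def outliers_z_score(ys):
        """Return the indices of points more than 3 standard deviations from the mean."""
        threshold = 3
        ys = np.asarray(ys, dtype=float)  # accept lists as well as arrays

        mean_y = np.mean(ys)
        stdev_y = np.std(ys)  # population standard deviation (ddof=0)
        z_scores = (ys - mean_y) / stdev_y  # undefined when every value is identical
        return np.where(np.abs(z_scores) > threshold)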
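
    As a usage sketch, the Z-score rule can be applied to a small made-up sample. The values in `sample` and the helper name `z_score_outlier_indices` are illustrative assumptions, not part of the original post. Note that `np.where` returns a tuple of index arrays, so the sketch takes its first element to get plain indices.

```python
import numpy as np

def z_score_outlier_indices(ys, threshold=3.0):
    """Indices of values whose absolute Z-score exceeds `threshold`."""
    ys = np.asarray(ys, dtype=float)
    z_scores = (ys - ys.mean()) / ys.std()
    # np.where returns a 1-tuple for 1-D input; [0] extracts the index array.
    return np.where(np.abs(z_scores) > threshold)[0]

# Hypothetical sample: 19 ordinary readings and one extreme value at the end.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 12, 13, 11, 12, 10, 11, 50]

flagged = z_score_outlier_indices(sample)
print(flagged)                      # positions of the flagged points
print(np.asarray(sample)[flagged])  # the flagged values themselves
```

    The index array can be used directly to look up or drop the flagged values, which is usually more convenient than the raw tuple.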

    Modified Z-score

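    import numpy as np

    def outliers_modified_z_score(ys):
        """Return the indices of points whose absolute modified Z-score exceeds 3.5."""
        threshold = 3.5
        ys = np.asarray(ys, dtype=float)  # accept lists as well as arrays

        median_y = np.median(ys)
        # Median absolute deviation (MAD); zero when more than half the values are equal.
        median_absolute_deviation_y = np.median(np.abs(ys - median_y))
        # 0.6745 rescales the MAD so the score is comparable to a Z-score on normal data.
        modified_z_scores = 0.6745 * (ys - median_y) / median_absolute_deviation_y
        return np.where(np.abs(modified_z_scores) > threshold)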
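
    One reason to prefer the modified score is that the median and MAD are far less affected by extreme values than the mean and standard deviation. The sketch below compares the two rules on a made-up sample in which two gross errors inflate the standard deviation enough to hide themselves. The sample values and the helper names `z_scores` and `modified_z_scores` are assumptions chosen for illustration.

```python
import numpy as np

def z_scores(ys):
    """Standard Z-scores using the mean and population standard deviation."""
    ys = np.asarray(ys, dtype=float)
    return (ys - ys.mean()) / ys.std()

def modified_z_scores(ys):
    """Modified Z-scores using the median and median absolute deviation."""
    ys = np.asarray(ys, dtype=float)
    median = np.median(ys)
    mad = np.median(np.abs(ys - median))
    return 0.6745 * (ys - median) / mad

# Hypothetical sample: seven ordinary values and two gross errors.
sample = [10, 11, 12, 11, 10, 12, 11, 1000, 1000]

plain_flags = np.where(np.abs(z_scores(sample)) > 3)[0]
robust_flags = np.where(np.abs(modified_z_scores(sample)) > 3.5)[0]

print("Z-score flags:", plain_flags)            # nothing is flagged
print("Modified Z-score flags:", robust_flags)  # both extreme values are flagged
```

    On this sample the plain Z-score flags nothing, while the modified score flags both extreme values. This sketch does not guard against a MAD of zero, which would make the division undefined.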

    IQR (interquartile range)

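    import numpy as np

    def outliers_iqr(ys):
        """Return the indices of points more than 1.5 * IQR outside the quartiles."""
        ys = np.asarray(ys)  # accept lists as well as arrays
        quartile_1, quartile_3 = np.percentile(ys, [25, 75])
        iqr = quartile_3 - quartile_1
        # Points beyond these fences are flagged as outliers.
        lower_bound = quartile_1 - (iqr * 1.5)
        upper_bound = quartile_3 + (iqr * 1.5)
        return np.where((ys > upper_bound) | (ys < lower_bound))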
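
    The fences themselves are often worth inspecting alongside the flagged points. The sketch below returns the bounds and uses a boolean mask to separate flagged values from the rest. The helper name `iqr_bounds`, the parameter `k`, and the sample values are illustrative assumptions.

```python
import numpy as np

def iqr_bounds(ys, k=1.5):
    """Lower and upper fences: Q1 - k * IQR and Q3 + k * IQR."""
    quartile_1, quartile_3 = np.percentile(ys, [25, 75])
    iqr = quartile_3 - quartile_1
    return quartile_1 - k * iqr, quartile_3 + k * iqr

# Hypothetical sample: 19 ordinary readings and one extreme value at the end.
sample = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                   12, 10, 11, 12, 13, 11, 12, 10, 11, 50])

lower, upper = iqr_bounds(sample)
is_outlier = (sample < lower) | (sample > upper)

print("fences:", lower, upper)
print("flagged values:", sample[is_outlier])
print("values kept:", sample[~is_outlier].size)
```

    A boolean mask keeps the flagged and retained values aligned with the original array, which makes it easy to look at the suspicious points before deciding whether to drop them.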
    

    Conclusion

    It is important to reiterate that these methods should not be used mechanically. 
    They should be used to explore the data – they let you know which points might be worth a closer look. 
    What to do with this information depends heavily on the situation. 
    Sometimes it is appropriate to exclude outliers from a dataset to make a model trained on that dataset more predictive. 
    Sometimes, however, the presence of outliers is a warning sign that the real-world process generating the data is more complicated than expected.
    
    As an astute commenter on CrossValidated put it: 
    “Sometimes outliers are bad data, and should be excluded, such as typos.
    Sometimes they are Wayne Gretzky or Michael Jordan, and should be kept.” 
    
    Domain knowledge and practical wisdom are the only ways to tell the difference.
    

      

    Source: http://colingorrie.github.io/outlier-detection.html

  • Original post: https://www.cnblogs.com/standby/p/9403999.html