  • Cross-validation in sklearn

    Cross-Validation in sklearn

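    sklearn (scikit-learn) is a comprehensive and easy-to-use third-party Python library for machine learning. This post records the various ways of doing cross-validation in sklearn, mainly by walking through the official sklearn documentation page "Cross-validation: evaluating estimator performance". Readers are encouraged to read the official documentation itself, which covers these topics in detail.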

    1. cross_val_score
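    Runs cross-validation on a dataset for the specified number of folds and returns the score of each fold.
    By default (scoring=None) the estimator's own score method is used, which is accuracy for classifiers and R² for regressors. Other metrics for classification or regression, such as 'f1_macro', are selected through the scoring parameter of cross_val_score.

    A scoring name is passed as a string, so from sklearn import metrics is only needed when building a custom scorer (for example with metrics.make_scorer).
    When cv is an int, StratifiedKFold is used if the estimator is a classifier and the target is binary or multiclass, and KFold is used otherwise; neither shuffles the data by default. KFold and StratifiedKFold are described below.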

    In [13]: from sklearn import datasets, svm
    
    In [14]: iris = datasets.load_iris()
    
    In [15]: from sklearn.model_selection import cross_val_score
    
    In [16]: clf = svm.SVC(kernel='linear', C=1)
    
    In [17]: scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    
    In [18]: scores
    Out[18]: array([ 0.96666667,  1.        ,  0.96666667,  0.96666667,  1.        ])
    
    In [19]: scores.mean()
    Out[19]: 0.98000000000000009
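
    As a minimal sketch of the scoring parameter described above, the snippet below compares the default scorer with an explicit "accuracy" scorer and then switches to "f1_macro". It reuses the iris data and the linear SVC from the session; the variable names are illustrative only.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear", C=1)

# scoring=None (the default) falls back to clf.score, which is accuracy for SVC.
default_scores = cross_val_score(clf, iris.data, iris.target, cv=5)
accuracy_scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring="accuracy")

# A different metric is selected by name; no import from sklearn.metrics is needed.
f1_scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring="f1_macro")

print(default_scores)
print(accuracy_scores)
print(f1_scores)
```

    The first two arrays should be identical, which is an easy way to confirm what the default scorer is for a given estimator.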
    

      

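    Instead of the default cross-validation strategy, a splitter object can be passed explicitly to control, for example, the number of splits and the train/test proportions.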

    In [20]: from sklearn.model_selection import ShuffleSplit
    
    In [21]: n_samples = iris.data.shape[0]
    
    In [22]: cv = ShuffleSplit(n_splits=3, test_size=.3, random_state=0)
    
    In [23]: cross_val_score(clf, iris.data, iris.target, cv=cv)
    Out[23]: array([ 0.97777778,  0.97777778,  1.        ])
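
    The cv argument therefore accepts either an integer or a splitter object. The sketch below, which assumes the same iris data and linear SVC as the session, checks that an integer cv on a classifier gives the same folds as an explicit, unshuffled StratifiedKFold.

```python
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear", C=1)

# cv=5 on a classifier with a multiclass target ...
int_scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# ... is resolved to an unshuffled StratifiedKFold with 5 splits.
explicit_scores = cross_val_score(
    clf, iris.data, iris.target, cv=StratifiedKFold(n_splits=5)
)

print(int_scores)
print(explicit_scores)
```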
    

      

    2. cross_val_predict
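    cross_val_predict is very similar to cross_val_score, but instead of scores it returns the estimator's prediction (class label or regression value) for each sample, made while that sample was in a test fold. This is useful for improving a model later on: comparing the predictions with the true targets pinpoints exactly which samples were predicted wrongly, which helps with parameter tuning and troubleshooting.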

    In [28]: from sklearn.model_selection import cross_val_predict
    
    In [29]: from sklearn import metrics
    
    In [30]: predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
    
    In [31]: predicted
    Out[31]: 
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
           0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,
           2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
           2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
    
    In [32]: metrics.accuracy_score(iris.target, predicted)
    Out[32]: 0.96666666666666667
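
    One possible way to do the error analysis mentioned above is sketched here: it lists the indices where the cross-validated prediction disagrees with the target and summarizes them in a confusion matrix. The variable names wrong and cm are illustrative.

```python
import numpy as np
from sklearn import datasets, metrics, svm
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear", C=1)
predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)

# Indices of the samples whose out-of-fold prediction is wrong.
wrong = np.flatnonzero(predicted != iris.target)
for i in wrong:
    print(i, "true:", iris.target_names[iris.target[i]],
          "predicted:", iris.target_names[predicted[i]])

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(iris.target, predicted)
print(cm)
```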
    

      

    3. KFold

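    K-fold cross-validation is the standard scheme sklearn provides for splitting a dataset into K parts (folds). Each fold is used once as the test set while the remaining K-1 folds form the training set, so every sample appears in a test set exactly once and in a training set K-1 times. Within a split the training and test sets do not overlap, which makes the test folds equivalent to sampling without replacement.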

    In [33]: import numpy as np
        ...: from sklearn.model_selection import KFold
    
    In [34]: X = ['a','b','c','d']
    
    In [35]: kf = KFold(n_splits=2)
    
    In [36]: for train, test in kf.split(X):
        ...:     print(train, test)
        ...:     print(np.array(X)[train], np.array(X)[test])
        ...:     print('\n')
        ...:     
    [2 3] [0 1]
    ['c' 'd'] ['a' 'b']
    
    
    [0 1] [2 3]
    ['a' 'b'] ['c' 'd']
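
    The K-fold property described above can be checked directly. This small sketch, on a made-up array of 10 samples, counts how often each index lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # toy data, chosen only for illustration
kf = KFold(n_splits=5)

test_counts = np.zeros(len(X), dtype=int)
for train, test in kf.split(X):
    # Training and test indices never overlap within a split.
    assert np.intersect1d(train, test).size == 0
    test_counts[test] += 1

# Every sample is used for testing exactly once across the 5 splits.
print(test_counts)
```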
    

      

    4. LeaveOneOut
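    LeaveOneOut is a special case of KFold with n_splits equal to the number of samples. It is provided as its own class because it is used so often, but it can be reproduced exactly with KFold.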

    In [37]: from sklearn.model_selection import LeaveOneOut
    
    In [38]: X = [1,2,3,4]
    
    In [39]: loo = LeaveOneOut()
    
    In [41]: for train, test in loo.split(X):
        ...:     print(train, test)
        ...:     
    [1 2 3] [0]
    [0 2 3] [1]
    [0 1 3] [2]
    [0 1 2] [3]
    
    
    # Reproducing LeaveOneOut with KFold
    In [42]: kf = KFold(n_splits=len(X))
    
    In [43]: for train, test in kf.split(X):
        ...:     print(train, test)
        ...:     
    [1 2 3] [0]
    [0 2 3] [1]
    [0 1 3] [2]
    [0 1 2] [3]
    

      

    5. LeavePOut
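    Unlike LeaveOneOut, this is not a special case of KFold: LeavePOut enumerates every possible way of holding out p samples, giving C(n, p) splits whose test sets overlap when p > 1. With p=1 it reduces to LeaveOneOut.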

    In [44]: from sklearn.model_selection import LeavePOut
    
    In [45]: X = np.ones(4)
    
    In [46]: lpo = LeavePOut(p=2)
    
    In [47]: for train, test in lpo.split(X):
        ...:     print(train, test)
        ...:     
    [2 3] [0 1]
    [1 3] [0 2]
    [1 2] [0 3]
    [0 3] [1 2]
    [0 2] [1 3]
    [0 1] [2 3]
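
    To make the combinatorial nature of LeavePOut concrete, this sketch counts the splits for the same 4-sample example and how often each sample is held out. It relies on math.comb, which requires Python 3.8 or later.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

X = np.ones(4)
lpo = LeavePOut(p=2)

# The number of splits is the number of ways to choose p samples out of n.
n_splits = lpo.get_n_splits(X)
print(n_splits, comb(len(X), 2))

test_counts = np.zeros(len(X), dtype=int)
for train, test in lpo.split(X):
    test_counts[test] += 1

# Test sets overlap: each sample is held out in C(n-1, p-1) of the splits.
print(test_counts)
```

    Because the number of splits grows combinatorially with the dataset size, LeavePOut is usually practical only for small datasets.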
    

      

    6. ShuffleSplit
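    At first glance ShuffleSplit looks similar to LeavePOut, but the two work quite differently. LeavePOut exhaustively enumerates every test set of size p, so each sample is held out the same number of times. ShuffleSplit instead draws n_splits independent random splits: within one split the training and test sets never overlap, but across splits a sample may land in the test set several times or not at all, so covering the whole dataset becomes likely, but is never guaranteed, as the number of splits grows.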

    In [48]: from sklearn.model_selection import ShuffleSplit
    
    In [49]: X = np.arange(5)
    
    In [50]: ss = ShuffleSplit(n_splits=3, test_size=.25, random_state=0)
    
    In [51]: for train_index, test_index in ss.split(X):
        ...:     print(train_index, test_index)
        ...:     
    [1 3 4] [2 0]
    [1 4 3] [0 2]
    [4 0 2] [1 3]
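
    The independence of the splits can be seen by counting how often each sample is drawn into a test set. The sketch below uses a made-up array of 10 samples and 20 splits; these sizes are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10)  # toy data, chosen only for illustration
ss = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)

test_counts = np.zeros(len(X), dtype=int)
for train_index, test_index in ss.split(X):
    # Within a single split there is no overlap between train and test.
    assert np.intersect1d(train_index, test_index).size == 0
    test_counts[test_index] += 1

# Across splits the counts are generally uneven, unlike KFold or LeavePOut.
print(test_counts)
```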
    

      

    7. StratifiedKFold

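    A variant of KFold that returns stratified folds: the test sets do not overlap, and each fold keeps approximately the same class proportions as the full dataset.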

    In [52]: from sklearn.model_selection import StratifiedKFold
    
    In [53]: X = np.ones(10)
    
    In [54]: y = [0,0,0,0,1,1,1,1,1,1]
    
    In [55]: skf = StratifiedKFold(n_splits=3)
    
    In [56]: for train, test in skf.split(X,y):
        ...:     print(train, test)
        ...:     
    [2 3 6 7 8 9] [0 1 4 5]
    [0 1 3 4 5 8 9] [2 6 7]
    [0 1 2 4 5 6 7] [3 8 9]
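
    To show what stratification changes, this sketch compares the class counts in the test folds of KFold and StratifiedKFold on the same labels as the session. The counts dictionary is only an illustrative helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.ones(10)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

splitters = {
    "KFold": KFold(n_splits=2),
    "StratifiedKFold": StratifiedKFold(n_splits=2),
}

counts = {}
for name, splitter in splitters.items():
    # For each test fold, count how many samples of class 0 and class 1 it holds.
    counts[name] = [
        np.bincount(y[test], minlength=2).tolist()
        for _, test in splitter.split(X, y)
    ]
    print(name, counts[name])
```

    With unshuffled KFold on sorted labels, one test fold contains no samples of class 0 at all, while StratifiedKFold keeps the 4:6 class ratio in every fold.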
    

      

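     If the classes are imbalanced, micro-averaged F1 is often a better choice than macro-averaged F1, because macro F1 averages the per-class scores without weighting them by class size.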
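
    The difference between the two averages is easiest to see on a small, deliberately imbalanced example. The labels below are made up for illustration: a classifier that always predicts the majority class. The zero_division argument assumes a reasonably recent sklearn version (0.22 or later).

```python
from sklearn.metrics import f1_score

# 9 samples of class 0, 1 sample of class 1; the minority class is never predicted.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

# Micro averaging pools all samples, so it equals accuracy here (0.9).
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)

# Macro averaging gives both classes equal weight: (0.947... + 0) / 2.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(micro, macro)
```

    Which average is preferable depends on the goal: micro F1 reflects per-sample performance, while macro F1 exposes poor performance on small classes.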


    Original post: https://blog.csdn.net/xiaodongxiexie/article/details/71915259

  • Source URL: https://www.cnblogs.com/Allen-rg/p/9901530.html