[scikit-learn] Efficient parameter tuning with grid search

    Overview

    • How to use K-fold cross-validation to search for the optimal tuning parameter
    • How to make the parameter-search process more efficient
    • How to search over multiple tuning parameters at once
    • What to do with the tuning parameters before making actual predictions
    • How to reduce the computational cost of the process
     

    1. Review of K-fold cross-validation

    The cross-validation procedure

    • Choose a value of K (typically 10) and split the dataset into K equal folds
    • Train the model on K-1 of the folds, using the remaining fold as test data; repeat K times so that each fold serves as the test set exactly once (see the sketch at the end of this section)
    • Use an evaluation metric to measure the model's predictive performance, averaged over the K runs

    Advantages of cross-validation

    • Cross-validation reduces the variance that comes from a single train/test split, giving a more stable estimate of model performance
    • Cross-validation can be used to select tuning parameters, compare models, and select features

    Disadvantages of cross-validation

    • Cross-validation adds computational cost; when the dataset is large, the process can become very slow
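
    As a concrete sketch of the procedure above, the following runs 10-fold cross-validation for a KNN classifier with cross_val_score (assumed imported from sklearn.cross_validation, matching the sklearn version used in this post; in modern releases it lives in sklearn.model_selection):

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in modern releases

    iris = load_iris()
    knn = KNeighborsClassifier(n_neighbors=5)

    # 10-fold CV: each fold serves as the test set exactly once
    scores = cross_val_score(knn, iris.data, iris.target, cv=10, scoring='accuracy')
    print(scores)         # 10 per-fold accuracy values
    print(scores.mean())  # their average estimates out-of-sample accuracy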
     

    2. Efficient parameter tuning with GridSearchCV

    GridSearchCV automatically cross-validates the model you give it over every candidate parameter value and keeps track of the resulting scores. In effect, it replaces the for loop you would otherwise write for a parameter search, as sketched below.
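
    To make that concrete, here is a rough sketch of the manual loop that GridSearchCV replaces (again assuming cross_val_score from sklearn.cross_validation; sklearn.model_selection in modern releases):

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in modern releases

    iris = load_iris()
    X, y = iris.data, iris.target

    # one cross-validation run per candidate value of n_neighbors
    k_scores = []
    for k in range(1, 31):
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
        k_scores.append(scores.mean())

    # GridSearchCV does all of this bookkeeping for us
    print(max(k_scores))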

    In [1]:
    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    from sklearn.grid_search import GridSearchCV  # moved to sklearn.model_selection in modern scikit-learn
    
    In [2]:
    # read in the iris data
    iris = load_iris()
    
    # create X (features) and y (response)
    X = iris.data
    y = iris.target
    
    In [3]:
    # define the parameter values that should be searched
    k_range = range(1, 31)
    print k_range
    
     
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
    
    In [4]:
    # create a parameter grid: map the parameter names to the values that should be searched
    param_grid = dict(n_neighbors=k_range)
    print param_grid
    
     
    {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}
    
    In [5]:
    knn = KNeighborsClassifier(n_neighbors=5)  # n_neighbors here is a placeholder; the grid search tries all candidate values
    # instantiate the grid
    # GridSearchCV takes arguments much like cross_val_score; param_grid is the parameter grid built above
    # setting n_jobs=-1 in GridSearchCV enables parallel computation (if your machine supports it)
    grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
    
     

    So the grid search here runs 10-fold cross-validation for each of the 30 candidate parameter values, i.e. 300 model fits in total.

    In [6]:
    grid.fit(X, y)
    
    Out[6]:
    GridSearchCV(cv=10, error_score='raise',
           estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
               metric_params=None, n_neighbors=5, p=2, weights='uniform'),
           fit_params={}, iid=True, loss_func=None, n_jobs=1,
           param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]},
           pre_dispatch='2*n_jobs', refit=True, score_func=None,
           scoring='accuracy', verbose=0)
    In [7]:
    # view the complete results (list of named tuples)
    grid.grid_scores_
    
    Out[7]:
    [mean: 0.96000, std: 0.05333, params: {'n_neighbors': 1},
     mean: 0.95333, std: 0.05207, params: {'n_neighbors': 2},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 3},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 4},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 5},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 6},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 7},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 8},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 9},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 10},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 11},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 12},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 13},
     mean: 0.97333, std: 0.04422, params: {'n_neighbors': 14},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 15},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 16},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 17},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 18},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 19},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 20},
     mean: 0.96667, std: 0.03333, params: {'n_neighbors': 21},
     mean: 0.96667, std: 0.03333, params: {'n_neighbors': 22},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 23},
     mean: 0.96000, std: 0.04422, params: {'n_neighbors': 24},
     mean: 0.96667, std: 0.03333, params: {'n_neighbors': 25},
     mean: 0.96000, std: 0.04422, params: {'n_neighbors': 26},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 27},
     mean: 0.95333, std: 0.04269, params: {'n_neighbors': 28},
     mean: 0.95333, std: 0.04269, params: {'n_neighbors': 29},
     mean: 0.95333, std: 0.04269, params: {'n_neighbors': 30}]
    In [8]:
    # examine the first tuple
    print grid.grid_scores_[0].parameters
    print grid.grid_scores_[0].cv_validation_scores
    print grid.grid_scores_[0].mean_validation_score
    
     
    {'n_neighbors': 1}
    [ 1.          0.93333333  1.          0.93333333  0.86666667  1.
      0.86666667  1.          1.          1.        ]
    0.96
    
    In [9]:
    # create a list of the mean scores only
    grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]
    print grid_mean_scores
    
     
    [0.95999999999999996, 0.95333333333333337, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.96666666666666667, 0.97333333333333338, 0.96666666666666667, 0.96666666666666667, 0.97333333333333338, 0.97999999999999998, 0.97333333333333338, 0.97333333333333338, 0.97333333333333338, 0.97333333333333338, 0.97999999999999998, 0.97333333333333338, 0.97999999999999998, 0.96666666666666667, 0.96666666666666667, 0.97333333333333338, 0.95999999999999996, 0.96666666666666667, 0.95999999999999996, 0.96666666666666667, 0.95333333333333337, 0.95333333333333337, 0.95333333333333337]
    
    In [10]:
    # plot the results
    plt.plot(k_range, grid_mean_scores)
    plt.xlabel('Value of K for KNN')
    plt.ylabel('Cross-Validated Accuracy')
    
    Out[10]:
    <matplotlib.text.Text at 0x6e34090>
    [Figure: cross-validated accuracy as a function of K]
    In [11]:
    # examine the best model
    print grid.best_score_
    print grid.best_params_
    print grid.best_estimator_
    
     
    0.98
    {'n_neighbors': 13}
    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
               metric_params=None, n_neighbors=13, p=2, weights='uniform')
    
     

    3. Searching over multiple parameters simultaneously

    Here we search over two KNN parameters, n_neighbors and weights. The weights parameter defaults to 'uniform', which treats every neighbor equally; the alternative value, 'distance', gives closer neighbors higher weight and more distant neighbors lower weight.
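
    As a rough illustration of the difference, here is a toy calculation of inverse-distance voting, which is how scikit-learn documents weights='distance' (this is an illustrative sketch, not the library's internal code):

    import numpy as np

    # toy example: the 3 nearest neighbors of a query point
    distances = np.array([0.5, 1.0, 2.0])   # distances to the query
    labels = np.array([0, 1, 1])            # their class labels

    # weights='uniform': one vote each, so class 1 wins 2 to 1
    # weights='distance': each vote is weighted by 1/distance
    w = 1.0 / distances                      # [2.0, 1.0, 0.5]
    print(w[labels == 0].sum())              # 2.0 -> class 0
    print(w[labels == 1].sum())              # 1.5 -> the single close class-0 neighbor now outvotes the two class-1 neighbors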

    In [12]:
    # define the parameter values that should be searched
    k_range = range(1, 31)
    weight_options = ['uniform', 'distance']
    
    In [13]:
    # create a parameter grid: map the parameter names to the values that should be searched
    param_grid = dict(n_neighbors=k_range, weights=weight_options)
    print param_grid
    
     
    {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'weights': ['uniform', 'distance']}
    
    In [14]:
    # instantiate and fit the grid
    grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
    grid.fit(X, y)
    
    Out[14]:
    GridSearchCV(cv=10, error_score='raise',
           estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
               metric_params=None, n_neighbors=5, p=2, weights='uniform'),
           fit_params={}, iid=True, loss_func=None, n_jobs=1,
           param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30], 'weights': ['uniform', 'distance']},
           pre_dispatch='2*n_jobs', refit=True, score_func=None,
           scoring='accuracy', verbose=0)
    In [15]:
    # view the complete results
    grid.grid_scores_
    
    Out[15]:
    [mean: 0.96000, std: 0.05333, params: {'n_neighbors': 1, 'weights': 'uniform'},
     mean: 0.96000, std: 0.05333, params: {'n_neighbors': 1, 'weights': 'distance'},
     mean: 0.95333, std: 0.05207, params: {'n_neighbors': 2, 'weights': 'uniform'},
     mean: 0.96000, std: 0.05333, params: {'n_neighbors': 2, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 3, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 3, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 4, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 4, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 5, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 5, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 6, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 6, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 7, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 7, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 8, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 8, 'weights': 'distance'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 9, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 9, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 10, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 10, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 11, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 11, 'weights': 'distance'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 12, 'weights': 'uniform'},
     mean: 0.97333, std: 0.04422, params: {'n_neighbors': 12, 'weights': 'distance'},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 13, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 13, 'weights': 'distance'},
     mean: 0.97333, std: 0.04422, params: {'n_neighbors': 14, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 14, 'weights': 'distance'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 15, 'weights': 'uniform'},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 15, 'weights': 'distance'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 16, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 16, 'weights': 'distance'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 17, 'weights': 'uniform'},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 17, 'weights': 'distance'},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 18, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 18, 'weights': 'distance'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 19, 'weights': 'uniform'},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 19, 'weights': 'distance'},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 20, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 20, 'weights': 'distance'},
     mean: 0.96667, std: 0.03333, params: {'n_neighbors': 21, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 21, 'weights': 'distance'},
     mean: 0.96667, std: 0.03333, params: {'n_neighbors': 22, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 22, 'weights': 'distance'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 23, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 23, 'weights': 'distance'},
     mean: 0.96000, std: 0.04422, params: {'n_neighbors': 24, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 24, 'weights': 'distance'},
     mean: 0.96667, std: 0.03333, params: {'n_neighbors': 25, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 25, 'weights': 'distance'},
     mean: 0.96000, std: 0.04422, params: {'n_neighbors': 26, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 26, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 27, 'weights': 'uniform'},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 27, 'weights': 'distance'},
     mean: 0.95333, std: 0.04269, params: {'n_neighbors': 28, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 28, 'weights': 'distance'},
     mean: 0.95333, std: 0.04269, params: {'n_neighbors': 29, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 29, 'weights': 'distance'},
     mean: 0.95333, std: 0.04269, params: {'n_neighbors': 30, 'weights': 'uniform'},
     mean: 0.96667, std: 0.03333, params: {'n_neighbors': 30, 'weights': 'distance'}]
    In [16]:
    # examine the best model
    print grid.best_score_
    print grid.best_params_
    
     
    0.98
    {'n_neighbors': 13, 'weights': 'uniform'}
    
     

    4. Making predictions with the best parameters

    In [17]:
    # train your model using all data and the best known parameters
    knn = KNeighborsClassifier(n_neighbors=13, weights='uniform')
    knn.fit(X, y)
    
    # make a prediction on out-of-sample data
    knn.predict([[3, 5, 4, 2]])  # a 2D array: one row per sample
    
    Out[17]:
    array([1])
     

    Here the model is retrained with the best parameters found earlier. At this stage we can feed all of the data into the model as training data, so that no observations go to waste.

    In [18]:
    # shortcut: GridSearchCV automatically refits the best model using all of the data
    grid.predict([[3, 5, 4, 2]])
    
    Out[18]:
    array([1])
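
    Since refit=True by default (visible in the GridSearchCV output above), grid.predict simply delegates to grid.best_estimator_, the best model refit on all of the data; the call below should be equivalent:

    # grid.predict delegates to the refit best model
    grid.best_estimator_.predict([[3, 5, 4, 2]])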
     

    5. Reducing computational cost with RandomizedSearchCV

    • RandomizedSearchCV addresses the excessive computational cost of searching over many parameters
    • RandomizedSearchCV searches only a subset of the parameter settings, which lets you control the computational cost
    In [19]:
    from sklearn.grid_search import RandomizedSearchCV  # moved to sklearn.model_selection in modern scikit-learn
    
    In [20]:
    # specify "parameter distributions" rather than a "parameter grid"
    param_dist = dict(n_neighbors=k_range, weights=weight_options)
    
    In [21]:
    # n_iter controls the number of parameter settings that are sampled
    rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
    rand.fit(X, y)
    rand.grid_scores_
    
    Out[21]:
    [mean: 0.97333, std: 0.03266, params: {'n_neighbors': 18, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 8, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 24, 'weights': 'distance'},
     mean: 0.98000, std: 0.03055, params: {'n_neighbors': 20, 'weights': 'uniform'},
     mean: 0.95333, std: 0.04269, params: {'n_neighbors': 28, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 9, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 5, 'weights': 'distance'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 5, 'weights': 'uniform'},
     mean: 0.97333, std: 0.03266, params: {'n_neighbors': 19, 'weights': 'uniform'},
     mean: 0.96667, std: 0.04472, params: {'n_neighbors': 20, 'weights': 'distance'}]
    In [22]:
    # examine the best model
    print rand.best_score_
    print rand.best_params_
    
     
    0.98
    {'n_neighbors': 20, 'weights': 'uniform'}
    
    In [23]:
    # run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
    best_scores = []
    for _ in range(20):
        rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10)
        rand.fit(X, y)
        best_scores.append(round(rand.best_score_, 3))
    print best_scores
    
     
    [0.98, 0.98, 0.973, 0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.98, 0.973, 0.98, 0.98, 0.973, 0.973]
    
     

    When a tuning parameter is continuous, such as a regularization parameter in a regression problem, it is better to specify a continuous distribution rather than a list of candidate values, so that RandomizedSearchCV can explore the parameter space more effectively.
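
    A minimal sketch of what that looks like, assuming we tune the alpha regularization parameter of a Ridge regression on the diabetes dataset (the model and range here are illustrative, not from the original post):

    from scipy.stats import uniform
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection in modern releases

    # a continuous distribution: alpha is sampled uniformly from [0, 10)
    param_dist = dict(alpha=uniform(loc=0, scale=10))

    data = load_diabetes()
    rand = RandomizedSearchCV(Ridge(), param_dist, cv=10, scoring='r2',
                              n_iter=10, random_state=5)
    rand.fit(data.data, data.target)
    print(rand.best_params_)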

     

    References

    Reposted from: http://blog.csdn.net/jasonding1354/article/details/50562522
