  • 08-08 Breaking Down the Workflow for Building a Machine Learning Application: Model Optimization



    Breaking Down the Workflow for Building a Machine Learning Application: Model Optimization

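    The four steps above (data collection, data preprocessing, model training, and model testing) usually yield a reasonably good model, but typically only one whose parameters have converged. The model still has hyperparameters and other choices that training does not tune, such as the kernel of a nonlinear support vector machine (rbf or linear), the hyperparameters of an rbf-kernel SVM (C and gamma), and the regularization strength (alpha). Model optimization here mainly means hyperparameter optimization: in short, we supply a set of candidate hyperparameter values, evaluate the model built with each candidate, and keep the best-performing one.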
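
    The idea of trying each candidate and keeping the best can be sketched by hand before reaching for a dedicated search class. This is a minimal illustration only: the candidate values for C, the 5-fold setting, and the names candidates, mean_scores, and best_C are assumptions made for the example, not part of the original workflow.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# Hypothetical candidate values for the hyperparameter C (illustrative only).
candidates = [0.1, 1, 10, 100]

# Evaluate the model built with each candidate using 5-fold cross-validation.
mean_scores = {}
for C in candidates:
    model = SVC(kernel="rbf", C=C, gamma="scale")
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_scores[C] = scores.mean()

# Keep the candidate with the highest mean validation accuracy.
best_C = max(mean_scores, key=mean_scores.get)
print(best_C, mean_scores[best_C])
```

    The search classes in sklearn.model_selection automate this loop and extend it to several hyperparameters at once.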

    1.1 Grid search

    Grid search evaluates every combination of the hyperparameter values you supply, and it can tune several hyperparameters at once.

    from sklearn import datasets
    from sklearn.svm import SVC
    from sklearn.model_selection import ShuffleSplit
    from sklearn.model_selection import GridSearchCV
    
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    
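    # 2 kernels * 4 values of C = 8 candidate combinations in total
    parameters = {'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 10, 100]}
    
    svc = SVC(gamma="scale")
    
    # 10 random train/test splits, each holding out 30% of the data
    cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=1)
    
    scoring = 'accuracy'
    
    # Fit and score every combination on every split
    clf = GridSearchCV(svc, parameters, cv=cv, scoring=scoring)
    clf.fit(X, y)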
    
    GridSearchCV(cv=ShuffleSplit(n_splits=10, random_state=1, test_size=0.3, train_size=None),
           error_score='raise-deprecating',
           estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False),
           fit_params=None, iid='warn', n_jobs=None,
           param_grid={'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 10, 100]},
           pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
           scoring='accuracy', verbose=0)
    
    clf.cv_results_.keys()
    
    dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_C', 'param_kernel', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'split5_test_score', 'split6_test_score', 'split7_test_score', 'split8_test_score', 'split9_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'split5_train_score', 'split6_train_score', 'split7_train_score', 'split8_train_score', 'split9_train_score', 'mean_train_score', 'std_train_score'])
    
    clf.best_params_
    
    {'C': 1, 'kernel': 'linear'}
    
    clf.best_score_
    
    0.9844444444444445
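
    To see exactly which combinations a grid like the one above contains, the grid can be enumerated with sklearn.model_selection.ParameterGrid. This is a small sketch for inspection only; the variable name grid is an assumption made for the example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search example: 2 kernels * 4 values of C.
parameters = {'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 10, 100]}

grid = ParameterGrid(parameters)

# Number of candidate combinations an exhaustive search would try.
print(len(grid))

# Each element is one dict of hyperparameters.
for params in grid:
    print(params)
```

    Because the number of combinations is the product of the list lengths, adding another hyperparameter to the grid multiplies the number of models to fit.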
    

    1.2 Random search

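    Random search is generally used when there are too many hyperparameter combinations to try exhaustively, for example thousands or tens of thousands of them. In that case we randomly sample a subset of the combinations and tune the model on those. The sampling is typically done with the sklearn.model_selection.ParameterSampler class.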

    1.2.1 Random sampling

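    from sklearn.model_selection import ParameterSampler
    from scipy.stats import expon
    import numpy as np
    
    # ParameterSampler falls back to NumPy's global random state when random_state is None
    np.random.seed(1)
    param_grid = {'a': [1, 2], 'b': expon()}
    # expon is the exponential distribution; it is continuous, so param_grid describes infinitely many candidate settings
    param_grid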
    
    {'a': [1, 2],
     'b': <scipy.stats._distn_infrastructure.rv_frozen at 0x1a1772a630>}
    
    # n_iter=5 draws 5 parameter settings
    param_list = list(ParameterSampler(param_grid, n_iter=5))
    
    rounded_list = [dict((k, round(v, 2)) for (k, v) in d.items())
                    for d in param_list]
    rounded_list
    
    [{'a': 2, 'b': 5.87},
     {'a': 1, 'b': 0.0},
     {'a': 2, 'b': 6.95},
     {'a': 1, 'b': 0.1},
     {'a': 1, 'b': 0.49}]
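
    Instead of seeding NumPy's global generator, the sampler can be given its own seed through the random_state parameter of ParameterSampler. The sketch below only checks that two samplers with the same seed draw the same settings; the seed value 0 and the names param_distributions, first, and second are assumptions made for the example.

```python
from scipy.stats import expon
from sklearn.model_selection import ParameterSampler

# 'a' is drawn from a list, 'b' from a continuous exponential distribution.
param_distributions = {'a': [1, 2], 'b': expon()}

# Two samplers with the same seed (0 is an arbitrary choice).
first = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
second = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))

# Identical seeds give identical draws, independent of np.random.seed.
print(first == second)
```

    Passing the seed explicitly keeps the sampling reproducible even when other code in the same session also uses NumPy's global random state.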
    

    1.2.2 Random search

    from sklearn import datasets
    from sklearn.svm import SVC
    from sklearn.model_selection import ShuffleSplit
    from sklearn.model_selection import RandomizedSearchCV
    
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    
    # 2 kernels * 5 values of C * 5 values of gamma = 50 candidate combinations in total
    parameters = {'kernel': ('linear', 'rbf'), 'C': [
        0.1, 1, 10, 100, 1000], 'gamma': [0.1, 1, 10, 100, 1000]}
    svc = SVC()
    
    cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=1)
    
    scoring = 'accuracy'
    
    # n_iter=15 evaluates only 15 of the 50 combinations, chosen at random
    clf = RandomizedSearchCV(svc, parameters, cv=cv, scoring=scoring, n_iter=15)
    clf.fit(X, y)
    
    RandomizedSearchCV(cv=ShuffleSplit(n_splits=5, random_state=1, test_size=0.3, train_size=None),
              error_score='raise-deprecating',
              estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, probability=False, random_state=None,
      shrinking=True, tol=0.001, verbose=False),
              fit_params=None, iid='warn', n_iter=15, n_jobs=None,
              param_distributions={'kernel': ('linear', 'rbf'), 'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.1, 1, 10, 100, 1000]},
              pre_dispatch='2*n_jobs', random_state=None, refit=True,
              return_train_score='warn', scoring='accuracy', verbose=0)
    
    clf.cv_results_.keys()
    
    dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_kernel', 'param_gamma', 'param_C', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'mean_train_score', 'std_train_score'])
    
    clf.best_params_
    
    {'kernel': 'linear', 'gamma': 1000, 'C': 1}
    
    clf.best_score_
    
    0.9822222222222222
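
    The two ideas in this section can be combined: RandomizedSearchCV also accepts distributions, like the expon object from the sampling example, in place of fixed lists. The following is a sketch under stated assumptions, not a tuned setup: the scale values 100 and 0.1, the seed 1, and the names param_distributions and search are illustrative choices that do not come from the original post.

```python
from scipy.stats import expon
from sklearn import datasets
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

# Hypothetical search space: C and gamma are drawn from exponential
# distributions (the scales are illustrative), kernel from a list.
param_distributions = {
    'kernel': ['linear', 'rbf'],
    'C': expon(scale=100),
    'gamma': expon(scale=0.1),
}

cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=1)

# Draw and evaluate 15 settings; random_state makes the draws repeatable.
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=15, cv=cv,
                            scoring='accuracy', random_state=1)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

    Note that gamma has no effect when the sampled kernel is linear, so a reported best setting with a linear kernel says nothing about the gamma value listed alongside it.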
  • Original article: https://www.cnblogs.com/nickchen121/p/11686725.html