  • Machine Learning Basics: Temperature Prediction with Random Forests and Tuning How Different Parameters Affect the Result. 1. RandomizedSearchCV (random selection of parameter combinations) 2. GridSearchCV (grid search over parameters) 3. pprint (pretty-printing) 4. rf.get_params() (get the model's current parameters)

    We use RandomizedSearchCV with 100 iterations to pick the best parameter combination from the candidate parameter space.

    Building on the RandomizedSearchCV result, we use GridSearchCV to evaluate a few parameter combinations around that best set and fine-tune the parameters.
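    A minimal sketch of this two-stage pattern (the hard-coded dict and step sizes are illustrative, not from the original post): read best_params_ from the random search and build a narrow grid around each value for GridSearchCV, which is what the code below does by hand.

    # suppose rf_random = RandomizedSearchCV(...) has already been fit
    best = {'n_estimators': 800, 'max_depth': 10}  # stands in for rf_random.best_params_
    grid_param = {
        'n_estimators': [best['n_estimators'] - 200, best['n_estimators'], best['n_estimators'] + 200],
        'max_depth': [best['max_depth'] - 2, best['max_depth'], best['max_depth'] + 2],
    }
    # -> {'n_estimators': [600, 800, 1000], 'max_depth': [8, 10, 12]}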

    1. RandomizedSearchCV(estimator=rf, param_distributions=param_random, cv=3, verbose=2, random_state=42, n_iter=100)  # randomly sample parameter combinations

    Parameters: estimator is the model to tune; param_distributions is the candidate parameter space; cv is the number of cross-validation folds; verbose controls how much progress output is printed; random_state is the random seed; n_iter is the number of parameter combinations to sample (with n_iter=100 and cv=3, the model is fit 300 times).
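    As a side note (a sketch, not from the original post): param_distributions also accepts scipy.stats distributions, so candidates can be drawn from a range instead of a fixed list. The ranges below are illustrative.

    from scipy.stats import randint
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    # each candidate draws its values from these distributions
    param_random = {
        'n_estimators': randint(200, 2000),  # any integer in [200, 2000)
        'max_depth': randint(10, 100),
        'min_samples_leaf': randint(2, 8),
    }
    search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distributions=param_random,
                                n_iter=100, cv=3, random_state=42)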

    2. GridSearchCV(estimator=rf, param_grid=grid_param, cv=3, verbose=2)

    Parameters: estimator is the model to tune; param_grid is the grid of candidate parameter combinations; cv is the number of cross-validation folds; verbose controls how much progress output is printed. Unlike RandomizedSearchCV, every combination in the grid is evaluated.
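    Since grid search cost grows multiplicatively with the grid, it can help to count the candidates before launching the search. A small sketch (not from the original post) using sklearn's ParameterGrid with the grid from step 9 below:

    from sklearn.model_selection import ParameterGrid

    grid_param = {'n_estimators': [600, 800, 1000],
                  'min_samples_split': [4],
                  'min_samples_leaf': [4],
                  'max_depth': [8, 10, 12]}
    # 3 * 1 * 1 * 3 = 9 combinations; with cv=3 that means 27 model fits
    print(len(ParameterGrid(grid_param)))  # 9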

    3. pprint(rf.get_params())

    Parameters: pprint pretty-prints the dictionary with its keys in sorted order; rf.get_params() returns the random forest model's current parameters as a dict.
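    For reference, a truncated sketch of the kind of output this produces (exact keys and default values vary with the sklearn version):

    from pprint import pprint
    from sklearn.ensemble import RandomForestRegressor

    pprint(RandomForestRegressor(random_state=42, n_estimators=1000).get_params())
    # {'bootstrap': True,
    #  'max_depth': None,
    #  'min_samples_leaf': 1,
    #  'min_samples_split': 2,
    #  'n_estimators': 1000,
    #  'random_state': 42,
    #  ...}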

    Code:

    Step 1: load the data

    Step 2: one-hot encode the text columns in the data

    Step 3: extract the features and the target

    Step 4: split the data into training and test sets with train_test_split

    Step 5: build a random forest on the training set and train it

    Step 6: sort the features by model importance (rf.feature_importances_) and keep the most important ones (the top 5, whose importances sum to more than 95%)

    Step 7: rebuild the random forest model on the selected features

    Step 8: use RandomizedSearchCV() to randomly sample parameter combinations

    Step 9: using the parameter combination found above, use GridSearchCV() to search near those values and fine-tune the parameters

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    
    # Step 1: load the data
    data = pd.read_csv('data/temps_extended.csv')
    # Step 2: one-hot encode the text columns
    data = pd.get_dummies(data)
    # Step 3: extract the features and the target
    X = data.drop('actual', axis=1)
    feature_names = np.array(X.columns)
    y = np.array(data['actual'])
    X = np.array(X)
    # Step 4: split the samples with train_test_split
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Step 5: build the model and predict
    rf = RandomForestRegressor(random_state=42, n_estimators=1000)
    rf.fit(train_x, train_y)
    pre_y = rf.predict(test_x)
    # mean absolute error (MAE)
    mae = round(abs(pre_y - test_y).mean(), 2)
    # accuracy, i.e. 100 - MAPE
    accuracy = round(((1 - abs(pre_y - test_y) / test_y) * 100).mean(), 2)
    print(mae, accuracy)
    
    # Step 6: keep the features whose cumulative importance reaches 95%
    # get the feature importance scores
    feature_importances = rf.feature_importances_
    # pair each feature name with its importance score
    feature_importances_names = [(feature_name, feature_importance) for feature_name, feature_importance in
                                 zip(feature_names, feature_importances)]
    # sort the (name, score) pairs by importance score, descending
    feature_importances_names = sorted(feature_importances_names, key=lambda x: x[1], reverse=True)
    # feature names after sorting
    feature_importances_n = [x[0] for x in feature_importances_names]
    # importance scores after sorting
    feature_importances_v = [x[1] for x in feature_importances_names]
    # cumulative sum of the sorted importance scores
    feature_importances_v_add = np.cumsum(feature_importances_v)
    # keep every feature up to the first one at which the cumulative importance exceeds 0.95
    little_feature_name = feature_importances_n[:np.where(feature_importances_v_add > 0.95)[0][0] + 1]
    
    # Step 7: rebuild the model on the selected features (the 5 most important ones)
    X = data[little_feature_name].values
    y = data['actual'].values

    # split the samples with train_test_split
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=42)
    rf = RandomForestRegressor(random_state=42, n_estimators=1000)
    
    # Step 8: use RandomizedSearchCV to randomly sample parameter combinations

    # print rf's current parameters with pprint
    from pprint import pprint
    pprint(rf.get_params())

    from sklearn.model_selection import RandomizedSearchCV
    # number of trees
    n_estimators = list(range(200, 2000, 100))
    min_samples_leaf = [2, 4, 6]
    min_samples_split = [2, 4, 6]  # must be at least 2
    max_features = [1.0, 'sqrt']  # 1.0 replaces 'auto', which newer sklearn versions removed
    bootstrap = [True, False]
    max_depth = list(range(10, 100, 10))
    param_random = {
        'n_estimators': n_estimators,
        'max_depth': max_depth,
        'max_features': max_features,
        'min_samples_leaf': min_samples_leaf,
        'min_samples_split': min_samples_split,
        'bootstrap': bootstrap
    }
    
    rf = RandomForestRegressor()
    rf_random = RandomizedSearchCV(estimator=rf, param_distributions=param_random, cv=3, verbose=2,
                                   random_state=42, n_iter=100)
    rf_random.fit(train_x, train_y)
    # get the best trained model
    best_estimator = rf_random.best_estimator_
    # define a function to compute the error and the accuracy
    def Calculation_accuracy(estimator, test_x, test_y):
        pre_y = estimator.predict(test_x)
        error = abs(pre_y - test_y).mean()  # mean absolute error
        accuracy = ((1 - abs(pre_y - test_y) / test_y) * 100).mean()  # 100 - MAPE
        return error, accuracy
    # compute the error and the accuracy
    error, accuracy = Calculation_accuracy(best_estimator, test_x, test_y)
    print(error, accuracy)
    # print the best parameter combination
    print(rf_random.best_params_)
    # example best combination: {'n_estimators': 800, 'min_samples_split': 4, 'min_samples_leaf': 4,
    # 'max_features': 1.0, 'max_depth': 10, 'bootstrap': True}
    
    # Step 9: using the parameters from RandomizedSearchCV, fine-tune them with GridSearchCV
    from sklearn.model_selection import GridSearchCV
    
    n_estimators = [600, 800, 1000]
    min_samples_split = [4]
    min_samples_leaf = [4]
    max_depth = [8, 10, 12]
    grid_param = {
        'n_estimators': n_estimators,
        'min_samples_split': min_samples_split,
        'min_samples_leaf': min_samples_leaf,
        'max_depth': max_depth
    }
    rf = RandomForestRegressor()
    rf_grid = GridSearchCV(rf, param_grid=grid_param, cv=3, verbose=2)
    rf_grid.fit(train_x, train_y)
    best_estimator = rf_grid.best_estimator_
    error, accuracy = Calculation_accuracy(best_estimator, test_x, test_y)
    print(error, accuracy)
    print(rf_grid.best_params_)
    # {'max_depth': 8, 'min_samples_leaf': 4, 'min_samples_split': 4, 'n_estimators': 1000}
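    As a follow-up (not in the original post), the tuned model can be persisted with joblib so the searches do not have to be rerun; the filename is illustrative.

    import joblib

    # save the tuned model to disk, then load it back for later predictions
    joblib.dump(best_estimator, 'rf_temperature_model.joblib')
    loaded_rf = joblib.load('rf_temperature_model.joblib')
    print(loaded_rf.predict(test_x[:5]))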
  • Original post: https://www.cnblogs.com/my-love-is-python/p/10316478.html