  • Model Ensembling: Stacking Tuning Notes

    1. Regression

    We trained two regressors, GBDT and XGBoost, and use them as the base models for stacking.

    The regressors below use hyperparameters that were already tuned beforehand:

    from sklearn.ensemble import GradientBoostingRegressor
    from xgboost import XGBRegressor
    
    gbdt_nxf = GradientBoostingRegressor(learning_rate=0.06,n_estimators=250,
                                      min_samples_split=700,min_samples_leaf=70,max_depth=6,
                                      max_features='sqrt',subsample=0.8,random_state=75)
    xgb_nxf = XGBRegressor(learning_rate=0.06,max_depth=6,n_estimators=200,random_state=75)
    

      

    Pre-allocate the matrices that stacking will fill in:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold,StratifiedKFold
    # StratifiedKFold assumes a discrete target; for a truly continuous label, use plain KFold here
    kf = StratifiedKFold(n_splits=5,random_state=75,shuffle=True)
    
    from sklearn.metrics import r2_score
    
    train_proba = np.zeros((len(gbdt_train_data),2))
    train_proba = pd.DataFrame(train_proba)
    train_proba.columns = ['gbdt_nxf','xgb_nxf']
    
    test_proba = np.zeros((len(gbdt_test_data),2))
    test_proba = pd.DataFrame(test_proba)
    test_proba.columns = ['gbdt_nxf','xgb_nxf']
    

      

    reg_names = ['gbdt_nxf','xgb_nxf']
    
    for i,reg in enumerate([gbdt_nxf,xgb_nxf]):
        pred_list = []
        col = reg_names[i]
        for train_index,val_index in kf.split(gbdt_train_data,gbdt_train_label):
            x_train = gbdt_train_data.loc[train_index,:].values
            y_train = gbdt_train_label[train_index]
            x_val = gbdt_train_data.loc[val_index,:].values
            y_val = gbdt_train_label[val_index]
            
            reg.fit(x_train,y_train)
            y_vali = reg.predict(x_val)
            train_proba.loc[val_index,col] = y_vali
            print('%s cv r2 %s'%(col,r2_score(y_val,y_vali)))
            
            y_testi = reg.predict(gbdt_test_data.values)
            pred_list.append(y_testi)
        test_proba.loc[:,col] = np.mean(np.array(pred_list),axis=0)
    

    The best r2 score is 0.79753, which is still not particularly good.

    With 5-fold cross-validation, each fold's model predicts the whole test set; the five predictions are averaged to form the new test set. The new training set is the collection of out-of-fold predictions: for each fold, the model trained on the other four folds predicts that held-out fold.

    Because there are two base regressors, GBDT and XGBoost, train_proba ends up with two columns.

    Finally, we fit a model on the new training set and predict the new test set to get the result.
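The loop above can be condensed into a reusable helper; below is a minimal sketch on synthetic data (the `get_oof` name and the toy arrays are illustrative, not from the original code):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor

def get_oof(reg, X_train, y_train, X_test, n_splits=5, seed=75):
    """Out-of-fold predictions for the train set, fold-averaged predictions for the test set."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof_train = np.zeros(len(X_train))
    test_preds = []
    for tr_idx, val_idx in kf.split(X_train):
        reg.fit(X_train[tr_idx], y_train[tr_idx])
        oof_train[val_idx] = reg.predict(X_train[val_idx])  # each training sample predicted exactly once
        test_preds.append(reg.predict(X_test))              # every fold's model predicts the full test set
    return oof_train, np.mean(test_preds, axis=0)           # average the five test-set predictions

# toy data just to show the shapes
rng = np.random.RandomState(0)
X, y = rng.randn(100, 5), rng.randn(100)
X_test = rng.randn(20, 5)
oof, test_mean = get_oof(GradientBoostingRegressor(n_estimators=10), X, y, X_test)
print(oof.shape, test_mean.shape)  # (100,) (20,)
```

Each base model contributes one such pair of columns, which then become the meta-model's features.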

    # Use logistic regression for the stacking meta-model.
    # Caution: LogisticRegression is a classifier, so this only works when the
    # target takes discrete values; for a continuous target use LinearRegression
    # or Ridge instead.
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    scaler = StandardScaler()
    
    scaler.fit(train_proba)
    train_proba = scaler.transform(train_proba)
    test_proba = scaler.transform(test_proba)
    
    lr = LogisticRegression(tol=0.0001,C=0.5,random_state=24,max_iter=10)
    
    kf = StratifiedKFold(n_splits=5,random_state=75,shuffle=True)
    r2_list = []
    pred_list = []
    for train_index,val_index in kf.split(train_proba,gbdt_train_label):# the labels are still the original, true training labels
            x_train = train_proba[train_index]
            y_train = gbdt_train_label[train_index]
            x_val = train_proba[val_index]
            y_val = gbdt_train_label[val_index]
            
            lr.fit(x_train,y_train)
            y_vali = lr.predict(x_val)
            print('lr stacking cv r2 %s'%(r2_score(y_val,y_vali)))
            
            r2_list.append(r2_score(y_val,y_vali))
            
            y_testi = lr.predict(test_proba)
            pred_list.append(y_testi)
    
    final_pred = np.mean(np.array(pred_list),axis=0)  # average the five test-set predictions
    print(lr.coef_,lr.n_iter_)# severe overfitting
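When the meta-model overfits like this, a common remedy is a regularized linear meta-learner such as Ridge; a minimal sketch on synthetic meta-features (the toy arrays stand in for train_proba and the labels, and are not from the original post):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# synthetic stand-ins for the two meta-feature columns (gbdt_nxf, xgb_nxf)
rng = np.random.RandomState(75)
y = rng.randn(200)
meta = np.column_stack([y + 0.3 * rng.randn(200),   # base model 1 predictions
                        y + 0.4 * rng.randn(200)])  # base model 2 predictions

ridge = Ridge(alpha=1.0)  # L2 penalty shrinks the blend weights
scores = cross_val_score(ridge, meta, y, cv=5, scoring='r2')
print(scores.mean())
```

With only a handful of meta-features, the penalty mostly protects against one base model's noise dominating the blend.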

    2. Classification

      After tuning each single model, we can combine them into a stacking ensemble.

      We split the data into five even folds for cross-training: train on four folds, then use the trained model to predict both the held-out fold and the test set. After the five CV rounds, we have a prediction for every training sample plus five predictions for the test set, which we average. Each base model contributes one such group of predictions, so n base models yield n columns. Finally, another model is trained on these predictions to produce the stacking result; a linear model is the usual choice for this last stage.

      Stacking somewhat resembles a neural network: the base models act like lower layers that extract features from the input data.
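Incidentally, scikit-learn (0.22+) packages this exact procedure as StackingClassifier; a minimal sketch on synthetic data (the toy dataset and small n_estimators are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=24)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=24)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50, random_state=24)),
                ('gbc', GradientBoostingClassifier(random_state=24))],
    final_estimator=LogisticRegression(),  # linear meta-model, the usual choice
    cv=5)                                  # 5-fold out-of-fold predictions, as in the manual loop
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(auc)
```

The manual loop below does the same thing but keeps the intermediate predictions visible, which is useful for inspecting each base model's CV score.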

    First, we define DataFrame structures to store the intermediate predictions:

    train_proba = np.zeros((len(train), 6))
    train_proba = pd.DataFrame(train_proba)
    train_proba.columns = ['rf','ada','etc','gbc','sk_xgb','sk_lgb']
    
    test_proba = np.zeros((len(test), 6))
    test_proba = pd.DataFrame(test_proba)
    test_proba.columns = ['rf','ada','etc','gbc','sk_xgb','sk_lgb']
    

    Define the base models, then cross-train and predict:

    from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                  GradientBoostingClassifier, ExtraTreesClassifier)
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier
    from sklearn.metrics import roc_auc_score
    
    rf = RandomForestClassifier(n_estimators=700, max_depth=13, min_samples_split=30,
                                min_weight_fraction_leaf=0.0, random_state=24, verbose=0)
    
    ada = AdaBoostClassifier(n_estimators=450, learning_rate=0.1, random_state=24)
    
    gbc = GradientBoostingClassifier(learning_rate=0.08,n_estimators=150,max_depth=9,
                                      min_samples_leaf=70,min_samples_split=900,
                                      max_features='sqrt',subsample=0.8,random_state=10)
    
    etc = ExtraTreesClassifier(n_estimators=290, max_depth=12, min_samples_split=30,random_state=24)
    
    sk_xgb = XGBClassifier(learning_rate=0.05,n_estimators=400,
                            min_child_weight=20,max_depth=3,subsample=0.8, colsample_bytree=0.8,
                            reg_lambda=1., random_state=10)
    
    sk_lgb = LGBMClassifier(num_leaves=31,max_depth=3,learning_rate=0.03,n_estimators=600,
                         subsample=0.8, colsample_bytree=0.9, objective='binary', 
                         min_child_weight=0.001, subsample_freq=1, min_child_samples=10,
                         reg_alpha=0.0, reg_lambda=0.0, random_state=10, n_jobs=-1, 
                         silent=True, importance_type='split')
    
    kf = StratifiedKFold(n_splits=5,random_state=233,shuffle=True)
    
    clf_name = ['rf','ada','etc','gbc','sk_xgb','sk_lgb']
    for i,clf in enumerate([rf,ada,etc,gbc,sk_xgb,sk_lgb]):
        pred_list = []
        col = clf_name[i] 
        for train_index, val_index in kf.split(train,label):
            X_train = train.loc[train_index,:].values
            y_train = label[train_index]
            X_val = train.loc[val_index,:].values
            y_val = label[val_index]
    
            clf.fit(X_train, y_train)
            y_vali = clf.predict_proba(X_val)[:,1]
            train_proba.loc[val_index,col] = y_vali
            print("%s cv auc %s" % (col, roc_auc_score(y_val, y_vali)))
    
            y_testi = clf.predict_proba(test.values)[:,1]
            pred_list.append(y_testi)
    
        test_proba.loc[:,col] = np.mean(np.array(pred_list),axis=0)
    

    Use logistic regression for the final stacking stage:

    scaler = StandardScaler()
    train_proba = train_proba.values
    test_proba = test_proba.values
    
    scaler.fit(train_proba)
    train_proba = scaler.transform(train_proba)
    test_proba = scaler.transform(test_proba)
    
    
    lr = LogisticRegression(tol=0.0001, C=0.5, random_state=24, max_iter=10)
    
    kf = StratifiedKFold(n_splits=5,random_state=244,shuffle=True)
    auc_list = []
    pred_list = []
    for train_index, val_index in kf.split(train_proba,label):
        X_train = train_proba[train_index]
        y_train = label[train_index]
        X_val = train_proba[val_index]
        y_val = label[val_index]
    
        lr.fit(X_train, y_train)
        y_vali = lr.predict_proba(X_val)[:,1]
        print("lr stacking cv auc %s" % (roc_auc_score(y_val, y_vali)))
    
        auc_list.append(roc_auc_score(y_val, y_vali))
    
        y_testi = lr.predict_proba(test_proba)[:,1]
        pred_list.append(y_testi)
    
    final_pred = np.mean(np.array(pred_list), axis=0)  # average the five test-set predictions
    print(lr.coef_, lr.n_iter_)
    

    The final auc scores of the base models and the stacking model are:

     0.8415, 0.8506, 0.8511, 0.8551, 0.8572, 0.8580, and 0.8584, respectively.

  • Original post: https://www.cnblogs.com/nxf-rabbit75/p/10596180.html