  • GBDT, Random Forest

    author:yangjing

    time:2018-10-22


    Gradient boosting decision tree

    1. main idea

    The main idea behind GBDT is to combine many simple models (also known as weak learners), such as shallow trees. Each tree can provide good predictions on only part of the data, so more and more trees are added to iteratively improve performance.
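
    For intuition, the boosting loop can be written out by hand for regression with squared loss: each new shallow tree is fit to the residuals of the current ensemble, and its shrunken predictions are added to the running total. This is only an illustrative sketch on made-up data, not the exact sklearn implementation:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(200, 1))        # toy 1-D regression data
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

    pred = np.zeros_like(y)                      # start from an all-zero prediction
    learning_rate = 0.1                          # shrink each tree's contribution
    trees = []
    for _ in range(100):
        residual = y - pred                      # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred += learning_rate * tree.predict(X)  # each shallow tree corrects a little
        trees.append(tree)

    print(np.mean((y - pred) ** 2))              # training error shrinks as trees are added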

    2. parameter settings

    The algorithm is a bit more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly. The most important settings are listed below; a short sketch follows the list.

    • number of trees
      Increasing n_estimators also increases the model complexity, as the model has more chances to correct mistakes on the training set.
    • learning rate
      Controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models.
    • max_depth
      Or, alternatively, max_leaf_nodes. Usually max_depth is set very low for gradient-boosted models, often not deeper than five splits.
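
    Lowering max_depth or the learning rate is the usual way to rein in an overfitting gradient-boosted model. A minimal sketch on the breast cancer data used in the next section; the particular values max_depth=1 and learning_rate=0.01 are only examples, not recommended settings:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    cancer = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, random_state=0)

    # Shallower trees or a smaller learning rate both reduce model complexity;
    # comparing train and test accuracy shows how much each helps.
    for kwargs in ({"max_depth": 1}, {"learning_rate": 0.01}):
        gbrt = GradientBoostingClassifier(random_state=0, **kwargs)
        gbrt.fit(X_train, y_train)
        print(kwargs, "train:", gbrt.score(X_train, y_train),
              "test:", gbrt.score(X_test, y_test))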

    3. code

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_breast_cancer

    # Load the breast cancer dataset and hold out a test set.
    cancer = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, random_state=0)

    # Default settings: 100 trees of max_depth 3, learning_rate 0.1.
    gbrt = GradientBoostingClassifier(random_state=0)
    gbrt.fit(X_train, y_train)
    print(gbrt.score(X_test, y_test))  # accuracy on the test set
    

    In [261]: X_train,X_test,y_train,y_test=train_test_split(cancer.data,cancer.target,random_state=0)
         ...: gbrt=GradientBoostingClassifier(random_state=0)
         ...: gbrt.fit(X_train,y_train)
         ...: gbrt.score(X_test,y_test)
         ...:
    Out[261]: 0.958041958041958
    
    In [262]: gbrt.feature_importances_
    Out[262]:
    array([0.01337291, 0.04201687, 0.0208666 , 0.01889077, 0.01028091,
           0.03215986, 0.02074619, 0.11678956, 0.00820024, 0.00074312,
           0.02042134, 0.00680047, 0.02023052, 0.03907398, 0.05406751,
           0.04795741, 0.02358101, 0.00934718, 0.00593481, 0.0239241 ,
           0.05354265, 0.06160083, 0.10961728, 0.07395201, 0.01867851,
           0.03842953, 0.01915824, 0.07128703, 0.01773659, 0.00059199])
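
    The importances above line up with cancer.feature_names, one value per input feature. A minimal plotting sketch, assuming matplotlib is available and reusing gbrt and cancer from the session above:

    import numpy as np
    import matplotlib.pyplot as plt

    n_features = cancer.data.shape[1]
    plt.barh(np.arange(n_features), gbrt.feature_importances_)
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("feature importance")
    plt.tight_layout()
    plt.show()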
    
    In [263]: gbrt.learning_rate
    Out[263]: 0.1
    
    In [264]: gbrt.max_depth
    Out[264]: 3
    
    In [265]: len(gbrt.estimators_)
    Out[265]: 100
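
    Since the fitted ensemble stores 100 trees, the effect of adding trees one at a time can be inspected with staged_predict, which yields the prediction after each boosting stage. A small sketch reusing gbrt, X_test and y_test from the session above:

    import numpy as np

    # Test accuracy after 1, 2, ..., 100 trees.
    staged_acc = [np.mean(stage_pred == y_test)
                  for stage_pred in gbrt.staged_predict(X_test)]
    print(staged_acc[0], staged_acc[-1])   # a single tree vs. the full ensemble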
    
    In [272]: gbrt.get_params()
    Out[272]:
    {'criterion': 'friedman_mse',
     'init': None,
     'learning_rate': 0.1,
     'loss': 'deviance',
     'max_depth': 3,
     'max_features': None,
     'max_leaf_nodes': None,
     'min_impurity_decrease': 0.0,
     'min_impurity_split': None,
     'min_samples_leaf': 1,
     'min_samples_split': 2,
     'min_weight_fraction_leaf': 0.0,
     'n_estimators': 100,
     'presort': 'auto',
     'random_state': 0,
     'subsample': 1.0,
     'verbose': 0,
     'warm_start': False}
    

    Random forest
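
    The session below picks up mid-stream (In [230]): a two-class dataset X, y of 100 samples has already been built and RandomForestClassifier already imported. A setup consistent with the shapes and labels in the transcript would be the usual two_moons toy data; the exact noise and random_state values here are assumptions:

    from sklearn.datasets import make_moons
    from sklearn.ensemble import RandomForestClassifier

    # 100 two-dimensional samples with binary labels.
    X, y = make_moons(n_samples=100, noise=0.25, random_state=3)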

    In [230]: y
    Out[230]:
    array([1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
           0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
           1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
           0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
           0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0], dtype=int64)
    
    In [232]: from sklearn.model_selection import train_test_split
    
    In [233]: X_train,X_test,y_train,y_test=train_test_split(X,y,stratify=y,random_state=42)
    
    In [234]: len(X_train)
    Out[234]: 75
    
    In [235]: forest=RandomForestClassifier(n_estimators=5,random_state=2)
    
    In [236]: forest.fit(X_train,y_train)
    Out[236]:
    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=None, max_features='auto', max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
                oob_score=False, random_state=2, verbose=0, warm_start=False)
    
    In [237]: forest.score(X_test,y_test)
    Out[237]: 0.92
    
    In [238]: forest.estimators_
    Out[238]:
    [DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=1872583848, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=794921487, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=111352301, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=1853453896, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=213298710, splitter='best')]
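
    Each member of forest.estimators_ is a full DecisionTreeClassifier and can be scored on its own; the individual trees typically overfit their bootstrap samples and do worse on the test set than the averaged forest. A small sketch reusing forest, X_test and y_test from the cells above:

    for i, tree in enumerate(forest.estimators_):
        print("tree %d: %.2f" % (i, tree.score(X_test, y_test)))
    print("forest: %.2f" % forest.score(X_test, y_test))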
    

  • Original post: https://www.cnblogs.com/yangjing000/p/9832234.html