zoukankan      html  css  js  c++  java
  • GBDT,随机森林

    author:yangjing

    time:2018-10-22


    Gradient boosting decision tree

    1.main diea

    The main idea behind GBDT is to combine many simple models(also known as week kernels),like shallow trees.Each tree can only provide good predictions on part of the data,and so more and more trees are added to iteratively improve performance.

    2.parameters setting

    the algorithm is a bit more sensitive to parameter settings than random forests,but can provide better accuracy if the parameters are set correctly.

    • number of trees
      By increasing n_estimators ,also increasing the model complexity,as the model has more chances to correct misticks on the training set.
    • learning rate
      controns how strongly each tree tries to correct the misticks of the previous trees.A higher learning rate means each tree can make stronger correctinos,allowing for more complex models.
    • max_depth
      or alternatively max_leaf_nodes.Usyally max_depth is set very low for gradient-boosted models,often not deeper than five splits.

    3.code

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_breast_cancer
    cancer=load_breast_cancer()
    X_train,X_test,y_train,y_test=train_test_split(cancer.data,cancer.target,random_state=0)
    gbrt=GradientBoostingClassifier(random_state=0)
    gbrt.fit(X_train,y_train)
    gbrt.score(X_test,y_test)
    

    In [261]: X_train,X_test,y_train,y_test=train_test_split(cancer.data,cancer.target,random_state=0)
         ...: gbrt=GradientBoostingClassifier(random_state=0)
         ...: gbrt.fit(X_train,y_train)
         ...: gbrt.score(X_test,y_test)
         ...:
    Out[261]: 0.958041958041958
    
    In [262]: gbrt.feature_importances_
    Out[262]:
    array([0.01337291, 0.04201687, 0.0208666 , 0.01889077, 0.01028091,
           0.03215986, 0.02074619, 0.11678956, 0.00820024, 0.00074312,
           0.02042134, 0.00680047, 0.02023052, 0.03907398, 0.05406751,
           0.04795741, 0.02358101, 0.00934718, 0.00593481, 0.0239241 ,
           0.05354265, 0.06160083, 0.10961728, 0.07395201, 0.01867851,
           0.03842953, 0.01915824, 0.07128703, 0.01773659, 0.00059199])
    
    In [263]: gbrt.learning_rate
    Out[263]: 0.1
    
    In [264]: gbrt.max_depth
    Out[264]: 3
    
    In [265]: len(gbrt.estimators_)
    Out[266]: 100
    
    In [272]: gbrt.get_params()
    Out[272]:
    {'criterion': 'friedman_mse',
     'init': None,
     'learning_rate': 0.1,
     'loss': 'deviance',
     'max_depth': 3,
     'max_features': None,
     'max_leaf_nodes': None,
     'min_impurity_decrease': 0.0,
     'min_impurity_split': None,
     'min_samples_leaf': 1,
     'min_samples_split': 2,
     'min_weight_fraction_leaf': 0.0,
     'n_estimators': 100,
     'presort': 'auto',
     'random_state': 0,
     'subsample': 1.0,
     'verbose': 0,
     'warm_start': False}
    

    Random forest

    In [230]: y
    Out[230]:
    array([1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
           0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
           1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
           0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
           0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0], dtype=int64)
    
    In [231]: axes.ravel()
    Out[231]:
    array([<matplotlib.axes._subplots.AxesSubplot object at 0x000001F46F3694A8>,
           <matplotlib.axes._subplots.AxesSubplot object at 0x000001F46C099F28>,
           <matplotlib.axes._subplots.AxesSubplot object at 0x000001F46E6E3BE0>,
           <matplotlib.axes._subplots.AxesSubplot object at 0x000001F46BEB72E8>,
           <matplotlib.axes._subplots.AxesSubplot object at 0x000001F46ED67198>,
           <matplotlib.axes._subplots.AxesSubplot object at 0x000001F46F292C88>],
          dtype=object)
    
    In [232]: from sklearn.model_selection import train_test_split
    
    In [233]: X_trai,X_test,y_train,y_test=train_test_split(X,y,stratify=y,random_state=42)
    
    In [234]: len(X_trai)
    Out[234]: 75
    
    In [235]: fores=RandomForestClassifier(n_estimators=5,random_state=2)
    
    In [236]: fores.fit(X_trai,y_train)
    Out[236]:
    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=None, max_features='auto', max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
                oob_score=False, random_state=2, verbose=0, warm_start=False)
    
    In [237]: fores.score(X_test,y_test)
    Out[237]: 0.92
    
    In [238]: fores.estimators_
    Out[238]:
    [DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=1872583848, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=794921487, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=111352301, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=1853453896, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=213298710, splitter='best')]
    

  • 相关阅读:
    Canny边缘检测高低阈值问题
    EffectiveC++ Item25测试
    下载videolectures.net里的视频的方法
    01章 象数易理篇之一
    论文格式问题
    反序列化笔记
    什么是动爻
    EffectiveC++ Item25说的东东
    优化功能
    论文修改意见
  • 原文地址:https://www.cnblogs.com/yangjing000/p/9832234.html
Copyright © 2011-2022 走看看