  • GBDT, Random Forest

    author:yangjing

    time:2018-10-22


    Gradient boosting decision tree

    1. Main idea

    The main idea behind GBDT is to combine many simple models (also known as weak learners), such as shallow trees. Each tree can only provide good predictions on part of the data, so more and more trees are added to iteratively improve performance.
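
    To make the iterative idea concrete, here is a hand-rolled sketch of least-squares boosting (not scikit-learn's implementation; real implementations start from the mean prediction and support other losses):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.datasets import make_regression

    # Toy regression data; any X, y would do
    X, y = make_regression(n_samples=200, n_features=5, random_state=0)

    learning_rate = 0.1
    trees, residual = [], y.astype(float)
    for _ in range(100):
        # Each shallow tree fits what the ensemble so far still gets wrong
        tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
        trees.append(tree)
        residual -= learning_rate * tree.predict(X)

    # The ensemble prediction is the learning-rate-weighted sum of all trees
    pred = learning_rate * sum(t.predict(X) for t in trees)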

    2. Parameter settings

    The algorithm is a bit more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly. The most important settings are listed below; a sketch showing how to adjust them follows the list.

    • number of trees
      Increasing n_estimators also increases the model complexity, as the model has more chances to correct mistakes on the training set.
    • learning rate
      Controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models.
    • max_depth
      Or, alternatively, max_leaf_nodes. Usually max_depth is set very low for gradient-boosted models, often not deeper than five splits.
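
    A minimal sketch of how these knobs interact, using the same breast cancer data as the code section below (exact scores depend on the scikit-learn version; the tweaked values are illustrative, not a recommendation):

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_breast_cancer

    cancer = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, random_state=0)

    # Compare the defaults against two common regularization tweaks
    for kwargs in [{},                        # defaults: 100 trees, lr=0.1, depth 3
                   {'max_depth': 1},          # weaker individual trees
                   {'learning_rate': 0.01}]:  # gentler correction per tree
        gbrt = GradientBoostingClassifier(random_state=0, **kwargs)
        gbrt.fit(X_train, y_train)
        print(kwargs, gbrt.score(X_train, y_train), gbrt.score(X_test, y_test))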

    3. Code

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_breast_cancer

    # Load the breast cancer dataset and split it 75/25 into train and test
    cancer = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, random_state=0)

    # Fit a gradient-boosted ensemble with the defaults
    # (100 trees, learning_rate=0.1, max_depth=3) and report test accuracy
    gbrt = GradientBoostingClassifier(random_state=0)
    gbrt.fit(X_train, y_train)
    gbrt.score(X_test, y_test)
    

    In [261]: X_train,X_test,y_train,y_test=train_test_split(cancer.data,cancer.target,random_state=0)
         ...: gbrt=GradientBoostingClassifier(random_state=0)
         ...: gbrt.fit(X_train,y_train)
         ...: gbrt.score(X_test,y_test)
         ...:
    Out[261]: 0.958041958041958
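
    The 0.958 above is the test accuracy of the final 100-tree ensemble. The model also exposes staged predictions, which show how accuracy evolves as trees are added; a short sketch continuing the session above:

    import numpy as np
    from sklearn.metrics import accuracy_score

    # One prediction array per boosting stage: 1 tree, 2 trees, ..., 100 trees
    test_curve = [accuracy_score(y_test, pred)
                  for pred in gbrt.staged_predict(X_test)]
    print("best number of trees:", int(np.argmax(test_curve)) + 1)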
    
    In [262]: gbrt.feature_importances_
    Out[262]:
    array([0.01337291, 0.04201687, 0.0208666 , 0.01889077, 0.01028091,
           0.03215986, 0.02074619, 0.11678956, 0.00820024, 0.00074312,
           0.02042134, 0.00680047, 0.02023052, 0.03907398, 0.05406751,
           0.04795741, 0.02358101, 0.00934718, 0.00593481, 0.0239241 ,
           0.05354265, 0.06160083, 0.10961728, 0.07395201, 0.01867851,
           0.03842953, 0.01915824, 0.07128703, 0.01773659, 0.00059199])
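
    The importances above are easier to read alongside the feature names; a small sketch, reusing the cancer bunch loaded earlier:

    import numpy as np

    # Print the five most important features, largest first
    order = np.argsort(gbrt.feature_importances_)[::-1]
    for i in order[:5]:
        print(cancer.feature_names[i], gbrt.feature_importances_[i])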
    
    In [263]: gbrt.learning_rate
    Out[263]: 0.1
    
    In [264]: gbrt.max_depth
    Out[264]: 3
    
    In [265]: len(gbrt.estimators_)
    Out[265]: 100
    
    In [272]: gbrt.get_params()
    Out[272]:
    {'criterion': 'friedman_mse',
     'init': None,
     'learning_rate': 0.1,
     'loss': 'deviance',
     'max_depth': 3,
     'max_features': None,
     'max_leaf_nodes': None,
     'min_impurity_decrease': 0.0,
     'min_impurity_split': None,
     'min_samples_leaf': 1,
     'min_samples_split': 2,
     'min_weight_fraction_leaf': 0.0,
     'n_estimators': 100,
     'presort': 'auto',
     'random_state': 0,
     'subsample': 1.0,
     'verbose': 0,
     'warm_start': False}
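
    The defaults above are a reasonable starting point; in practice n_estimators, learning_rate, and max_depth are tuned together. A hedged sketch with GridSearchCV (the grid values are illustrative, not a recommendation):

    from sklearn.model_selection import GridSearchCV

    param_grid = {'n_estimators': [50, 100, 200],
                  'learning_rate': [0.01, 0.1, 1.0],
                  'max_depth': [1, 3, 5]}
    grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                        param_grid, cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_, grid.best_score_)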
    

    Random forest
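
    A random forest averages many decision trees, each trained on a bootstrap sample of the data and with every split chosen from a random subset of the features, which smooths out the overfitting of any single tree. The session below uses X and y without showing how they were built; the 100 binary labels are consistent with a small synthetic two-class dataset such as two_moons, so a plausible (assumed) setup is:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_moons

    # Assumed setup -- the original session does not show this step
    X, y = make_moons(n_samples=100, noise=0.25, random_state=3)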

    In [230]: y
    Out[230]:
    array([1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
           0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
           1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
           0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
           0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0], dtype=int64)
    
    [In [231]: axes.ravel() returned a 2x3 grid of matplotlib subplots, used to plot the individual trees and the forest.]

    In [232]: from sklearn.model_selection import train_test_split
    
    In [233]: X_train,X_test,y_train,y_test=train_test_split(X,y,stratify=y,random_state=42)
    
    In [234]: len(X_train)
    Out[234]: 75
    
    In [235]: forest=RandomForestClassifier(n_estimators=5,random_state=2)
    
    In [236]: forest.fit(X_train,y_train)
    Out[236]:
    RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                max_depth=None, max_features='auto', max_leaf_nodes=None,
                min_impurity_decrease=0.0, min_impurity_split=None,
                min_samples_leaf=1, min_samples_split=2,
                min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
                oob_score=False, random_state=2, verbose=0, warm_start=False)
    
    In [237]: forest.score(X_test,y_test)
    Out[237]: 0.92
    
    In [238]: forest.estimators_
    Out[238]:
    [DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=1872583848, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=794921487, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=111352301, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=1853453896, splitter='best'),
     DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                 max_features='auto', max_leaf_nodes=None,
                 min_impurity_decrease=0.0, min_impurity_split=None,
                 min_samples_leaf=1, min_samples_split=2,
                 min_weight_fraction_leaf=0.0, presort=False,
                 random_state=213298710, splitter='best')]
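
    estimators_ holds the five fitted trees, each of which can be scored on its own; a short sketch, continuing the session, showing how the averaged forest compares to its members:

    # Individual trees typically do worse on the test set than the forest,
    # which averages away part of each tree's overfitting
    for i, tree in enumerate(forest.estimators_):
        print("tree", i, "test accuracy:", tree.score(X_test, y_test))
    print("forest test accuracy:", forest.score(X_test, y_test))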
    

  • Original post: https://www.cnblogs.com/yangjing000/p/9832234.html