Ensemble Learning: Survey Notes

Ensemble learning

**Ensemble methods: there are roughly four ways to combine models: bagging / boosting / voting / stacking**

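There are many machine learning algorithms, and each one approaches a problem slightly differently, so different algorithms may give different results for the same problem. Which algorithm's result should be taken as the final answer? Instead of picking one, we can bring several algorithms together, let each of them predict on the same problem, and let the majority decide. That is the idea behind ensemble learning.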
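The majority-vote idea can be sketched in a few lines of plain Python. The function name `majority_vote` and the toy predictions below are made up for illustration and do not come from the original notes; with three binary voters no tie can occur, so tie-breaking is not addressed.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions from three classifiers for four test samples
lr_pred = [1, 0, 1, 1]
knn_pred = [1, 1, 0, 1]
dt_pred = [0, 0, 1, 1]

# For each sample, collect the three votes and keep the majority label
ensemble_pred = [majority_vote(votes) for votes in zip(lr_pred, knn_pred, dt_pred)]
print(ensemble_pred)  # prints [1, 0, 1, 1]
```

Each classifier is wrong on a different sample here, which is the situation where voting helps most.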
```
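# Import the models and the accuracy metric
# (X_train, X_test, y_train, y_test are assumed to be defined already)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Set seed for reproducibility
SEED = 1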

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
  
    # Fit clf to the training set
    clf.fit(X_train, y_train)    
  
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
  
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))


<script.py> output:
    Logistic Regression : 0.747
    K Nearest Neighbours : 0.724
    Classification Tree : 0.730
```
VotingClassifier
```
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train,y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

<script.py> output:
    Voting Classifier: 0.753
```
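Pleasantly, the Voting Classifier reaches an accuracy of 0.753, higher than each individual classifier before ensembling (Logistic Regression: 0.747, K Nearest Neighbours: 0.724, Classification Tree: 0.730). That is the appeal of ensembling.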

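Can I understand it this way: in my work on intelligent optimization algorithms, I combine two algorithms pairwise, for example the flower pollination algorithm with particle swarm optimization. This is not ensemble learning in the full sense, only a partial combination, but it does improve convergence.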

bagging

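Characteristics

Parallel ensemble: each model is built independently

Aims to reduce variance rather than bias (so the base models are likely to be ones that overfit)

Suited to high-variance, low-bias models (complex models)

A tree-based example is the random forest, which grows fully grown trees (note that RF modifies the growing procedure to reduce the correlation between trees)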
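As a rough illustration of the last point, the sketch below fits plain bagged trees and a random forest on the same data with scikit-learn. The synthetic dataset from `make_classification` and all parameter values are assumptions chosen for the example, not taken from the original notes, and no particular ranking of the two scores is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Bagging: the default base estimator is a fully grown decision tree,
# and every split may look at all features
bag = BaggingClassifier(n_estimators=50, random_state=1)
bag.fit(X_train, y_train)

# Random forest: also bootstrapped full trees, but each split only considers
# a random subset of the features, which decorrelates the trees
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=1)
rf.fit(X_train, y_train)

print("bagging accuracy:", bag.score(X_test, y_test))
print("random forest accuracy:", rf.score(X_test, y_test))
```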

Derivation

Input
Training set: \(D=\left\{\left(\boldsymbol{x}_{1}, y_{1}\right),\left(\boldsymbol{x}_{2}, y_{2}\right), \ldots,\left(\boldsymbol{x}_{m}, y_{m}\right)\right\}\)
Base learning algorithm: \(\mathfrak{L}\)
Number of training rounds: \(T\)
Process
for \(t=1,2, \dots, T\) do
\(h_{t}=\mathfrak{L}\left(D, \mathcal{D}_{bs}\right)\)
end for
Output
\(H(\boldsymbol{x})=\underset{y \in \mathcal{Y}}{\arg \max } \sum_{t=1}^{T} \mathbb{I}\left(h_{t}(\boldsymbol{x})=y\right)\)
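The procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the base learner `train_stump` (a one-dimensional threshold classifier), the helper names, and the noisy toy data are all hypothetical. `bagging_fit` plays the role of the loop over t = 1..T with a fresh bootstrap sample each round, and `bagging_predict` is the arg max over vote counts.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples from data uniformly, with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(data):
    """Hypothetical base learner: best 1-D threshold rule for labels 0/1."""
    best = None
    for thr in sorted({x for x, _ in data}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= thr else right) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def bagging_fit(data, base_learner, T, seed=0):
    """Train T base models, each on its own bootstrap sample of data."""
    rng = random.Random(seed)
    return [base_learner(bootstrap_sample(data, rng)) for _ in range(T)]

def bagging_predict(models, x):
    """Return the label that receives the most votes among the models."""
    votes = Counter(h(x) for h in models)
    return votes.most_common(1)[0][0]

# Toy data: label is 1 when x > 0.5, with 10% of the labels flipped
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0, 1)
    y = 1 if x > 0.5 else 0
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

models = bagging_fit(data, train_stump, T=25)
print(bagging_predict(models, 0.9), bagging_predict(models, 0.1))
```

A stump is a low-variance learner, so it is used here only to keep the code short; bagging pays off more with unstable learners such as fully grown trees.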

Flowchart

Implementation notes

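In scikit-learn:
The parameters max_samples and max_features control the size of the subsets (in terms of samples and features).
The parameters bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement.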
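A short sketch of how these four parameters fit together is given below. The synthetic data and the chosen values are assumptions for illustration only; the fitted attributes `estimators_samples_` and `estimators_features_` are used to inspect what each base estimator actually received.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Synthetic data: 200 samples, 10 features (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

bc = BaggingClassifier(
    n_estimators=10,
    max_samples=0.5,           # each base estimator is trained on 50% of the rows
    max_features=0.5,          # ... and on 50% of the columns
    bootstrap=True,            # rows are drawn with replacement
    bootstrap_features=False,  # columns are drawn without replacement
    random_state=1,
)
bc.fit(X, y)

# Number of row indices drawn for the first base estimator
print(len(bc.estimators_samples_[0]))
# Column indices used by the first base estimator
print(sorted(bc.estimators_features_[0]))
```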

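Bagging, also called bootstrap aggregating, is a technique that repeatedly samples from the data (with replacement) according to a uniform probability distribution.
A base classifier is trained on each bootstrap sample; the trained classifiers then vote, and a test sample is assigned to the class that receives the most votes.
Each bootstrap sample is the same size as the original data.
Because sampling is with replacement, some examples may appear several times in the same training set while others may be left out.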
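To make "some appear several times, some are left out" concrete: when n indices are drawn with replacement from n examples, the chance that a given example is never drawn is (1 - 1/n)^n, which approaches 1/e ≈ 0.368 for large n. The sketch below checks this empirically; the helper name `bootstrap_indices`, the seed, and the sample size are arbitrary choices for the example.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) uniformly, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(1)
n = 10_000
sample = bootstrap_indices(n, rng)

# Fraction of distinct examples that made it into the bootstrap sample
unique_fraction = len(set(sample)) / n
print(f"in-bag fraction: {unique_fraction:.3f}")          # close to 1 - 1/e
print(f"out-of-bag fraction: {1 - unique_fraction:.3f}")  # close to 1/e
```

The left-out examples are the "out-of-bag" samples that the OOB evaluation relies on.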

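Evaluation

Bagging improves generalization error by reducing the variance of the base classifiers.
Its performance depends on the stability of the base classifier: if the base classifier is unstable, bagging helps reduce the error caused by random fluctuations in the training data; if it is stable, the error of the ensemble is mainly due to the bias of the base classifier.
Because every example is selected with the same probability, bagging does not focus on any particular instance of the training set.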
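One standard way to see the variance claim is sketched below. It rests on simplifying assumptions that are not part of the original notes: the \(T\) base predictors \(h_{t}(\boldsymbol{x})\) are treated as real-valued outputs that are averaged, each with variance \(\sigma^{2}\) and pairwise correlation \(\rho\).

\[
\operatorname{Var}\left(\frac{1}{T} \sum_{t=1}^{T} h_{t}(\boldsymbol{x})\right)=\frac{1}{T^{2}}\left(T \sigma^{2}+T(T-1) \rho \sigma^{2}\right)=\rho \sigma^{2}+\frac{1-\rho}{T} \sigma^{2}
\]

The second term shrinks as \(T\) grows, which is the variance reduction bagging provides. The first term does not depend on \(T\), which is why lowering the correlation between base models (as random forests try to do) matters. Bias is unaffected by averaging, consistent with the remark about stable base classifiers.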

BaggingClassifier parameters
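base_estimator: object or None. None means the default, a decision tree; an object specifies the base estimator. (Newer scikit-learn releases name this parameter estimator.)

n_estimators: int, optional (default=10). The number of base estimators in the ensemble.

max_samples: int or float, optional (default=1.0). The number of samples drawn from the training set to train each base estimator. An int is a count; a float is a fraction.

max_features: int or float, optional (default=1.0). The number of features drawn from the training set to train each base estimator. An int is a count; a float is a fraction.

bootstrap: boolean, optional (default=True). Whether samples are drawn with replacement (True) or without replacement (False).

bootstrap_features: boolean, optional (default=False). Whether features are drawn with replacement (True) or without replacement (False).

oob_score: bool. Whether to use out-of-bag samples to estimate the generalization error.

warm_start: bool, optional (default=False). True means the result of the previous call to fit is reused and more estimators are added to the ensemble; otherwise a whole new ensemble is fitted.

n_jobs: int, optional (default=1). The number of jobs to run in parallel.

random_state: int, RandomState instance or None, optional (default=None). If an int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

verbose: int, optional (default=0). Controls the verbosity when fitting and predicting.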

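Attributes:

estimators_: list of estimators. The collection of fitted sub-estimators.

estimators_samples_: list of arrays. The subset of drawn samples for each base estimator.

estimators_features_: list of arrays. The subset of drawn features for each base estimator.

oob_score_: float. The score of the training dataset obtained using an out-of-bag estimate.

oob_prediction_: array of shape = [n_samples]. The prediction computed with the out-of-bag estimate on the training set. If n_estimators is small, a data point may never have been left out during bootstrapping; in that case oob_prediction_ may contain NaN. (This attribute is exposed by BaggingRegressor; BaggingClassifier exposes oob_decision_function_ instead.)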
```
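# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Import accuracy_score
# (X_train, X_test, y_train, y_test are assumed to be defined already)
from sklearn.metrics import accuracy_score

# Instantiate dt
dt = DecisionTreeClassifier(random_state=1)

# Instantiate bc; dt is passed positionally because the keyword is
# base_estimator in older scikit-learn releases and estimator in newer ones
bc = BaggingClassifier(dt, n_estimators=50, random_state=1)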

# Fit bc to the training set
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate acc_test
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test))

<script.py> output:
    Test set accuracy of bc: 0.71
```
Out of Bag Evaluation

OOB_score
```
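# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Import accuracy_score
# (X_train, X_test, y_train, y_test are assumed to be defined already)
from sklearn.metrics import accuracy_score

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=1)

# Instantiate bc; dt is passed positionally because the keyword is
# base_estimator in older scikit-learn releases and estimator in newer ones
bc = BaggingClassifier(dt,
                       n_estimators=50,
                       oob_score=True,
                       random_state=1)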

# Fit bc to the training set 
bc.fit(X_train, y_train)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate test set accuracy
acc_test = accuracy_score(y_test, y_pred)

# Evaluate OOB accuracy
acc_oob = bc.oob_score_

# Print acc_test and acc_oob
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob))

<script.py> output:
    Test set accuracy: 0.698, OOB accuracy: 0.704
```
Random Forests (RF)

See this article.
Personally, I find it more effective to first understand how each classifier works and only then move on to ensemble learning.
https://www.cnblogs.com/gaowenxingxing/p/12345225.html

boosting

Adaboost

Original post: https://www.cnblogs.com/gaowenxingxing/p/12355856.html
