  • Statistical Learning Methods: Implementing AdaBoost

    AdaBoost

    Applicable problem: binary classification.

    • Model: additive model

    \[ f(x) = \sum_{m=1}^{M} \alpha_{m} G_{m}(x) \]

    • Strategy: the loss function is the exponential loss

    \[ L(y, f(x)) = \exp[-y f(x)] \]

    • Algorithm: forward stagewise algorithm

    \[ \left(\beta_{m}, \gamma_{m}\right) = \arg\min_{\beta, \gamma} \sum_{i=1}^{N} L\left(y_{i},\, f_{m-1}(x_{i}) + \beta b(x_{i}; \gamma)\right) \]
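
    For AdaBoost the basis functions are classifiers \(G(x) \in \{-1, +1\}\) and the loss is exponential, so each stagewise step minimizes \(\sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i \alpha G(x_i)]\) over \(\alpha\) and \(G\), where \(\bar{w}_{mi} = \exp[-y_i f_{m-1}(x_i)]\). Splitting the sum over correctly and incorrectly classified samples gives

    \[ \sum_{i=1}^{N} \bar{w}_{mi} \exp[-y_i \alpha G(x_i)] = (e^{\alpha} - e^{-\alpha}) \sum_{i=1}^{N} \bar{w}_{mi} I(y_i \neq G(x_i)) + e^{-\alpha} \sum_{i=1}^{N} \bar{w}_{mi} \]

    and setting the derivative with respect to \(\alpha\) to zero yields the classifier weight \(\alpha_m = \frac{1}{2} \log \frac{1-e_m}{e_m}\) used in step 4 of the algorithm below.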

    Characteristics: AdaBoost learns one basic classifier per iteration. In each round it raises the weights of the samples misclassified by the previous round's classifier and lowers the weights of the correctly classified samples. Finally, AdaBoost takes a linear combination of the basic classifiers as the strong classifier, assigning large weights to basic classifiers with small error rates and small weights to those with large error rates.

    Algorithm steps

    1) Assign a weight to each training sample \((x_1, x_2, \ldots, x_N)\); the initial weights \(w_1\) are all \(1/N\).

    2) Train on the weighted samples to obtain model \(G_m\) (the initial model is \(G_1\)).

    3) Compute the misclassification rate of model \(G_m\): \(e_m = \sum_{i=1}^{N} w_i I(y_i \neq G_m(x_i))\). (The misclassification rate should be below 0.5; otherwise, flipping the predictions gives a classifier whose misclassification rate is below 0.5.)

    4) Compute the coefficient of model \(G_m\): \(\alpha_m = 0.5 \log[(1-e_m)/e_m]\).

    5) Update the weight vector \(w_{m+1}\) from the misclassification rate \(e_m\) and the current weight vector \(w_m\), as shown in the sketch after this list.

    6) Compute the misclassification rate of the combined model \(f(x) = \sum_{m=1}^{M} \alpha_m G_m(x)\).

    7) Stop when the combined model's misclassification rate falls below a threshold or the number of iterations reaches its limit; otherwise, return to step 2).
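
    Steps 3) to 5) reduce to a few array operations. A minimal NumPy sketch of one boosting round, using illustrative data (the variable names here are not part of the implementation below):

    import numpy as np

    y = np.array([1, 1, -1, -1])         # true labels
    preds = np.array([1, -1, -1, -1])    # stump predictions G_m(x_i)
    w = np.full(len(y), 1 / len(y))      # current sample weights w_m

    e_m = np.sum(w[preds != y])                # weighted misclassification rate (step 3)
    alpha_m = 0.5 * np.log((1 - e_m) / e_m)    # classifier coefficient (step 4)

    w_next = w * np.exp(-alpha_m * y * preds)  # re-weight the samples (step 5)
    w_next /= w_next.sum()                     # normalize by the factor Z_m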

    Boosting trees

    A boosting tree is a boosting method that uses classification or regression trees as its basic classifiers. Boosting trees are considered one of the most effective methods in statistical learning.

    Boosting method: a procedure that turns a weakly learnable algorithm into a strongly learnable one. It repeatedly adjusts the weight distribution over the training data to build a sequence of basic classifiers (weak classifiers) and linearly combines them into a strong classifier. AdaBoost is a representative boosting method.

    Implementing AdaBoost from scratch

    Assume each weak classifier is produced by a rule of the form \(x < v\) or \(x > v\), where the threshold \(v\) minimizes the classifier's error rate on the training set.

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.model_selection  import train_test_split
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    
    def create_data():
        iris = load_iris()  # the iris dataset
        df = pd.DataFrame(iris.data, columns=iris.feature_names)
        df['label'] = iris.target
        data = np.array(df.iloc[:100, [0, 1, -1]])  # first 100 samples, first two features only
        for d in data:
            if d[-1] == 0:
                d[-1] = -1  # relabel class 0 as -1 for the {-1, +1} convention
        return data[:, :2], data[:, -1].astype(int)
    
    class AdaBoost:
        def __init__(self, num_classifier, increment=0.5):
            """
            
            num_classifier: 弱分类器的数量
            increment: 在特征上寻找最优切分点时,搜索时每次的增加值(数据稀疏时建议根据样本点来选择)
            """
            self.num_classifier = num_classifier
            self.increment = increment
            
        def fit(self, X, Y):
            self._init_args(X, Y)
            
            # Train the weak classifiers one by one
            for m in range(self.num_classifier):
                min_error, v_optimal, preds = float('inf'), None, None
                direct_split = None
                feature_idx = None  # column index of the selected feature
                # Pick the feature and split point that minimize the classification error
                for j in range(self.num_feature):
                    feature_values = self.X[:, j]  # all values of the j-th feature
                    _ret = self._get_optimal_split(feature_values)
                    v_split, _direct_split, error, pred_labels = _ret
                    
                    if error < min_error:
                        min_error = error
                        v_optimal = v_split
                        preds = pred_labels
                        direct_split = _direct_split
                        feature_idx = j
                
                # Compute the classifier weight alpha
                alpha = self._cal_alpha(min_error)
                self.alphas.append(alpha)
                
                # Record the current classifier G(x)
                self.classifiers.append((feature_idx, v_optimal, direct_split))
                
                # Update the weight distribution over the samples
                self._update_weights(alpha, preds)
        
        def predict(self, x):
            res = 0.0
            for i in range(len(self.classifiers)):
                idx, v, direct = self.classifiers[i]
                # Apply the i-th weak classifier
                if direct == '>':
                    output = 1 if x[idx] > v else -1
                else:  # direct == '<'
                    output = -1 if x[idx] > v else 1
                    
                res += self.alphas[i] * output
            return 1 if res > 0 else -1  # sign(res)
        
        def score(self, X_test, Y_test):
            cnt = 0
            for i, x in enumerate(X_test):
                if self.predict(x) == Y_test[i]:
                    cnt += 1
            return cnt / len(X_test)
        
        def _init_args(self, X, Y):
            self.X = X
            self.Y = Y
            self.N, self.num_feature = X.shape  # N: number of samples; num_feature: number of features

            # Initially every sample has the same weight
            self.weights = [1 / self.N] * self.N

            # The set of weak classifiers
            self.classifiers = []

            # The weight of each classifier G(x)
            self.alphas = []
                
        def _update_weights(self, alpha, pred_labels):
            # Compute the normalization factor Z
            Z = self._cal_norm_factor(alpha, pred_labels)
            for i in range(self.N):
                self.weights[i] = (self.weights[i] *
                                   np.exp(-1*alpha*self.Y[i]*pred_labels[i]) / Z)
                        
        def _cal_alpha(self, error):
            # Clamp the error away from zero so a perfect stump does not give alpha = inf
            return 0.5 * np.log((1 - error) / max(error, 1e-12))
                    
        def _cal_norm_factor(self, alpha, pred_labels):
            return sum([self.weights[i] * np.exp(-1*alpha*self.Y[i]*pred_labels[i])
                        for i in range(self.N)])
                    
        def _get_optimal_split(self, feature_values):
            error = float('inf')  # classification error
            pred_labels = []  # predicted labels
            v_split_optimal = None  # best split point for this feature
            direct_split = None  # decision direction at the best split point
            max_v = max(feature_values)
            min_v = min(feature_values)
            num_step = (max_v - min_v + self.increment)/self.increment
            for i in range(int(num_step)):
                # Pick a candidate split point
                v_split = min_v + i * self.increment
                judge_direct = '>'
                preds = [1 if feature_values[k] > v_split else -1 
                         for k in range(len(feature_values))]
                
                # Weighted error over the misclassified samples
                weight_error = sum([self.weights[k] for k in range(self.N)
                                    if preds[k] != self.Y[k]])
    
                # Error after flipping the predicted labels
                preds_inv = [-p for p in preds]
                weight_error_inv = sum([self.weights[k] for k in range(self.N)
                                   if preds_inv[k] != self.Y[k]])
    
                # Keep the direction with the smaller error as the classifier's decision direction
                if weight_error_inv < weight_error:
                    preds = preds_inv
                    weight_error = weight_error_inv
                    judge_direct = '<'
    
                if weight_error < error:
                    error = weight_error
                    pred_labels = preds
                    v_split_optimal = v_split
                    direct_split = judge_direct
    
            return v_split_optimal, direct_split, error, pred_labels
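
    As a quick sanity check, the class can be fit on the 1-D toy data of Example 8.1 in the book, where three stumps suffice for zero training error; a minimal sketch:

    X_toy = np.arange(10).reshape(-1, 1)
    Y_toy = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

    clf_toy = AdaBoost(num_classifier=3, increment=0.5)
    clf_toy.fit(X_toy, Y_toy)
    print(clf_toy.score(X_toy, Y_toy))  # expected 1.0, matching the book's result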
    

    Test the model's accuracy:

    X, Y = create_data()
    
    res = []
    for i in range(10):
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    
        clf = AdaBoost(num_classifier=50)
        clf.fit(X_train, Y_train)
        res.append(clf.score(X_test, Y_test))
    print('My AdaBoost: average accuracy over {} runs: {:.3f}'.format(len(res), sum(res)/len(res)))
    
    My AdaBoost: average accuracy over 10 runs: 0.970
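
    Since matplotlib is already imported, the two features in use can be inspected with a quick scatter plot (a sketch; the axis labels are the iris feature names):

    plt.scatter(X[:50, 0], X[:50, 1], label='y = -1 (setosa)')
    plt.scatter(X[50:, 0], X[50:, 1], label='y = +1 (versicolor)')
    plt.xlabel('sepal length (cm)')
    plt.ylabel('sepal width (cm)')
    plt.legend()
    plt.show()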
    

    AdaBoost with the sklearn library

    from sklearn.ensemble import AdaBoostClassifier
    
    res = []
    for i in range(10):
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    
        clf_sklearn = AdaBoostClassifier(n_estimators=50, learning_rate=0.5)
        clf_sklearn.fit(X_train, Y_train)
        res.append(clf_sklearn.score(X_test, Y_test))
    print('sklearn AdaBoostClassifier: average accuracy over {} runs: {:.3f}'.format(
        len(res), sum(res)/len(res)))
    
    sklearn AdaBoostClassifier: average accuracy over 10 runs: 0.945
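
    By default AdaBoostClassifier also boosts depth-1 decision trees (stumps), matching the hand-rolled classifier above. A sketch that makes the base learner explicit (the parameter is named estimator in scikit-learn >= 1.2; older releases call it base_estimator):

    from sklearn.tree import DecisionTreeClassifier

    clf_stump = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # explicit decision stump
        n_estimators=50,
        learning_rate=0.5,
    )
    clf_stump.fit(X_train, Y_train)
    print(clf_stump.score(X_test, Y_test))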
  • Original post: https://www.cnblogs.com/irvingluo/p/14557070.html