  • The AdaBoost Meta-Algorithm

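    Boosting: the individual classifiers are obtained by training in sequence. Each new classifier is trained according to the performance of the classifiers already trained, and is obtained by concentrating on the samples that the existing classifiers misclassified.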

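    Weight alpha: the linear-combination coefficient of a weak classifier, used to build the complete classifier. When a data point is classified, the result is a linear combination of the weak classifiers' results.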

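    Weight D: the weight vector over the samples. Each element expresses the importance of the corresponding sample. It is an m×1 column vector.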
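    For reference, the quantities that the code in this post computes can be written out as follows. This is the standard AdaBoost notation rather than anything stated in the post itself: the symbols \varepsilon_t (weighted error of round t), h_t (the weak classifier of round t) and H (the combined classifier) are introduced here only for illustration.

```latex
\begin{aligned}
\varepsilon_t &= \sum_{i=1}^{m} D_t(i)\,\mathbf{1}\!\left[h_t(x_i) \neq y_i\right] \\
\alpha_t &= \frac{1}{2}\,\ln\frac{1-\varepsilon_t}{\varepsilon_t} \\
D_{t+1}(i) &= \frac{D_t(i)\,\exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{\sum_{j=1}^{m} D_t(j)\,\exp\!\left(-\alpha_t\, y_j\, h_t(x_j)\right)} \\
H(x) &= \operatorname{sign}\!\left(\sum_{t} \alpha_t\, h_t(x)\right)
\end{aligned}
```

    Here y_i and h_t(x_i) both take values in {-1, +1}, so the exponent is -\alpha_t for a correctly classified sample (its weight shrinks) and +\alpha_t for a misclassified one (its weight grows).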

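    Building a weak classifier from a decision stump (a single-level decision tree): a stump makes its decision based on a single feature only.

    Decision stump generation functions: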

    from numpy import *
    def loadSimpData():
        datMat = matrix([[ 1. ,  2.1],
            [ 2. ,  1.1],
            [ 1.3,  1. ],
            [ 1. ,  1. ],
            [ 2. ,  1. ]])
        classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
        return datMat,classLabels
    def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):
        retArray=ones((shape(dataMatrix)[0],1))
        if threshIneq=='lt':
            retArray[dataMatrix[:,dimen]<=threshVal]=-1.0
        else:
            retArray[dataMatrix[:,dimen]>threshVal]=-1.0
        return retArray
    def buildStump(dataArr,classLabels,D):
        dataMatrix=mat(dataArr)
        labelMat=mat(classLabels).T
        m,n=shape(dataMatrix)
        numSteps=10.0
        bestStump={}
        bestClasEst=mat(zeros((m,1)))
        minError=inf
        for i in range(n):
            rangeMin=dataMatrix[:,i].min()
            rangeMax=dataMatrix[:,i].max()
            stepSize=(rangeMax-rangeMin)/numSteps
            for j in range(-1,int(numSteps)+1):
                for inequal in ['lt','gt']:
                    threshVal=(rangeMin+float(j)*stepSize)
                    predictVals=stumpClassify(dataMatrix,i,threshVal,inequal)
                    errArr=mat(ones((m,1)))
                    errArr[predictVals==labelMat]=0
                    weightedError=D.T*errArr        # (1 x m) * (m x 1) ==> scalar
                    print('split:dim %d,thresh:%.2f,thresh inequal:%s,the weighted error is %.3f'%(i,threshVal,inequal,weightedError))
                    if weightedError<minError:
                        minError=weightedError
                        bestClasEst=predictVals.copy()
                        bestStump['dim']=i
                        bestStump['thresh']=threshVal
                        bestStump['ineq']=inequal
        return bestStump,minError,bestClasEst
    if __name__=='__main__':
        D=mat(ones((5,1))/5)
        datMat, classLabels=loadSimpData()
        buildStump(datMat,classLabels,D)
        
    

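    stumpClassify(dataMatrix,dimen,threshVal,threshIneq): a decision stump that classifies data by comparing one feature against a threshold. All data on one side of the threshold is assigned to class -1 and the data on the other side to class +1. The function is implemented with array filtering (boolean indexing). It has two modes: values less than or equal to the threshold go to -1 and values greater than it go to +1, or the reverse.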
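    The two modes can be tried out on the toy data in isolation. The snippet below is a minimal sketch that uses plain NumPy arrays instead of the matrix type used in the post; the name stump_classify is a hypothetical stand-in for stumpClassify.

```python
import numpy as np

def stump_classify(X, dimen, thresh, ineq):
    """Return +1/-1 predictions from a threshold test on one feature."""
    pred = np.ones(X.shape[0])
    if ineq == 'lt':
        pred[X[:, dimen] <= thresh] = -1.0   # at or below the threshold -> -1
    else:
        pred[X[:, dimen] > thresh] = -1.0    # above the threshold -> -1
    return pred

# Same five points as loadSimpData()
X = np.array([[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]])

lt_pred = stump_classify(X, 0, 1.3, 'lt')
gt_pred = stump_classify(X, 0, 1.3, 'gt')
print(lt_pred)
print(gt_pred)
```

    For the same feature and threshold, the 'gt' prediction is the 'lt' prediction with every sign flipped, which is why the weighted errors of each 'lt'/'gt' pair in the output below always add up to 1.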

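    weightedError=D.T*errArr     # (1 x m) * (m x 1) ==> scalar   multiplies the error vector errArr element-wise with the weight vector D and sums the products, giving the number weightedError. This is where AdaBoost interacts with the classifier: the classifier is evaluated against the weight vector D rather than against some other error measure.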
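    To see why the weighting matters, the same predictions can be scored under two different weight vectors. This is a small sketch with plain NumPy arrays; weighted_error is a hypothetical helper, and the second weight vector is the D printed at the start of the second training round further down.

```python
import numpy as np

labels = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
pred = np.array([-1.0, 1.0, -1.0, -1.0, 1.0])   # only the first sample is wrong

def weighted_error(D, pred, labels):
    """Sum of the weights of the misclassified samples."""
    err = (pred != labels).astype(float)   # 1 where wrong, 0 where right
    return float(D @ err)

D_uniform = np.full(5, 0.2)
D_skewed = np.array([0.5, 0.125, 0.125, 0.125, 0.125])

err_uniform = weighted_error(D_uniform, pred, labels)
err_skewed = weighted_error(D_skewed, pred, labels)
print(err_uniform, err_skewed)
```

    The stump makes one mistake in both cases, but once the weight of that sample has been raised the same stump looks much worse, so a different stump is preferred in the next round.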

    Output:

    split:dim 0,thresh:0.90,thresh inequal:lt,the weighted error is 0.400
    split:dim 0,thresh:0.90,thresh inequal:gt,the weighted error is 0.600
    split:dim 0,thresh:1.00,thresh inequal:lt,the weighted error is 0.400
    split:dim 0,thresh:1.00,thresh inequal:gt,the weighted error is 0.600
    split:dim 0,thresh:1.10,thresh inequal:lt,the weighted error is 0.400
    split:dim 0,thresh:1.10,thresh inequal:gt,the weighted error is 0.600
    split:dim 0,thresh:1.20,thresh inequal:lt,the weighted error is 0.400
    split:dim 0,thresh:1.20,thresh inequal:gt,the weighted error is 0.600
    split:dim 0,thresh:1.30,thresh inequal:lt,the weighted error is 0.200
    split:dim 0,thresh:1.30,thresh inequal:gt,the weighted error is 0.800
    split:dim 0,thresh:1.40,thresh inequal:lt,the weighted error is 0.200
    split:dim 0,thresh:1.40,thresh inequal:gt,the weighted error is 0.800
    split:dim 0,thresh:1.50,thresh inequal:lt,the weighted error is 0.200
    split:dim 0,thresh:1.50,thresh inequal:gt,the weighted error is 0.800
    split:dim 0,thresh:1.60,thresh inequal:lt,the weighted error is 0.200
    split:dim 0,thresh:1.60,thresh inequal:gt,the weighted error is 0.800
    split:dim 0,thresh:1.70,thresh inequal:lt,the weighted error is 0.200
    split:dim 0,thresh:1.70,thresh inequal:gt,the weighted error is 0.800
    split:dim 0,thresh:1.80,thresh inequal:lt,the weighted error is 0.200
    split:dim 0,thresh:1.80,thresh inequal:gt,the weighted error is 0.800
    split:dim 0,thresh:1.90,thresh inequal:lt,the weighted error is 0.200
    split:dim 0,thresh:1.90,thresh inequal:gt,the weighted error is 0.800
    split:dim 0,thresh:2.00,thresh inequal:lt,the weighted error is 0.600
    split:dim 0,thresh:2.00,thresh inequal:gt,the weighted error is 0.400
    split:dim 1,thresh:0.89,thresh inequal:lt,the weighted error is 0.400
    split:dim 1,thresh:0.89,thresh inequal:gt,the weighted error is 0.600
    split:dim 1,thresh:1.00,thresh inequal:lt,the weighted error is 0.200
    split:dim 1,thresh:1.00,thresh inequal:gt,the weighted error is 0.800
    split:dim 1,thresh:1.11,thresh inequal:lt,the weighted error is 0.400
    split:dim 1,thresh:1.11,thresh inequal:gt,the weighted error is 0.600
    split:dim 1,thresh:1.22,thresh inequal:lt,the weighted error is 0.400
    split:dim 1,thresh:1.22,thresh inequal:gt,the weighted error is 0.600
    split:dim 1,thresh:1.33,thresh inequal:lt,the weighted error is 0.400
    split:dim 1,thresh:1.33,thresh inequal:gt,the weighted error is 0.600
    split:dim 1,thresh:1.44,thresh inequal:lt,the weighted error is 0.400
    split:dim 1,thresh:1.44,thresh inequal:gt,the weighted error is 0.600
    split:dim 1,thresh:1.55,thresh inequal:lt,the weighted error is 0.400
    split:dim 1,thresh:1.55,thresh inequal:gt,the weighted error is 0.600
    split:dim 1,thresh:1.66,thresh inequal:lt,the weighted error is 0.400
    split:dim 1,thresh:1.66,thresh inequal:gt,the weighted error is 0.600
    split:dim 1,thresh:1.77,thresh inequal:lt,the weighted error is 0.400
    split:dim 1,thresh:1.77,thresh inequal:gt,the weighted error is 0.600
    split:dim 1,thresh:1.88,thresh inequal:lt,the weighted error is 0.400
    split:dim 1,thresh:1.88,thresh inequal:gt,the weighted error is 0.600
    split:dim 1,thresh:1.99,thresh inequal:lt,the weighted error is 0.400
    split:dim 1,thresh:1.99,thresh inequal:gt,the weighted error is 0.600
    split:dim 1,thresh:2.10,thresh inequal:lt,the weighted error is 0.600
    split:dim 1,thresh:2.10,thresh inequal:gt,the weighted error is 0.400

    Training AdaBoost with decision stumps:

    def adaBoostTrainDS(dataArr,classLabels,numIter=40):
        weakClassArr=[]
        m=shape(dataArr)[0]
        D=mat(ones((m,1))/m)    # every sample weight is initialized to 1/m
        aggClassEst=mat(zeros((m,1)))
        for i in range(numIter):
            bestStump,error,classEst=buildStump(dataArr,classLabels,D)
            print('D:',D.T)
            # 1e-16 floors the denominator so that error == 0 cannot divide by zero
            alpha=float(0.5*log((1.0-error)/max(error,1e-16)))
            print('alpha:',alpha)
            bestStump['alpha']=alpha
            weakClassArr.append(bestStump)
            print('classEst:',classEst.T)
            # raise the weights of misclassified samples, lower the others
            expon=multiply(-1*alpha*mat(classLabels).T,classEst)
            D=multiply(D,exp(expon))
            D=D/D.sum()
            aggClassEst+=alpha*classEst
            print('aggClassEst:',aggClassEst.T)
            aggErrors=multiply(sign(aggClassEst)!=mat(classLabels).T,ones((m,1)))
            errorRate=aggErrors.sum()/m
            print('total error:',errorRate,'\n')
            if errorRate==0.0:
                break
        return weakClassArr
    

    D is a probability distribution vector: its elements sum to 1.
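    The update and renormalization of D can be checked by hand for the first round on the toy data. The sketch below uses plain NumPy arrays; update_weights is a hypothetical helper that mirrors the expon / exp / D.sum() lines of adaBoostTrainDS.

```python
import numpy as np

def update_weights(D, alpha, labels, pred):
    """Scale each weight by exp(-alpha * y * h(x)), then renormalize to sum to 1."""
    D_new = D * np.exp(-alpha * labels * pred)
    return D_new / D_new.sum()

labels = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
pred = np.array([-1.0, 1.0, -1.0, -1.0, 1.0])   # first-round stump: sample 0 is wrong

D0 = np.full(5, 0.2)                      # uniform start, 1/m each
alpha = 0.5 * np.log((1.0 - 0.2) / 0.2)   # weighted error of this stump is 0.2
D1 = update_weights(D0, alpha, labels, pred)
print(D1, D1.sum())
```

    The misclassified sample ends up carrying half of the total weight, which matches the second D line printed in the training output below.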

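    First, the buildStump() function from above builds a decision stump. Its input is the weight vector D, and it returns the stump with the lowest weighted error under D, together with that minimum error and the vector of predicted classes.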

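    alpha=float(0.5*log((1.0-error)/max(error,1e-16)))  The max(error,1e-16) guards against a division by zero when error is 0. The floor has to be written as the literal 1e-16: under from numpy import *, the expression e-16 is Euler's number minus 16 (about -13.28), so max(error,e-16) would always return error and give no protection.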
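    The effect of the floor is easy to check in isolation. This is a minimal sketch using only the standard library; stump_alpha and its eps parameter are hypothetical names, not part of the post's code.

```python
import math

def stump_alpha(error, eps=1e-16):
    """Classifier weight 0.5 * ln((1 - error) / error), with the denominator floored at eps."""
    return 0.5 * math.log((1.0 - error) / max(error, eps))

alpha_first_round = stump_alpha(0.2)   # weighted error of the first stump on the toy data
alpha_perfect = stump_alpha(0.0)       # a perfect stump: large but finite

# What the floor would be if it were written as e-16 with e bound to Euler's number
bad_floor = math.e - 16

print(alpha_first_round, alpha_perfect, bad_floor)
```

    A lower weighted error gives a larger alpha, so more accurate stumps get more say in the final vote.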

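    aggClassEst is an m×1 column vector that holds the running class estimates. Its sign is the prediction: a positive value means the sample is currently predicted as class 1, a negative value as class -1.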

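    aggClassEst+=alpha*classEst  The final prediction value is the linear combination of the weak classifiers' outputs weighted by their alpha values. Each iteration produces one weak classifier, which amounts to one more correction of the final result.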

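    aggErrors=multiply(sign(aggClassEst)!=mat(classLabels).T,ones((m,1)))  sets the positions of misclassified samples to 1, which makes it easy to get the total number of misclassifications and the error rate.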
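    The same error-rate computation can be written with plain NumPy arrays. In the sketch below, training_error is a hypothetical helper, and the aggClassEst values are copied from the first and third rounds of the training output shown further down.

```python
import numpy as np

labels = np.array([1.0, 1.0, -1.0, -1.0, 1.0])

def training_error(agg_est, labels):
    """Fraction of samples whose aggregated estimate has the wrong sign."""
    wrong = np.sign(agg_est) != labels   # True where the combined vote is wrong
    return wrong.sum() / len(labels)

agg_round1 = np.array([-0.69314718, 0.69314718, -0.69314718, -0.69314718, 0.69314718])
agg_round3 = np.array([1.17568763, 2.56198199, -0.77022252, -0.77022252, 0.61607184])

err1 = training_error(agg_round1, labels)
err3 = training_error(agg_round3, labels)
print(err1, err3)
```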

    Testing AdaBoost:

    if __name__=='__main__':
        D=mat(ones((5,1))/5)
        datMat, classLabels=loadSimpData()
        classifyArray=adaBoostTrainDS(datMat,classLabels,9)
        print(classifyArray)
    

    Output:

    D: [[0.2 0.2 0.2 0.2 0.2]]
    alpha: 0.6931471805599453
    classEst: [[-1.  1. -1. -1.  1.]]
    aggClassEst: [[-0.69314718  0.69314718 -0.69314718 -0.69314718  0.69314718]]
    total error: 0.2 
    
    D: [[0.5   0.125 0.125 0.125 0.125]]
    alpha: 0.9729550745276565
    classEst: [[ 1.  1. -1. -1. -1.]]
    aggClassEst: [[ 0.27980789  1.66610226 -1.66610226 -1.66610226 -0.27980789]]
    total error: 0.2 
    
    D: [[0.28571429 0.07142857 0.07142857 0.07142857 0.5       ]]
    alpha: 0.8958797346140273
    classEst: [[1. 1. 1. 1. 1.]]
    aggClassEst: [[ 1.17568763  2.56198199 -0.77022252 -0.77022252  0.61607184]]
    total error: 0.0 
    
    [{'alpha': 0.6931471805599453, 'dim': 0, 'ineq': 'lt', 'thresh': 1.3}, {'alpha': 0.9729550745276565, 'dim': 1, 'ineq': 'lt', 'thresh': 1.0}, {'alpha': 0.8958797346140273, 'dim': 0, 'ineq': 'lt', 'thresh': 0.9}]
    

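    classifyArray is a list made up of three weak classifiers (one dict each) and contains all the information needed for classification. The training error rate is now 0; the test error rate is discussed next.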


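    The value returned by the function above contains the weak classifiers and their alpha values, so testing is easy: apply each weak classifier to the data to be classified, weight each weak classifier's result by its alpha, and sum the weighted results to get the final result. This is what adaClassify, defined in the complete code at the end, does.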
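    The weighted vote can be reproduced step by step with the three stumps printed above. The snippet is a sketch with plain NumPy arrays; stump_predict and weighted_vote are hypothetical names standing in for stumpClassify and adaClassify.

```python
import numpy as np

# The three stumps copied from the training output above
stumps = [
    {'alpha': 0.6931471805599453, 'dim': 0, 'ineq': 'lt', 'thresh': 1.3},
    {'alpha': 0.9729550745276565, 'dim': 1, 'ineq': 'lt', 'thresh': 1.0},
    {'alpha': 0.8958797346140273, 'dim': 0, 'ineq': 'lt', 'thresh': 0.9},
]

def stump_predict(X, stump):
    """+1/-1 prediction of a single stump."""
    pred = np.ones(X.shape[0])
    if stump['ineq'] == 'lt':
        pred[X[:, stump['dim']] <= stump['thresh']] = -1.0
    else:
        pred[X[:, stump['dim']] > stump['thresh']] = -1.0
    return pred

def weighted_vote(X, stumps):
    """Sum of the stump predictions, each weighted by its alpha."""
    X = np.asarray(X, dtype=float)
    agg = np.zeros(X.shape[0])
    for stump in stumps:
        agg += stump['alpha'] * stump_predict(X, stump)
    return agg

agg = weighted_vote([[5, 5], [0, 0]], stumps)
predicted = np.sign(agg)
print(agg, predicted)
```

    All three stumps agree on both points here, so the magnitude of the aggregate is simply the sum of the three alpha values.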

    if __name__=='__main__':
        D=mat(ones((5,1))/5)
        datMat, classLabels=loadSimpData()
        classifyArray=adaBoostTrainDS(datMat,classLabels,9)
        result=adaClassify([[5,5],[0,0]],classifyArray)
        print('Final classification result:',result)
    

    Output:

    aggClassEst: [[ 0.69314718]
     [-0.69314718]]
    aggClassEst: [[ 1.66610226]
     [-1.66610226]]
    aggClassEst: [[ 2.56198199]
     [-2.56198199]]
    Final classification result: [[ 1.]
     [-1.]]
    

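    As aggClassEst shows, the prediction grows stronger as the three weak classifiers are added up; that is, it moves farther and farther from the decision boundary at 0.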


    Applying AdaBoost to a difficult dataset: predicting whether horses with colic will survive.

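    An adaptive data-loading function: the number of features in each file does not need to be specified, and the last column is assumed to be the class label.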
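    The parsing rule can be tried without the horse colic files. The sketch below writes two tab-separated rows to a temporary file and reads them back; load_tab_dataset is a hypothetical name, and the two rows are made-up sample values rather than data from the post.

```python
import os
import tempfile

def load_tab_dataset(path):
    """Read tab-separated rows; every column but the last is a feature, the last is the label."""
    features, labels = [], []
    with open(path) as f:
        for line in f:
            fields = line.strip().split('\t')
            if fields == ['']:        # skip blank lines
                continue
            features.append([float(v) for v in fields[:-1]])
            labels.append(float(fields[-1]))
    return features, labels

# Write a tiny two-row file so the example runs without external data
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('1.0\t2.1\t1.0\n2.0\t1.1\t-1.0\n')
    path = tmp.name

X, y = load_tab_dataset(path)
os.remove(path)
print(X, y)
```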

    def loadDataSet(filename):
        dataMat = []
        labelMat = []
        with open(filename) as f:    # the file is closed automatically
            for line in f:
                curLine=line.strip().split('\t')    # fields are tab-separated
                # every column except the last is a feature; the last is the class label
                dataMat.append([float(v) for v in curLine[:-1]])
                labelMat.append(float(curLine[-1]))
        return dataMat,labelMat
    

    Testing the meta-algorithm on the horse colic dataset:

    if __name__=='__main__':
        dataArr,labelArr=loadDataSet('horseColicTraining2.txt')
        classifyArray=adaBoostTrainDS(dataArr,labelArr,10)
        testArr,testLabelArr=loadDataSet('horseColicTest2.txt')
        prediction10=adaClassify(testArr,classifyArray)
        errArr=mat(ones((67,1)))
        count=errArr[prediction10!=mat(testLabelArr).T].sum()
        print(prediction10)
        print(count)
    

    Output:

    total error: 0.2842809364548495 
    
    total error: 0.2842809364548495 
    
    total error: 0.24749163879598662 
    
    total error: 0.24749163879598662 
    
    total error: 0.25418060200668896 
    
    total error: 0.2408026755852843 
    
    total error: 0.2408026755852843 
    
    total error: 0.22073578595317725 
    
    total error: 0.24749163879598662 
    
    total error: 0.23076923076923078 
    
    [[ 1.]
     [ 1.]
     [ 1.]
     [-1.]
     [ 1.]
     [ 1.]
     [-1.]
     [ 1.]
     [ 1.]
     [-1.]
     [-1.]
     [-1.]
     [-1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [-1.]
     [-1.]
     [-1.]
     [-1.]
     [ 1.]
     [-1.]
     [-1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [-1.]
     [-1.]
     [-1.]
     [-1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [-1.]
     [-1.]
     [ 1.]
     [-1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [-1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [-1.]
     [ 1.]
     [-1.]
     [ 1.]
     [-1.]
     [-1.]
     [ 1.]
     [ 1.]
     [ 1.]
     [ 1.]]
    16.0

    After 10 iterations, which produce 10 weak classifiers, the final training error rate is: total error: 0.23076923076923078

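    The test set has 67 samples and 16 of them are misclassified, an error rate of 16/67=0.23880597014925373, considerably lower than the 35% error rate of the logistic regression predictions.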
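    The test script above hard-codes the 67 test samples when it builds errArr. A size-independent way to get the same number is sketched below; error_rate is a hypothetical helper, and the four-sample input is made up purely to exercise it.

```python
import numpy as np

def error_rate(pred, labels):
    """Fraction of predictions that differ from the true labels, for any number of samples."""
    pred = np.asarray(pred, dtype=float).ravel()      # accepts an m x 1 column as well
    labels = np.asarray(labels, dtype=float).ravel()
    return float(np.mean(pred != labels))

# Made-up example: one mistake out of four
toy_rate = error_rate([[1], [-1], [1], [1]], [1, 1, 1, 1])

# The figure quoted in the text: 16 mistakes out of 67 test samples
reported_rate = 16 / 67
print(toy_rate, reported_rate)
```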

    from numpy import *
    def loadSimpData():
        datMat = matrix([[ 1. ,  2.1],
            [ 2. ,  1.1],
            [ 1.3,  1. ],
            [ 1. ,  1. ],
            [ 2. ,  1. ]])
        classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
        return datMat,classLabels
    def loadDataSet(filename):
        dataMat = []
        labelMat = []
        with open(filename) as f:    # the file is closed automatically
            for line in f:
                curLine=line.strip().split('\t')    # fields are tab-separated
                # every column except the last is a feature; the last is the class label
                dataMat.append([float(v) for v in curLine[:-1]])
                labelMat.append(float(curLine[-1]))
        return dataMat,labelMat
    
    
    def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):
        retArray=ones((shape(dataMatrix)[0],1))
        if threshIneq=='lt':
            retArray[dataMatrix[:,dimen]<=threshVal]=-1.0
        else:
            retArray[dataMatrix[:,dimen]>threshVal]=-1.0
        return retArray
    def buildStump(dataArr,classLabels,D):
        dataMatrix=mat(dataArr)
        labelMat=mat(classLabels).T
        m,n=shape(dataMatrix)
        numSteps=10.0
        bestStump={}
        bestClasEst=mat(zeros((m,1)))
        minError=inf
        for i in range(n):
            rangeMin=dataMatrix[:,i].min()
            rangeMax=dataMatrix[:,i].max()
            stepSize=(rangeMax-rangeMin)/numSteps
            for j in range(-1,int(numSteps)+1):
                for inequal in ['lt','gt']:
                    threshVal=(rangeMin+float(j)*stepSize)
                    predictVals=stumpClassify(dataMatrix,i,threshVal,inequal)
                    errArr=mat(ones((m,1)))
                    errArr[predictVals==labelMat]=0
                    weightedError=D.T*errArr        # (1 x m) * (m x 1) ==> scalar
                    # print('split:dim %d,thresh:%.2f,thresh inequal:%s,the weighted error is %.3f'%(i,threshVal,inequal,weightedError))
                    if weightedError<minError:
                        minError=weightedError
                        bestClasEst=predictVals.copy()
                        bestStump['dim']=i
                        bestStump['thresh']=threshVal
                        bestStump['ineq']=inequal
        return bestStump,minError,bestClasEst
    def adaBoostTrainDS(dataArr,classLabels,numIter=40):
        weakClassArr=[]
        m=shape(dataArr)[0]
        D=mat(ones((m,1))/m)    # every sample weight is initialized to 1/m
        aggClassEst=mat(zeros((m,1)))
        for i in range(numIter):
            bestStump,error,classEst=buildStump(dataArr,classLabels,D)
            #print('D:',D.T)
            alpha=float(0.5*log((1.0-error)/max(error,1e-16)))    # 1e-16 floors the denominator when error is 0
            #print('alpha:',alpha)
            bestStump['alpha']=alpha
            weakClassArr.append(bestStump)
            #print('classEst:',classEst.T)
            expon=multiply(-1*alpha*mat(classLabels).T,classEst)
            D=multiply(D,exp(expon))
            D=D/D.sum()
            aggClassEst+=alpha*classEst
            #print('aggClassEst:',aggClassEst.T)
            aggErrors=multiply(sign(aggClassEst)!=mat(classLabels).T,ones((m,1)))
            errorRate=aggErrors.sum()/m
            # print('total error:',errorRate,'\n')
            if errorRate==0.0:
                break
        return weakClassArr,aggClassEst
    def adaClassify(dataToClass,classifierArr):
        dataMatrix=mat(dataToClass)
        m=shape(dataMatrix)[0]
        aggClassEst=zeros((m,1))
        for i in range(len(classifierArr)):
            classEst=stumpClassify(dataMatrix,classifierArr[i]['dim'],classifierArr[i]['thresh'],classifierArr[i]['ineq'])
            aggClassEst+=classifierArr[i]['alpha']*classEst
            # print("aggClassEst:",aggClassEst)
        return sign(aggClassEst)
    def plotROC(predStrengths,classLabels):
        import matplotlib.pyplot as plt
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        cur = (1.0, 1.0)  # cursor
        ySum = 0.0  # variable to calculate AUC
        numPosClas = sum(array(classLabels) == 1.0)
        yStep = 1 / float(numPosClas)
        xStep = 1 / float(len(classLabels) - numPosClas)
        sortedIndicies = predStrengths.argsort()  # indices that sort the prediction strengths in ascending order
        fig = plt.figure()
        fig.clf()
        ax = plt.subplot(111)
        # print(type(sortedIndicies))
        # print(sortedIndicies.tolist())
        for index in sortedIndicies.tolist()[0]:
            if classLabels[index] == 1.0:
                delX = 0
                delY = yStep
            else:
                delX = xStep
                delY = 0
                ySum += cur[1]
            # draw line from cur to (cur[0]-delX,cur[1]-delY)
            ax.plot([cur[0], cur[0] - delX], [cur[1], cur[1] - delY], c='y')
            cur = (cur[0] - delX, cur[1] - delY)
        ax.plot([0, 1], [0, 1], 'r--')
        plt.xlabel('False positive rate')
        plt.ylabel('True positive rate')
        plt.title('ROC curve for the AdaBoost horse colic detection system')
        ax.axis([0, 1, 0, 1])
        plt.show()
        print("the Area Under the Curve is: ", ySum * xStep)
    
    
    
    if __name__=='__main__':
        dataArr,labelArr=loadDataSet('horseColicTraining2.txt')
        classifyArray,aggClassEst=adaBoostTrainDS(dataArr,labelArr,10)
        plotROC(aggClassEst.T,labelArr)
    
        # testArr,testLabelArr=loadDataSet('horseColicTest2.txt')
        # prediction10=adaClassify(testArr,classifyArray)
        # errArr=mat(ones((67,1)))
        # count=errArr[prediction10!=mat(testLabelArr).T].sum()
        # print(prediction10)
        # print(count)
    
    
        # D=mat(ones((5,1))/5)
        # datMat, classLabels=loadSimpData()
        # result=adaClassify([[5,5],[0,0]],classifyArray)
        # print('最终分类结果为:', result)
    The listing above is the complete code, including the plotROC function for drawing the ROC curve.
  • Original article: https://www.cnblogs.com/zhhy236400/p/9921574.html