  • Machine learning (improving classification performance with the AdaBoost meta-algorithm)

    The idea behind a meta-algorithm is to combine other algorithms in some way. AdaBoost is one such approach; the listing below implements it with decision stumps as the weak classifiers.

    from numpy import *
    
    def loadSimpData():
        datMat = matrix([[ 1. ,  2.1],
            [ 2. ,  1.1],
            [ 1.3,  1. ],
            [ 1. ,  1. ],
            [ 2. ,  1. ]])
        classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
        return datMat,classLabels
    
    def loadDataSet(fileName):      #general function to parse tab-delimited floats
        numFeat = len(open(fileName).readline().split('\t')) #get number of fields
        dataMat = []; labelMat = []
        fr = open(fileName)
        for line in fr.readlines():
            lineArr =[]
            curLine = line.strip().split('\t')
            for i in range(numFeat-1):
                lineArr.append(float(curLine[i]))
            dataMat.append(lineArr)
            labelMat.append(float(curLine[-1]))
        return dataMat,labelMat
    
    def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):#just classify the data
        retArray = ones((shape(dataMatrix)[0],1))
        if threshIneq == 'lt':
            retArray[dataMatrix[:,dimen] <= threshVal] = -1.0
        else:
            retArray[dataMatrix[:,dimen] > threshVal] = -1.0
        return retArray
        
    
    def buildStump(dataArr,classLabels,D):
        dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
        m,n = shape(dataMatrix)
        numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m,1)))
        minError = inf #init error sum, to +infinity
        for i in range(n):#loop over all dimensions
            rangeMin = dataMatrix[:,i].min(); rangeMax = dataMatrix[:,i].max();
            stepSize = (rangeMax-rangeMin)/numSteps
            for j in range(-1,int(numSteps)+1):#loop over all range in current dimension
                for inequal in ['lt', 'gt']: #go over less than and greater than
                    threshVal = (rangeMin + float(j) * stepSize)
                    predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)#classify with this dimension, threshold and inequality
                    errArr = mat(ones((m,1)))
                    errArr[predictedVals == labelMat] = 0
                    weightedError = D.T*errArr  #calc total error multiplied by D
                    #print "split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError)
                    if weightedError < minError:
                        minError = weightedError
                        bestClasEst = predictedVals.copy()
                        bestStump['dim'] = i
                        bestStump['thresh'] = threshVal
                        bestStump['ineq'] = inequal
        return bestStump,minError,bestClasEst
    
    
    def adaBoostTrainDS(dataArr,classLabels,numIt=40):
        weakClassArr = []
        m = shape(dataArr)[0]
        D = mat(ones((m,1))/m)   #init D to all equal
        aggClassEst = mat(zeros((m,1)))
        for i in range(numIt):
            bestStump,error,classEst = buildStump(dataArr,classLabels,D)#build Stump
            #print "D:",D.T
            alpha = float(0.5*log((1.0-error)/max(error,1e-16)))#calc alpha, throw in max(error,eps) to account for error=0
            bestStump['alpha'] = alpha  
            weakClassArr.append(bestStump)                  #store Stump Params in Array
            #print "classEst: ",classEst.T
            expon = multiply(-1*alpha*mat(classLabels).T,classEst) #exponent for D calc, getting messy
            D = multiply(D,exp(expon))                              #Calc New D for next iteration
            D = D/D.sum()
            #calc training error of all classifiers, if this is 0 quit for loop early (use break)
            aggClassEst += alpha*classEst
            #print "aggClassEst: ",aggClassEst.T
            aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T,ones((m,1)))
            errorRate = aggErrors.sum()/m
            print "total error: ",errorRate
            if errorRate == 0.0: break
        return weakClassArr,aggClassEst
    
    def adaClassify(datToClass,classifierArr):
        dataMatrix = mat(datToClass)#aggregate weighted votes, mirroring aggClassEst in adaBoostTrainDS
        m = shape(dataMatrix)[0]
        aggClassEst = mat(zeros((m,1)))
        for i in range(len(classifierArr)):
            classEst = stumpClassify(dataMatrix,classifierArr[i]['dim'],
                                     classifierArr[i]['thresh'],
                                     classifierArr[i]['ineq'])#call stump classify
            aggClassEst += classifierArr[i]['alpha']*classEst
            print(aggClassEst)
        return sign(aggClassEst)
    
    def plotROC(predStrengths, classLabels):
        import matplotlib.pyplot as plt
        cur = (1.0,1.0) #cursor
        ySum = 0.0 #variable to calculate AUC
        numPosClas = sum(array(classLabels)==1.0)
        yStep = 1/float(numPosClas); xStep = 1/float(len(classLabels)-numPosClas)
        sortedIndicies = predStrengths.argsort()#indices sorted by predicted strength, ascending, so the curve is drawn from (1,1) toward (0,0)
        fig = plt.figure()
        fig.clf()
        ax = plt.subplot(111)
        #loop through all the values, drawing a line segment at each point
        for index in sortedIndicies.tolist()[0]:
            if classLabels[index] == 1.0:
                delX = 0; delY = yStep;
            else:
                delX = xStep; delY = 0;
                ySum += cur[1]
            #draw line from cur to (cur[0]-delX,cur[1]-delY)
            ax.plot([cur[0],cur[0]-delX],[cur[1],cur[1]-delY], c='b')
            cur = (cur[0]-delX,cur[1]-delY)
        ax.plot([0,1],[0,1],'b--')
        plt.xlabel('False positive rate'); plt.ylabel('True positive rate')
        plt.title('ROC curve for AdaBoost horse colic detection system')
        ax.axis([0,1,0,1])
        plt.show()
        print "the Area Under the Curve is: ",ySum*xStep

    AdaBoost is the most popular meta-algorithm and one of the most powerful tools in machine learning.

    The combination can take several forms: different algorithms combined together, the same algorithm run under different settings, or different parts of the dataset assigned to different classifiers whose results are then merged.

    Pros: low generalization error, easy to code, works with most classifiers, no parameters to tune.

    Cons: sensitive to outliers.

    Works with: numeric and nominal values.

    Bagging is a technique that draws S new datasets from the original one by sampling S times. Each new dataset has the same size as the original and is built by randomly selecting samples with replacement, so some values may appear more than once while others may not appear at all.

    Once the S datasets are built, a learning algorithm is applied to each one, yielding S classifiers. To classify a new example, all S classifiers are applied and the class that receives the most votes is taken as the final result; a sketch follows below.
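
    As an illustration of the bagging procedure just described, here is a minimal sketch (not part of the original text). It reuses buildStump and stumpClassify from the listing above as the base learner; the uniform weight vector D simply turns the weighted stump fit into an ordinary, unweighted one.

    import numpy as np

    def baggingTrain(dataArr, classLabels, S=11):
        """Build S stumps, each trained on a bootstrap sample drawn with replacement."""
        dataMatrix = np.mat(dataArr); labelMat = np.mat(classLabels).T
        m = np.shape(dataMatrix)[0]
        D = np.mat(np.ones((m, 1)) / m)        # equal weights: a plain, unweighted fit
        classifiers = []
        for _ in range(S):
            idx = np.random.randint(0, m, m)   # m row indices drawn with replacement
            sampleLabels = labelMat[idx].T.tolist()[0]
            stump, _, _ = buildStump(dataMatrix[idx, :], sampleLabels, D)
            classifiers.append(stump)
        return classifiers

    def baggingClassify(datToClass, classifiers):
        """Equal-weight majority vote over the S classifiers (odd S avoids ties)."""
        dataMatrix = np.mat(datToClass)
        votes = np.mat(np.zeros((np.shape(dataMatrix)[0], 1)))
        for c in classifiers:
            votes += stumpClassify(dataMatrix, c['dim'], c['thresh'], c['ineq'])
        return np.sign(votes)

    On the toy dataset, baggingClassify(datMat, baggingTrain(datMat, classLabels)) returns the bagged predictions.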

    A more advanced variant of bagging is the random forest.

    Boosting is a technique similar to bagging, but while bagging builds its classifiers independently, boosting trains them sequentially: each new classifier concentrates on the data that the classifiers already built have misclassified.

    The boosting output is a weighted sum of the results of all the classifiers. In bagging the weights are equal, whereas in boosting they differ: each weight reflects how well the corresponding classifier did in the previous round of iteration.

    AdaBoost is one specific boosting algorithm.
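
    In AdaBoost that per-classifier weight is the alpha computed in adaBoostTrainDS above: alpha = 0.5 * ln((1 - err) / err), where err is the stump's weighted error. A quick illustration (the error values here are arbitrary) of how the weight grows as the error shrinks:

    from math import log
    for err in (0.4, 0.3, 0.2, 0.1):
        print(err, round(0.5 * log((1.0 - err) / err), 3))
    # 0.4 -> 0.203, 0.3 -> 0.424, 0.2 -> 0.693, 0.1 -> 1.099:
    # the lower a classifier's weighted error, the larger its say in the final sum.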

    The AdaBoost algorithm can be summarized in three steps (a small worked example follows the list):
     (1) First, initialize the weight distribution D1 over the training data. With N training samples, every sample starts with the same weight: w1 = 1/N.
     (2) Then, train a weak classifier hi. During training, if a sample is classified correctly by hi, its weight is decreased when the next training set is built; if it is misclassified, its weight is increased. The re-weighted samples are used to train the next classifier, and the whole process repeats iteratively.
     (3) Finally, combine the trained weak classifiers into a single strong classifier. Once training is done, weak classifiers with a small classification error are given large weights so they dominate the final decision function, while those with a large error are given small weights and play a minor role.
      In other words, a weak classifier with a low error rate gets a large weight in the final classifier, and one with a high error rate gets a small weight.
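
    The sketch below runs one round of this procedure by hand, in the same style as adaBoostTrainDS. The setup is an illustrative assumption: five samples with initial weight 1/5 each, and a stump that misclassifies only the fourth sample, for a weighted error of 0.2.

    from numpy import exp, log, mat, multiply, ones

    m = 5
    D = mat(ones((m, 1)) / m)                      # step (1): D1 = 1/N for every sample
    labels = mat([1.0, 1.0, -1.0, -1.0, 1.0]).T
    est    = mat([1.0, 1.0, -1.0,  1.0, 1.0]).T    # the fourth sample is misclassified
    error = float(D.T * (est != labels))           # weighted error = 0.2
    alpha = 0.5 * log((1.0 - error) / error)       # = 0.693, this classifier's weight (step 3)
    expon = multiply(-alpha * labels, est)         # -alpha if correct, +alpha if wrong
    D = multiply(D, exp(expon)); D = D / D.sum()   # step (2): re-weight and renormalize
    print(D.T)  # correct samples drop to 0.125 each; the misclassified one rises to 0.5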
