  • Machine Learning (Logistic Regression)

    This post is our first encounter with an optimization algorithm.

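    The main idea behind classification with logistic regression is to fit a regression formula to the decision boundary using the existing data, and then classify with it. That decision boundary is the regression function we are solving for.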

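    The word "regression" here comes from best fit: we want the set of parameters that fits the data best, and we find it with an optimization algorithm. Training the classifier therefore means determining the best regression coefficients, which give each feature its own weight.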

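    Pros: computationally inexpensive, easy to understand and implement

    Cons: prone to underfitting, and classification accuracy may be low

    Works with: nominal and numeric data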

    Algorithm basics

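    The mapping function used is the sigmoid function, sigmoid(z) = 1/(1 + exp(-z)). Its advantage over the 0-1 step function is that it is smooth when viewed locally, yet looks almost like a step when viewed over a wide range. The step function itself jumps instantaneously from 0 to 1; that jump is hard to handle because the function is not smooth there, and it leads to larger errors.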
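To make the comparison concrete, here is a small sketch that evaluates both functions at a few points. The helper name `step` and the sample values are chosen only for illustration; `sigmoid` mirrors the function defined in the listing further down.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)); smooth everywhere."""
    return 1.0 / (1.0 + np.exp(-z))

def step(z):
    """0-1 step function, shown only for comparison; it jumps at z = 0."""
    return np.where(z >= 0, 1.0, 0.0)

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(np.round(sigmoid(z), 3))  # changes gradually, crossing 0.5 at z = 0
print(step(z))                  # switches abruptly from 0 to 1
```

At z = -6 and z = 6 the sigmoid is already within about 0.003 of 0 and 1, which is why it looks like a step from a distance.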

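    To implement a logistic regression classifier, we multiply each feature by a regression coefficient, add up all the products, and feed the sum into the sigmoid function. If the result is greater than 0.5 the sample is assigned to class 1; if it is less than 0.5, to class 0. Because the sigmoid output lies between 0 and 1, the method can also be read as a probability estimate.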
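The decision rule can be sketched in a few lines. The coefficients and samples below are made up for illustration and are not taken from the post's data; the function name `classify` is likewise hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(features, weights):
    """Return (label, probability) for one sample."""
    # Weighted sum of the features, squashed into (0, 1) by the sigmoid.
    prob = sigmoid(np.dot(features, weights))
    label = 1 if prob > 0.5 else 0
    return label, prob

# Hypothetical coefficients: intercept first, then one weight per feature.
weights = np.array([-1.0, 2.0, 0.5])
# The leading 1.0 in each sample multiplies the intercept.
sample_a = np.array([1.0, 1.5, 0.0])   # weighted sum = 2.0
sample_b = np.array([1.0, 0.0, 1.0])   # weighted sum = -0.5

print(classify(sample_a, weights))     # probability above 0.5, so class 1
print(classify(sample_b, weights))     # probability below 0.5, so class 0
```

A positive weighted sum always gives a probability above 0.5, so the 0.5 threshold is equivalent to checking the sign of the sum.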

    Finding the best regression coefficients

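    1. Gradient ascent finds the maximum of a function; the more commonly mentioned gradient descent finds the minimum.

    2. The gradient is the multivariable counterpart of the derivative (the vector of partial derivatives). It points in the direction in which the function increases fastest and is usually written with the nabla symbol ∇.

    3. The update rule is w = w + α∇f(w), where α is the step size (alpha in the code). The update is repeated until a fixed number of iterations is reached or the change falls within an allowed tolerance.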
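Before applying the update rule to logistic regression, it may help to see it on a one-dimensional function whose maximum is known. This is a minimal sketch: the function f(w) = -(w - 3)^2 + 5, the names `grad_ascent` and `f_grad`, and the step size and tolerance are all assumptions made for this example.

```python
def grad_ascent(grad, w0, alpha=0.1, max_cycles=500, tol=1e-8):
    """Repeat w = w + alpha * grad(w) until the step is tiny or cycles run out."""
    w = w0
    for _ in range(max_cycles):
        step = alpha * grad(w)
        w = w + step
        if abs(step) < tol:   # change is within the allowed tolerance
            break
    return w

# f(w) = -(w - 3)^2 + 5 has its maximum at w = 3; its derivative is -2 * (w - 3).
def f_grad(w):
    return -2.0 * (w - 3.0)

w_best = grad_ascent(f_grad, w0=0.0)
print(w_best)   # very close to 3
```

Both stopping conditions from item 3 appear here: the loop ends after `max_cycles` iterations or as soon as the update becomes smaller than `tol`.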

    #star import: exp, shape, ones, array, arange, sum and random used below all come from NumPy
    from numpy import *
    
    
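    def loadDataSet():
        #each line of testSet.txt holds two feature values followed by a 0/1 class label
        dataMat = []; labelMat = []
        with open('testSet.txt') as fr:
            for line in fr:
                lineArr = line.strip().split()
                #prepend 1.0 so that weights[0] acts as the intercept
                dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
                labelMat.append(int(lineArr[2]))
        return dataMat,labelMat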
    
    
    
    def sigmoid(inX):
        return 1.0/(1+exp(-inX))
    
    
    
    
    def gradAscent(dataMatIn, classLabels):
        #mat() was removed in NumPy 2.0; asmatrix() is the equivalent call
        dataMatrix = asmatrix(dataMatIn)             #m x n matrix of samples
        labelMat = asmatrix(classLabels).transpose() #m x 1 column of labels
        m,n = shape(dataMatrix)
        alpha = 0.001                           #step size
        maxCycles = 500                         #number of iterations
        weights = ones((n,1))
        for k in range(maxCycles):              #heavy on matrix operations
            h = sigmoid(dataMatrix*weights)     #matrix mult, m x 1 predictions
            error = (labelMat - h)              #vector subtraction
            weights = weights + alpha * dataMatrix.transpose()* error #gradient ascent step
        return weights                          #n x 1 NumPy matrix
    
    
    
    
        
    def plotBestFit(weights):
        import matplotlib.pyplot as plt
        dataMat,labelMat=loadDataSet()
        dataArr = array(dataMat)
        #gradAscent returns an n x 1 matrix; flatten it so that y below is a 1-D array
        weights = asarray(weights).flatten()
        n = shape(dataArr)[0]
        xcord1 = []; ycord1 = []
        xcord2 = []; ycord2 = []
        for i in range(n):
            if int(labelMat[i])== 1:
                xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
            else:
                xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
        ax.scatter(xcord2, ycord2, s=30, c='green')
        x = arange(-3.0, 3.0, 0.1)
        #decision boundary: 0 = w0 + w1*x1 + w2*x2, solved for x2
        y = (-weights[0]-weights[1]*x)/weights[2]
        ax.plot(x, y)
        plt.xlabel('X1'); plt.ylabel('X2')
        plt.show()
    
    def stocGradAscent0(dataMatrix, classLabels):
        m,n = shape(dataMatrix)
        alpha = 0.01
        weights = ones(n)   #initialize to all ones
        for i in range(m):
            h = sigmoid(sum(dataMatrix[i]*weights))
            error = classLabels[i] - h
            weights = weights + alpha * error * dataMatrix[i]
        return weights
    
    def stocGradAscent1(dataMatrix, classLabels, numIter=150):
        m,n = shape(dataMatrix)
        weights = ones(n)   #initialize to all ones
        for j in range(numIter):
            dataIndex = list(range(m))    #samples not yet used in this pass (a list, so del works)
            for i in range(m):
                #alpha decreases with iteration, does not go to 0 because of the constant
                alpha = 4/(1.0+j+i)+0.0001
                randIndex = int(random.uniform(0,len(dataIndex)))
                sampleIndex = dataIndex[randIndex]    #pick an unused sample at random
                h = sigmoid(sum(dataMatrix[sampleIndex]*weights))
                error = classLabels[sampleIndex] - h
                weights = weights + alpha * error * dataMatrix[sampleIndex]
                del(dataIndex[randIndex])
        return weights
    
    def classifyVector(inX, weights):
        prob = sigmoid(sum(inX*weights))
        if prob > 0.5: return 1.0
        else: return 0.0
    
    def colicTest():
        trainingSet = []; trainingLabels = []
        with open('horseColicTraining.txt') as frTrain:
            for line in frTrain:
                currLine = line.strip().split('\t')    #21 features, then the label
                lineArr = [float(currLine[i]) for i in range(21)]
                trainingSet.append(lineArr)
                trainingLabels.append(float(currLine[21]))
        trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)
        errorCount = 0; numTestVec = 0.0
        with open('horseColicTest.txt') as frTest:
            for line in frTest:
                numTestVec += 1.0
                currLine = line.strip().split('\t')
                lineArr = [float(currLine[i]) for i in range(21)]
                if int(classifyVector(array(lineArr), trainWeights))!= int(currLine[21]):
                    errorCount += 1
        errorRate = (float(errorCount)/numTestVec)
        print("the error rate of this test is: %f" % errorRate)
        return errorRate
    
    def multiTest():
        numTests = 10; errorSum=0.0
        for k in range(numTests):
            errorSum += colicTest()
        print("after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests)))
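For reference, the batch update in gradAscent (weights plus alpha times the transposed data matrix times the error) is what gradient ascent gives when the function being maximized is the log-likelihood of the training labels. The post does not state this objective, so the following is a sketch of the standard derivation; the notation (x_i for a sample row, y_i for its label, h_i for its predicted probability, X for the data matrix) is assumed here.

```latex
\ell(w) = \sum_{i=1}^{m} \Big[ y_i \log h_i + (1 - y_i) \log (1 - h_i) \Big],
\qquad h_i = \sigma(w^{\top} x_i), \quad \sigma(z) = \frac{1}{1 + e^{-z}}

% Using \sigma'(z) = \sigma(z)\,(1 - \sigma(z)):
\nabla_w \ell(w) = \sum_{i=1}^{m} (y_i - h_i)\, x_i = X^{\top} (y - h)

% Gradient ascent step with step size \alpha:
w \leftarrow w + \alpha\, X^{\top} (y - h)
```

The stochastic versions apply the same step using a single sample's term (y_i - h_i) x_i at a time instead of the full sum.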
            
  • Original post: https://www.cnblogs.com/xzm123/p/8984119.html