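1. The main idea of classification with logistic regression
Use the existing data to build a regression formula for the classification boundary, that is, find the best-fit parameter set, and then classify new samples with it.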
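Written out as formulas, the idea looks roughly as follows. The symbols x (feature vector), w (weight vector), z, and ŷ (predicted class) are notation chosen here for illustration; the sigmoid and the 0.5 threshold match the sigmoid and classifyVector functions in the code section.

```latex
z = w^{T} x = \sum_{k=1}^{n} w_k x_k, \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\hat{y} =
\begin{cases}
1 & \text{if } \sigma(z) > 0.5 \\
0 & \text{otherwise}
\end{cases}
```

Because \sigma(z) > 0.5 exactly when z > 0, the classification boundary is the set of points where w^{T} x = 0, and "finding the best-fit parameter set" means choosing w.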
2. Finding the best-fit parameters with gradient descent
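As a minimal sketch of what plain (full-batch) gradient descent could look like for this model: every iteration uses all samples to compute the gradient, then moves the weights a small step against it. The function name batch_gradient_descent, the step size alpha=0.1, the iteration count, and the tiny data set are all assumptions made for illustration, not part of the original post.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def batch_gradient_descent(X, y, alpha=0.1, num_iter=1000):
    """Hypothetical helper: full-batch gradient descent for logistic regression."""
    m, n = X.shape
    weights = np.zeros(n)
    for _ in range(num_iter):
        h = sigmoid(X @ weights)        # predicted probabilities, shape (m,)
        gradient = X.T @ (h - y) / m    # gradient of the log loss, averaged over samples
        weights = weights - alpha * gradient
    return weights


# Made-up data: a bias column of ones plus a single feature.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

weights = batch_gradient_descent(X, y)
predictions = (sigmoid(X @ weights) > 0.5).astype(float)
```

The update has the same form as the per-sample update used in the code section, (h - label) * x, except that here it is averaged over the whole data set on every step.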
3. Code implementation
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 28 21:35:25 2017

@author: MyHome
"""
import numpy as np
from random import randrange


def sigmoid(inX):
    """Sigmoid function."""
    return 1.0 / (1.0 + np.exp(-inX))


def StocGradientDescent(dataMatrix, classLabels, numIter=600):
    """Update the weights with stochastic gradient descent and return the final values."""
    m, n = dataMatrix.shape
    weights = np.ones(n)
    for j in range(numIter):
        # Indices of the samples not yet used in this pass
        dataIndex = list(range(m))
        for i in range(m):
            # The step size shrinks as training proceeds but never drops below 0.01
            alpha = 4 / (1.0 + j + i) + 0.01
            # Pick a random position among the unused samples, then look up the
            # sample index stored there, so no sample is drawn twice in one pass
            randPos = randrange(len(dataIndex))
            randIndex = dataIndex[randPos]
            h = sigmoid(np.sum(dataMatrix[randIndex] * weights))
            gradient = (h - classLabels[randIndex]) * dataMatrix[randIndex]
            weights = weights - alpha * gradient
            del dataIndex[randPos]
    return weights


def classifyVector(inX, weights):
    """Classifier: return 1.0 if the predicted probability exceeds 0.5, else 0.0."""
    prob = sigmoid(np.sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0


def Test():
    """Train on horseColicTraining.txt and return the error rate on horseColicTest.txt."""
    trainingSet = []
    trainingLabel = []
    with open("horseColicTraining.txt") as frTrain:
        for line in frTrain:
            # Split on any whitespace, so tab- and space-separated files both work
            currLine = line.strip().split()
            # The first 21 columns are features; the 22nd is the class label
            lineArr = [float(currLine[i]) for i in range(21)]
            trainingSet.append(lineArr)
            trainingLabel.append(float(currLine[21]))
    trainWeights = StocGradientDescent(np.array(trainingSet), trainingLabel)
    errorCount = 0.0
    numTestVec = 0.0
    with open("horseColicTest.txt") as frTest:
        for line in frTest:
            numTestVec += 1.0
            currLine = line.strip().split()
            lineArr = [float(currLine[i]) for i in range(21)]
            # int(float(...)) also accepts labels written as "1.000000"
            if int(classifyVector(np.array(lineArr), trainWeights)) != int(float(currLine[21])):
                errorCount += 1
    errorRate = errorCount / numTestVec
    print("the error rate of this test is:%f" % errorRate)
    return errorRate


def multiTest():
    """Call Test() 10 times and report the average error rate."""
    numTest = 10
    errorSum = 0.0
    for k in range(numTest):
        errorSum += Test()
    print("after %d iterations the average error rate is: %f" % (numTest, errorSum / float(numTest)))


if __name__ == "__main__":
    multiTest()
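Results (one sample run; the error rates vary from run to run because training samples are drawn at random):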
the error rate of this test is:0.522388
the error rate of this test is:0.328358
the error rate of this test is:0.313433
the error rate of this test is:0.358209
the error rate of this test is:0.298507
the error rate of this test is:0.343284
the error rate of this test is:0.283582
the error rate of this test is:0.313433
the error rate of this test is:0.343284
the error rate of this test is:0.358209
after 10 iterations the average error rate is: 0.346269
4. Summary
Logistic regression is finding best-fit parameters to a nonlinear function called the sigmoid.
Optimization methods can be used to find the best-fit parameters. One of the most
common optimization algorithms is gradient descent. Gradient descent can be
simplified into stochastic gradient descent, which updates the weights using one
sample at a time.
Stochastic gradient descent can do as well as gradient descent using far fewer computing
resources. In addition, stochastic gradient descent is an online algorithm; it can
update what it has learned as new data comes in rather than reloading all of the data
as in batch processing.
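The online property can be sketched as follows: each arriving sample triggers one weight update, and no earlier data has to be kept or reloaded. The helper name online_update, the fixed step size alpha=0.1, and the small stream of samples are assumptions for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def online_update(weights, x, label, alpha=0.1):
    """Hypothetical helper: one stochastic gradient step for a single new sample."""
    h = sigmoid(np.dot(x, weights))
    return weights - alpha * (h - label) * x


# Made-up stream of (features, label) pairs; the first feature is a bias term.
stream = [
    (np.array([1.0, 2.0]), 1.0),
    (np.array([1.0, -2.0]), 0.0),
    (np.array([1.0, 1.5]), 1.0),
    (np.array([1.0, -1.0]), 0.0),
]

weights = np.zeros(2)
for x, label in stream:
    # Only the current sample and the current weights are needed.
    weights = online_update(weights, x, label)
```

Each step costs time proportional to the number of features, independent of how many samples have been seen so far.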
One major problem in machine learning is how to deal with missing values in the
data. There’s no blanket answer to this question. It really depends on what you’re
doing with the data. There are a number of solutions, and each solution has its own
advantages and disadvantages.
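Two common options, shown here only as a sketch, are to fill missing entries with 0 or with the mean of the column. The feature matrix below is made up, and marking missing entries with np.nan is an assumption; the post does not say how missing values are encoded in its data files.

```python
import numpy as np

# Made-up feature matrix; np.nan marks a missing measurement.
features = np.array([
    [1.0, np.nan, 3.0],
    [2.0, 4.0, np.nan],
    [3.0, 6.0, 9.0],
])

# Option 1: replace every missing value with 0.
zero_filled = np.nan_to_num(features, nan=0.0)

# Option 2: replace each missing value with the mean of its column,
# computed only from the values that are present.
column_means = np.nanmean(features, axis=0)
mean_filled = np.where(np.isnan(features), column_means, features)
```

For logistic regression, filling with 0 has a convenient side effect: a feature equal to 0 contributes nothing to the weighted sum, and the per-sample gradient (h - label) * x leaves the corresponding weight unchanged. Mean filling keeps the column average intact but makes the filled samples look more typical than they may be.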