  • Logistic Regression for predicting whether a horse is sick

    1. The main idea of classification with logistic regression

    Build a regression formula for the class decision boundary from the existing data, that is, find the best-fit parameter set, and then use it to classify new samples.
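    As a quick sketch of the model (my notation, not from the original post): for a feature vector x and weight vector w, the prediction is the sigmoid of the weighted sum, and the class is obtained by thresholding it at 0.5:

      \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^{\top}x, \qquad
      \hat{y} = \begin{cases} 1 & \text{if } \sigma(w^{\top}x) > 0.5 \\ 0 & \text{otherwise} \end{cases}

    The decision boundary is the set of points where w^{\top}x = 0.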

    2. Use gradient descent to find the best-fit parameters
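    The code in section 3 uses the stochastic (per-sample) form of gradient descent: for a randomly chosen training sample (x_i, y_i) the weights move against the prediction error,

      w \leftarrow w - \alpha \, \bigl( \sigma(w^{\top}x_i) - y_i \bigr) \, x_i

    where the step size \alpha decays as 4/(1 + j + i) + 0.01 over the outer iteration j and the inner index i, so early updates are large and later ones settle down.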

    3. Implementation

    # -*- coding: utf-8 -*-
    """
    Created on Tue Mar 28 21:35:25 2017

    @author: MyHome
    """
    import numpy as np
    from random import uniform

    '''Define the sigmoid function'''
    def sigmoid(inX):
        return 1.0 / (1.0 + np.exp(-inX))

    '''Update the weights with stochastic gradient descent and return the final values'''
    def StocGradientDescent(dataMatrix, classLabels, numIter=600):
        m, n = dataMatrix.shape
        weights = np.ones(n)
        for j in range(numIter):
            dataIndex = list(range(m))  # samples not yet visited in this pass
            for i in range(m):
                # step size decays with the iteration counters but never reaches 0
                alpha = 4 / (1.0 + j + i) + 0.01
                # pick one of the remaining samples at random (without replacement)
                randIndex = int(uniform(0, len(dataIndex)))
                sample = dataIndex[randIndex]
                h = sigmoid(sum(dataMatrix[sample] * weights))
                gradient = (h - classLabels[sample]) * dataMatrix[sample]
                weights = weights - alpha * gradient
                del(dataIndex[randIndex])
        return weights

    '''Build the classifier: threshold the sigmoid output at 0.5'''
    def classifyVector(inX, weights):
        prob = sigmoid(sum(inX * weights))
        if prob > 0.5:
            return 1.0
        else:
            return 0.0

    '''Test: train on the training file and report the error rate on the test file'''
    def Test():
        frTrain = open("horseColicTraining.txt")
        frTest = open("horseColicTest.txt")
        trainingSet = []
        trainingLabel = []
        for line in frTrain.readlines():
            currLine = line.strip().split("\t")
            lineArr = []
            for i in range(21):
                lineArr.append(float(currLine[i]))
            trainingSet.append(lineArr)
            trainingLabel.append(float(currLine[21]))
        trainWeights = StocGradientDescent(np.array(trainingSet), trainingLabel)
        errorCount = 0.0
        numTestVec = 0.0
        for line in frTest.readlines():
            numTestVec += 1.0
            currLine = line.strip().split("\t")
            lineArr = []
            for i in range(21):
                lineArr.append(float(currLine[i]))
            if int(classifyVector(np.array(lineArr), trainWeights)) != int(currLine[21]):
                errorCount += 1
        errorRate = float(errorCount) / numTestVec
        print("the error rate of this test is:%f" % errorRate)
        return errorRate

    '''Call Test() 10 times and average the error rate'''
    def multiTest():
        numTest = 10
        errorSum = 0.0
        for k in range(numTest):
            errorSum += Test()
        print("after %d iterations the average error rate is:%f" % (numTest, errorSum / float(numTest)))

    if __name__ == "__main__":
        multiTest()
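    To reproduce the run below, the script expects horseColicTraining.txt and horseColicTest.txt in the working directory, each line holding 21 tab-separated feature values followed by the class label in column 22.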

    Results:

    the error rate of this test is:0.522388
    the error rate of this test is:0.328358
    the error rate of this test is:0.313433
    the error rate of this test is:0.358209
    the error rate of this test is:0.298507
    the error rate of this test is:0.343284
    the error rate of this test is:0.283582
    the error rate of this test is:0.313433
    the error rate of this test is:0.343284
    the error rate of this test is:0.358209
    after 10 iterations the average error rate is:0.346269

    4. Summary

    Logistic regression finds best-fit parameters for a nonlinear function called the sigmoid.
    Methods of optimization can be used to find the best-fit parameters. Among the
    optimization algorithms, one of the most common is gradient descent. Gradient
    descent can be simplified with stochastic gradient descent.
    Stochastic gradient descent can do as well as gradient descent using far fewer computing
    resources. In addition, stochastic gradient descent is an online algorithm; it can
    update what it has learned as new data comes in rather than reloading all of the data
    as in batch processing.
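    As a rough illustration of that online behaviour (my own sketch, reusing the sigmoid function and a trained weights vector from the code above; newSample and newLabel are hypothetical), a single fresh observation can refine the model with one update instead of retraining on the whole file:

      import numpy as np

      def online_update(weights, newSample, newLabel, alpha=0.01):
          '''One stochastic-gradient step on a single new (sample, label) pair.'''
          h = sigmoid(np.sum(newSample * weights))   # prediction for the new sample
          gradient = (h - newLabel) * newSample      # same gradient as in StocGradientDescent
          return weights - alpha * gradient          # move the weights against the error

      # e.g. trainWeights = online_update(trainWeights, np.array(lineArr), 1.0)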
    One major problem in machine learning is how to deal with missing values in the
    data. There’s no blanket answer to this question. It really depends on what you’re
    doing with the data. There are a number of solutions, and each solution has its own
    advantages and disadvantages.
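    One simple option (a sketch of my own, not something the original post does, and it assumes missing entries are stored as NaN) is to replace missing feature values with 0 before training: a zero feature contributes nothing to the weighted sum w·x, and the stochastic gradient update leaves the corresponding weight untouched, so the missing entry neither helps nor hurts that coefficient.

      import numpy as np

      def fill_missing_with_zero(dataMatrix):
          '''Replace NaN entries with 0 so they neither shift w·x nor move any weight.'''
          data = np.array(dataMatrix, dtype=float)
          data[np.isnan(data)] = 0.0
          return data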

  • Original article: https://www.cnblogs.com/lpworkstudyspace1992/p/6639120.html