  • Using logistic regression to predict whether a horse is sick

    1. The main idea of classifying with logistic regression

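    Build a regression formula for the decision boundary from the existing data, that is, find the best-fit set of parameters, and then use it to classify.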
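
A minimal sketch of this idea follows. The weights and samples are made-up illustrative values, not fitted parameters. The model takes a weighted sum of the features and passes it through the sigmoid to get a probability; the decision boundary is where that probability equals 0.5, which is where the weighted sum is zero.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(x, w):
    """Probability of class 1 for feature vector x (first entry is the bias input 1.0)."""
    return sigmoid(np.dot(x, w))


# Hypothetical weights: bias term plus two feature weights
weights = np.array([-1.0, 2.0, 0.5])

sample_pos = np.array([1.0, 1.0, 2.0])   # weighted sum = 2  -> probability above 0.5
sample_neg = np.array([1.0, 0.0, 0.0])   # weighted sum = -1 -> probability below 0.5
sample_edge = np.array([1.0, 0.5, 0.0])  # weighted sum = 0  -> exactly on the boundary

for name, x in [("pos", sample_pos), ("neg", sample_neg), ("edge", sample_edge)]:
    print(name, predict_proba(x, weights))
```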

    2. Finding the best-fit parameters with gradient descent
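
As a rough illustration of this step, the sketch below runs full-batch gradient descent on a tiny made-up dataset. For the log-loss, the gradient with respect to the weights is X^T (sigmoid(Xw) - y), so each step moves the weights a small distance against it. The function name `batch_gradient_descent`, the step size, and the iteration count are assumptions chosen for the example and do not come from the original post.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def batch_gradient_descent(X, y, alpha=0.1, num_iter=500):
    """Fit logistic regression weights using every sample in each step."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(num_iter):
        h = sigmoid(X @ w)            # predicted probabilities, shape (m,)
        gradient = X.T @ (h - y) / m  # average gradient of the log-loss
        w = w - alpha * gradient      # step against the gradient
    return w


# Made-up data: a bias column plus one feature; class 1 when the feature is positive
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = batch_gradient_descent(X, y)
predictions = (sigmoid(X @ w) > 0.5).astype(float)
print(w, predictions)
```

Every step here touches all m samples, which is the cost that the stochastic variant avoids.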

    3. Implementation

    # -*- coding: utf-8 -*-
    """
    Created on Tue Mar 28 21:35:25 2017

    @author: MyHome
    """
    import random

    import numpy as np


    def sigmoid(inX):
        """Sigmoid function."""
        return 1.0 / (1.0 + np.exp(-inX))


    def StocGradientDescent(dataMatrix, classLabels, numIter=600):
        """Update the weights with stochastic gradient descent and return the final values."""
        m, n = dataMatrix.shape
        weights = np.ones(n)
        for j in range(numIter):
            # Indices of the samples not yet used in this pass
            dataIndex = list(range(m))
            for i in range(m):
                # The step size decays as training proceeds but never drops below 0.01
                alpha = 4 / (1.0 + j + i) + 0.01
                # Pick a random position in dataIndex and map it to a sample index,
                # so that each sample is used exactly once per pass
                randPos = random.randrange(len(dataIndex))
                sampleIndex = dataIndex[randPos]
                h = sigmoid(np.dot(dataMatrix[sampleIndex], weights))
                gradient = (h - classLabels[sampleIndex]) * dataMatrix[sampleIndex]
                weights = weights - alpha * gradient
                del dataIndex[randPos]
        return weights


    def classifyVector(inX, weights):
        """Classifier: return 1.0 if the predicted probability exceeds 0.5, else 0.0."""
        prob = sigmoid(np.dot(inX, weights))
        return 1.0 if prob > 0.5 else 0.0


    def Test():
        """Train on the training file and return the error rate on the test file."""
        trainingSet = []
        trainingLabel = []
        with open("horseColicTraining.txt") as frTrain:
            for line in frTrain:
                currLine = line.strip().split("\t")
                # Columns 0-20 are the features; column 21 is the class label
                trainingSet.append([float(v) for v in currLine[:21]])
                trainingLabel.append(float(currLine[21]))
        trainWeights = StocGradientDescent(np.array(trainingSet), trainingLabel)
        errorCount = 0
        numTestVec = 0
        with open("horseColicTest.txt") as frTest:
            for line in frTest:
                numTestVec += 1
                currLine = line.strip().split("\t")
                lineArr = np.array([float(v) for v in currLine[:21]])
                # Parse the label via float so that "1" and "1.000000" both work
                if int(classifyVector(lineArr, trainWeights)) != int(float(currLine[21])):
                    errorCount += 1
        errorRate = errorCount / numTestVec
        print("the error rate of this test is:%f" % errorRate)
        return errorRate


    def multiTest():
        """Call Test() 10 times and report the average error rate."""
        numTest = 10
        errorSum = 0.0
        for k in range(numTest):
            errorSum += Test()
        print("after %d iterations the average error rate is: %f"
              % (numTest, errorSum / numTest))


    if __name__ == "__main__":
        multiTest()

    Results:

    the error rate of this test is:0.522388
    the error rate of this test is:0.328358
    the error rate of this test is:0.313433
    the error rate of this test is:0.358209
    the error rate of this test is:0.298507
    the error rate of this test is:0.343284
    the error rate of this test is:0.283582
    the error rate of this test is:0.313433
    the error rate of this test is:0.343284
    the error rate of this test is:0.358209
    after 10 iterations the average error rate is: 0.346269
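
The reported average can be checked by hand from the ten printed rates. The short sketch below does that arithmetic. The test-set size of 67 rows is an assumption, not something stated in the post; it is used only because every printed rate is then a whole number of misclassified rows divided by 67.

```python
# The ten error rates printed above
error_rates = [0.522388, 0.328358, 0.313433, 0.358209, 0.298507,
               0.343284, 0.283582, 0.313433, 0.343284, 0.358209]

average = sum(error_rates) / len(error_rates)
print("average error rate: %f" % average)

# Assumption: 67 test rows. Under it, each rate maps to a whole number of errors.
num_test = 67
misclassified = [round(rate * num_test) for rate in error_rates]
print(misclassified)
```

The spread between the best and worst run (19 versus 35 errors under that assumption) shows how much a single run of stochastic gradient descent can vary, which is why the runs are averaged.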

    4. Summary

    Logistic regression finds best-fit parameters to a nonlinear function called the sigmoid.
    Optimization methods can be used to find these best-fit parameters, and one of the most
    common is gradient descent. Gradient descent can in turn be simplified to stochastic
    gradient descent, which updates the weights from one sample at a time.
    Stochastic gradient descent can do as well as gradient descent using far fewer computing
    resources. In addition, stochastic gradient descent is an online algorithm; it can
    update what it has learned as new data comes in rather than reloading all of the data
    as in batch processing.
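
To illustrate the online property, here is a minimal sketch in which the weights are updated from one sample at a time as it arrives. The function name `online_update`, the step size, and the simulated stream are assumptions made for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def online_update(weights, x, label, alpha=0.1):
    """Apply one stochastic gradient step using a single (x, label) sample."""
    error = sigmoid(np.dot(x, weights)) - label
    return weights - alpha * error * x


# Simulated stream of made-up samples: (bias input, feature) -> label
stream = [
    (np.array([1.0, 2.0]), 1.0),
    (np.array([1.0, -2.0]), 0.0),
    (np.array([1.0, 1.5]), 1.0),
    (np.array([1.0, -1.0]), 0.0),
]

weights = np.zeros(2)
for x, label in stream:
    # Only the current sample is needed; earlier samples are never revisited
    weights = online_update(weights, x, label)

print(weights)
```

Because each step needs only the current weights and the new sample, the model can keep learning without storing or reloading the earlier data.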
    One major problem in machine learning is how to deal with missing values in the
    data. There’s no blanket answer to this question. It really depends on what you’re
    doing with the data. There are a number of solutions, and each solution has its own
    advantages and disadvantages.
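
One such option, shown purely as an illustration, is to replace missing feature values with 0. The post does not say how the horse colic files were prepared, and the sketch assumes missing entries are encoded as NaN. With an update of the form weights - alpha * (h - label) * x, a feature equal to 0 adds nothing to the weighted sum and leaves its own weight unchanged.

```python
import numpy as np


def fill_missing_with_zero(features):
    """Return a copy of the feature matrix with NaN entries replaced by 0.0."""
    features = np.asarray(features, dtype=float)
    return np.where(np.isnan(features), 0.0, features)


# Made-up rows with missing entries marked as NaN (an assumed encoding)
raw = np.array([[38.5, np.nan, 28.0],
                [np.nan, 88.0, 20.0]])
cleaned = fill_missing_with_zero(raw)

# One update step on the first row: the weight of the zero-filled feature does not move
weights = np.ones(3)
x, label, alpha = cleaned[0], 1.0, 0.01
h = 1.0 / (1.0 + np.exp(-np.dot(x, weights)))
updated = weights - alpha * (h - label) * x
print(cleaned, updated)
```

Other choices, such as filling with the feature mean or dropping incomplete rows, trade off differently between bias and lost data.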

  • Original article: https://www.cnblogs.com/lpworkstudyspace1992/p/6639120.html