  • [Machine Learning in Action -- 2] Building a Handwriting Recognition System with the k-Nearest Neighbors Algorithm

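      To keep things simple, the system built here recognizes only the digits 0 through 9. The digits to be recognized have already been processed with image-editing software so that they share the same color and size: black-and-white images 32 pixels wide and 32 pixels high. Storing images as text does not use memory efficiently, but for ease of understanding we convert the images to text format anyway.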

    ---1. Collect data: text files are provided

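      The dataset is a modified version of the "Optical Recognition of Handwritten Digits" dataset, as listed in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) on October 3, 2010.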

          

    ---2. Prepare data: convert images to test vectors

      The trainingDigits directory contains roughly 2000 examples, about 200 samples per digit; testDigits contains roughly 900 test samples. The two sets do not overlap.

      We first reformat each image as a single vector: the 32*32 binary image matrix becomes a 1*1024 vector.
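      To make the index arithmetic concrete, here is a minimal sketch of the flattening step. The image below is a made-up example, not a file from the dataset; it only illustrates that pixel (i, j) of the 32*32 matrix ends up at position 32*i+j of the 1*1024 vector.

```python
import numpy as np

# A hypothetical 32x32 binary image: all zeros except a short horizontal stroke.
image = np.zeros((32, 32), dtype=int)
image[1, 12:19] = 1            # row 1, columns 12..18 are "ink"

# Row-major flattening: pixel (i, j) lands at index 32*i + j.
vector = image.reshape(1, 1024)

i, j = 1, 12
flat_index = 32 * i + j        # 32*1 + 12 = 44
print(vector.shape)            # (1, 1024)
print(vector[0, flat_index] == image[i, j])   # True
```

      The same formula appears in img2vector, which fills the vector one character at a time instead of calling reshape.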

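      We start by writing the function img2vector, which converts an image to a vector. The function creates a 1*1024 NumPy array, opens the given file, loops over the first 32 lines of the file, stores the first 32 character values of each line in the NumPy array, and finally returns the array.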

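    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    from numpy import tile, zeros   # NumPy, the scientific computing package
    from os import listdir          # lists the file names in a directory
    import operator                 # standard operator module (itemgetter)
    
    # Core of the algorithm
    # inX: input vector to be classified
    # dataSet: matrix of training samples, one sample per row
    # labels: vector of class labels, one per training sample
    # k: number of nearest neighbors that vote
    def classifyO(inX,dataSet,labels,k):     
        # Distance computation
        dataSetSize=dataSet.shape[0] # number of rows, i.e. number of training samples
        diffMat=tile(inX,(dataSetSize,1))-dataSet  # tile repeats inX once per training row; diffMat holds the differences between the input and every training sample
        sqDiffMat=diffMat**2         # square each element
        sqDistances=sqDiffMat.sum(axis=1)  # sum the squares along each row
        distances=sqDistances**0.5   # square root gives the Euclidean distance
        sortedDistIndicies=distances.argsort()  # indices that sort the distances in ascending order
        # Let the k closest points vote
        classCount={}
        for i in range(k):
            voteIlabel=labels[sortedDistIndicies[i]]
            classCount[voteIlabel]=classCount.get(voteIlabel,0)+1
        # Sort the classes by vote count, most votes first
        sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)  # items() works on both Python 2 and 3
        return sortedClassCount[0][0]
    
    def img2vector(filename):
        # Read a 32x32 text image and return it as a 1x1024 array
        returnVect=zeros((1,1024))
        with open(filename) as fr:   # the file is closed automatically
            for i in range(32):
                lineStr=fr.readline()
                for j in range(32):
                    returnVect[0,32*i+j]=int(lineStr[j])
        return returnVect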
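
      The distance-and-vote logic is easier to follow on a tiny dataset. The sketch below is an illustrative variant, not the code from kNN.py: the function name knn_classify, the four sample points and the labels 'A' and 'B' are all made up for the example, and NumPy broadcasting stands in for tile.

```python
import numpy as np
from collections import Counter

def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k training rows closest to in_x."""
    diffs = data_set - in_x                         # broadcasting replaces tile()
    distances = np.sqrt((diffs ** 2).sum(axis=1))   # Euclidean distance per row
    nearest = distances.argsort()[:k]               # indices of the k closest rows
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two points near (1, 1) and two near (0, 0).
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(knn_classify(np.array([0.0, 0.0]), group, labels, 3))   # B
print(knn_classify(np.array([1.0, 1.2]), group, labels, 3))   # A
```

      With k=3, the query (0, 0) has two 'B' neighbors and one 'A' neighbor, so 'B' wins the vote; the handwriting system applies the same idea to 1024-dimensional vectors.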
    

      Enter the following commands at the Python prompt to test img2vector, then compare the output with the same file opened in a text editor:

    >>> import kNN
    >>> testVector=kNN.img2vector('digits/testDigits/0_13.txt') # adjust the path to your own directory
    >>> testVector[0,0:31]
    array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
            0.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
            0.,  0.,  0.,  0.,  0.])
    >>> testVector[0,32:63]
    array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
            1.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
            0.,  0.,  0.,  0.,  0.])
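
      Comparing a long array against a text file by eye is tedious, so one option is to turn the vector back into 32 lines of text. The helper below is a hypothetical addition (vector_to_text is not part of kNN.py), and the demo vector is built by hand rather than loaded from 0_13.txt.

```python
import numpy as np

def vector_to_text(vect):
    """Render a 1x1024 vector of 0/1 values as 32 lines of 32 characters."""
    rows = np.asarray(vect).reshape(32, 32).astype(int)
    return "\n".join("".join(str(v) for v in row) for row in rows)

# Hand-made stand-in for the output of img2vector: ones at positions 14..17,
# matching the first row printed in the session above.
demo = np.zeros((1, 1024))
demo[0, 14:18] = 1.0

text = vector_to_text(demo)
print(text.splitlines()[0])    # 14 zeros, four ones, 14 zeros
```

      If the printed lines match the first 32 lines of the text file character for character, the conversion worked.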

    ---3. Test the algorithm: recognize handwritten digits with k-nearest neighbors

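      The data is now in a format the classifier can use; the next step is to feed it to the classifier and check the results. handwritingClassTest() is the code that tests the classifier; add it to the kNN.py file. Before doing so, make sure that from os import listdir appears at the top of the file. That statement imports the listdir function from the os module, which lists the file names in a given directory.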

    def handwritingClassTest():
        hwLabels=[]
        trainingDir='E:/python excise/digits/trainingDigits'  # adjust to your own directory; forward slashes avoid the '\t' escape in a backslash path
        trainingFileList=listdir(trainingDir)
        m=len(trainingFileList)
        trainingMat=zeros((m,1024))
        for i in range(m):
            fileNameStr=trainingFileList[i]
            fileStr=fileNameStr.split('.')[0]        # strip the .txt extension
            classNumStr=int(fileStr.split('_')[0])   # the digit before '_' is the class label
            hwLabels.append(classNumStr)
            trainingMat[i,:]=img2vector('%s/%s' %(trainingDir,fileNameStr))
        testDir='E:/python excise/digits/testDigits'          # adjust to your own directory
        testFileList=listdir(testDir)
        errorCount=0.0
        mTest=len(testFileList)
        for i in range(mTest):
            fileNameStr=testFileList[i]
            fileStr=fileNameStr.split('.')[0]
            classNumStr=int(fileStr.split('_')[0])
            vectorUnderTest=img2vector('%s/%s' %(testDir,fileNameStr))
            classifierResult=classifyO(vectorUnderTest,trainingMat,hwLabels,3)
            print("the classifier came back with:%d,the real answeris:%d" %(classifierResult,classNumStr))
            if(classifierResult !=classNumStr):errorCount+=1.0
        print("\nthe total number of error is:%d"%errorCount)
        print("\nthe total error rate is:%f"%(errorCount/float(mTest)))
    

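      Explanation: the names of the files in the E:\python excise\digits\trainingDigits directory are stored in the list trainingFileList, and the number of files in the directory is stored in the variable m. Next, the code creates a training matrix with m rows and 1024 columns, in which each row holds one image. The class digit can be parsed from the file name, because the files in this directory follow a naming convention: for example, the file 9_45.txt belongs to class 9 and is the 45th instance of the digit 9. The class label is then appended to the hwLabels vector, and the image is loaded with the img2vector function shown earlier.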
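
      The file-name parsing can be tried on its own. In this small sketch the helper name parse_digit_filename is made up for illustration; it applies the same two split calls as handwritingClassTest and also returns the instance number, which the original code does not use.

```python
def parse_digit_filename(file_name):
    """Split a name like '9_45.txt' into (class label, instance number)."""
    stem = file_name.split('.')[0]              # '9_45'
    label_str, instance_str = stem.split('_')   # '9', '45'
    return int(label_str), int(instance_str)

print(parse_digit_filename('9_45.txt'))   # (9, 45)
print(parse_digit_filename('0_13.txt'))   # (0, 13)
```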

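      Next, the same steps are applied to the files in the E:/python excise/digits/testDigits directory. The difference is that these files are not loaded into a matrix; instead, each file is classified with the classifyO() function. Because the values in the files already lie between 0 and 1, no normalization is needed.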

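      Enter kNN.handwritingClassTest() at the Python prompt to see the function's output. Depending on the speed of your machine, loading the dataset may take a long time; the function then tests each file in turn: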

    >>> kNN.handwritingClassTest()
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:0,the real answeris:0
    the classifier came back with:1,the real answeris:1
    the classifier came back with:1,the real answeris:1
    the classifier came back with:1,the real answeris:1
    the classifier came back with:1,the real answeris:1
    ...
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    the classifier came back with:9,the real answeris:9
    
    the total number of error is:11
    
    the total error rate is:0.011628
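
      The two summary lines can be cross-checked with a little arithmetic. The run above does not print the number of test files, so the value computed below is inferred from the reported error count and error rate; treat it as an estimate, not a figure stated by the source.

```python
errors = 11                 # "the total number of error is:11"
reported_rate = 0.011628    # "the total error rate is:0.011628"

# Solve errors / m_test = reported_rate for the number of test files.
m_test = int(round(errors / reported_rate))
print(m_test)                              # 946

# Recompute the rate the same way handwritingClassTest does.
print("%f" % (errors / float(m_test)))     # 0.011628
```

      An inferred test set of 946 files is consistent with the "roughly 900" mentioned earlier.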
    

    Summary
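      On the handwritten digit dataset, the k-nearest neighbors algorithm achieves an error rate of 1.2%. Changing the value of k, modifying handwritingClassTest to select training samples at random, and changing the number of training samples all affect the error rate of the k-nearest neighbors algorithm.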
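
      One way to explore the effect of k is to measure the error rate for several values in a loop. The sketch below is only an illustration under stated assumptions: it uses synthetic 64-bit binary vectors generated from two made-up prototypes instead of the digit files, and the names knn_predict, error_rate and make_samples are hypothetical. The numbers it prints say nothing about the real dataset.

```python
import numpy as np
from collections import Counter

def knn_predict(x, train, labels, k):
    """Majority label among the k training rows closest to x."""
    dists = np.sqrt(((train - x) ** 2).sum(axis=1))
    nearest = dists.argsort()[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

def error_rate(train, labels, test, test_labels, k):
    """Fraction of test rows whose predicted label differs from the true one."""
    wrong = sum(1 for x, y in zip(test, test_labels)
                if knn_predict(x, train, labels, k) != y)
    return wrong / float(len(test))

rng = np.random.RandomState(0)                 # fixed seed so runs are repeatable
prototypes = rng.randint(0, 2, size=(2, 64))   # one synthetic bit pattern per class

def make_samples(n):
    """Draw n noisy copies of the prototypes (about 20% of bits flipped)."""
    y = rng.randint(0, 2, size=n)
    flip = rng.rand(n, 64) < 0.2
    x = np.where(flip, 1 - prototypes[y], prototypes[y])
    return x, y.tolist()

train_x, train_y = make_samples(200)
test_x, test_y = make_samples(100)

rates = {}
for k in (1, 3, 5, 7):
    rates[k] = error_rate(train_x, train_y, test_x, test_y, k)
    print("k=%d error rate=%f" % (k, rates[k]))
```

      Replacing the synthetic arrays with trainingMat, hwLabels and the test vectors built in handwritingClassTest would give the corresponding sweep on the digit data; passing a smaller slice of the training rows shows the effect of the training-set size.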

     

  • Original article: https://www.cnblogs.com/chamie/p/4830643.html