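  • Recognizing Handwritten Digits with the k-Nearest Neighbors (kNN) Algorithm

    k-Nearest Neighbors (kNN)

    The trainingDigits directory contains roughly 2,000 examples, about 200 samples per digit; each example is a text file holding a 32x32 binary image. The testDigits directory contains roughly 900 test samples.

    Each 32x32 binary image matrix is converted into a 1x1024 vector.

    The img2vector function performs this conversion. It creates a 1x1024 NumPy array, opens the given file, loops over the first 32 lines, stores the first 32 character values of each line in the array, and finally returns the array.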

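    # Convert a text-encoded image into a vector
    def img2vector(filename):
        returnVect = zeros((1,1024))
        with open(filename) as fr:           # the file is closed when the block ends
            for i in range(32):
                lineStr = fr.readline()
                for j in range(32):
                    # row i, column j of the image maps to position 32*i+j
                    returnVect[0,32*i+j] = int(lineStr[j])
        return returnVect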
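
The index mapping 32*i+j can be checked on a synthetic file. The following is a minimal sketch, not the original author's code: `img2vector_fast` is a hypothetical variant that builds the 32x32 grid first and then reshapes it, and the temporary file stands in for a real sample from trainingDigits.

```python
import os
import tempfile

import numpy as np


def img2vector_fast(filename):
    """Read the first 32 characters of the first 32 lines into a 1x1024 array."""
    with open(filename) as fr:
        rows = [[int(ch) for ch in fr.readline()[:32]] for _ in range(32)]
    # reshape flattens row by row, so pixel (i, j) lands at position 32*i + j
    return np.array(rows, dtype=float).reshape(1, 1024)


# Synthetic 32x32 "image": all zeros except a single 1 at row 2, column 5.
lines = [["0"] * 32 for _ in range(32)]
lines[2][5] = "1"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join("".join(row) for row in lines) + "\n")
    path = tmp.name

vec = img2vector_fast(path)
os.remove(path)

print(vec.shape)                 # (1, 1024)
print(int(vec[0, 32 * 2 + 5]))   # 1
print(int(vec.sum()))            # 1
```

The single nonzero pixel ends up at index 69, which is exactly 32*2+5, the same position the nested loop in img2vector would write to.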
    

    Feed this data to the classifier and check how well it performs.

    # Test the algorithm
    def handwritingClassTest():
        hwLabels = []
        trainingFileList = listdir('trainingDigits')           # load the training set
        m = len(trainingFileList)
        trainingMat = zeros((m,1024))
        for i in range(m):
            fileNameStr = trainingFileList[i]
            fileStr = fileNameStr.split('.')[0]                # strip the extension
            classNumStr = int(fileStr.split('_')[0])           # the digit before '_' is the label
            hwLabels.append(classNumStr)
            trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)
        testFileList = listdir('testDigits')        # iterate over the test set
        errorCount = 0.0
        mTest = len(testFileList)
        for i in range(mTest):
            fileNameStr = testFileList[i]
            fileStr = fileNameStr.split('.')[0]
            classNumStr = int(fileStr.split('_')[0])
            vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
            classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
            print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
            if (classifierResult != classNumStr): errorCount += 1.0
        print("\nthe total number of errors is: %d" % errorCount)
        print("\nthe total error rate is: %f" % (errorCount/float(mTest)))
    

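     Testing the algorithm: write a function that uses part of the provided data set as test samples; whenever the predicted class differs from the actual class, count it as an error.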
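
The bookkeeping behind this test can be sketched in isolation. The file names and predictions below are made up for illustration; `label_from_filename` and `error_rate` are hypothetical helpers that mirror the split-on-'.'-then-'_' parsing and the error counting in handwritingClassTest.

```python
def label_from_filename(file_name):
    """Extract the digit label from a name such as '9_45.txt' (made-up example)."""
    stem = file_name.split(".")[0]      # '9_45'
    return int(stem.split("_")[0])      # 9


def error_rate(predicted, actual):
    """Return the number of mismatches and the fraction they represent."""
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors, errors / len(actual)


file_names = ["0_0.txt", "3_12.txt", "9_45.txt", "9_46.txt"]  # hypothetical names
actual = [label_from_filename(name) for name in file_names]
predicted = [0, 3, 9, 7]                                      # hypothetical predictions

errors, rate = error_rate(predicted, actual)
print(actual)   # [0, 3, 9, 9]
print(errors)   # 1
print(rate)     # 0.25
```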

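    The classify0() function takes four parameters: inX is the input vector to classify, dataSet is the training set, labels is the label vector, and k is the number of nearest neighbors to use. The label vector must have as many elements as dataSet has rows.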

    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        # tile repeats inX dataSetSize times along rows and once along columns,
        # so it can be subtracted from dataSet element by element
        diffMat = tile(inX, (dataSetSize,1)) - dataSet
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis=1)       # axis=1: sum the values within each row
        distances = sqDistances**0.5              # Euclidean distance to every training sample
        sortedDistIndicies = distances.argsort()  # indices that sort the distances in ascending order
        classCount={}
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            # look up the count stored under voteIlabel (0 if absent) and add one vote
            classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
        # sort the (label, votes) pairs by vote count, highest first
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
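
To see the voting on something smaller than 1024 dimensions, here is a minimal sketch on four 2-D points. It is not the listing above: `knn_classify` is a hypothetical equivalent that relies on NumPy broadcasting instead of tile() and on collections.Counter instead of a hand-built dictionary, and the toy data is invented for illustration.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Majority vote among the k training rows closest to in_x (Euclidean distance)."""
    diffs = data_set - in_x                        # broadcasting plays the role of tile()
    distances = np.sqrt((diffs ** 2).sum(axis=1))  # one distance per training row
    nearest = distances.argsort()[:k]              # indices of the k smallest distances
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: two points near (1, 1) labelled "A", two near (0, 0) labelled "B".
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]

print(knn_classify(np.array([0.0, 0.0]), group, labels, 3))  # B
print(knn_classify(np.array([1.0, 1.2]), group, labels, 3))  # A
```

With k=3, the query at the origin collects two "B" votes and one "A" vote, so "B" wins; the query at (1.0, 1.2) collects two "A" votes and one "B" vote.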
    

     Full code (kNN.py)

    from numpy import *
    import operator
    from os import listdir
    
    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        # tile repeats inX dataSetSize times along rows and once along columns,
        # so it can be subtracted from dataSet element by element
        diffMat = tile(inX, (dataSetSize,1)) - dataSet
        sqDiffMat = diffMat**2
        sqDistances = sqDiffMat.sum(axis=1)       # axis=1: sum the values within each row
        distances = sqDistances**0.5              # Euclidean distance to every training sample
        sortedDistIndicies = distances.argsort()  # indices that sort the distances in ascending order
        classCount={}
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            # look up the count stored under voteIlabel (0 if absent) and add one vote
            classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
        # sort the (label, votes) pairs by vote count, highest first
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
    
    
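    # Convert a text-encoded image into a vector
    def img2vector(filename):
        returnVect = zeros((1,1024))
        with open(filename) as fr:           # the file is closed when the block ends
            for i in range(32):
                lineStr = fr.readline()
                for j in range(32):
                    # row i, column j of the image maps to position 32*i+j
                    returnVect[0,32*i+j] = int(lineStr[j])
        return returnVect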
    
    
    # Test the algorithm
    def handwritingClassTest():
        hwLabels = []
        trainingFileList = listdir('trainingDigits')           # load the training set
        m = len(trainingFileList)
        trainingMat = zeros((m,1024))
        for i in range(m):
            fileNameStr = trainingFileList[i]
            fileStr = fileNameStr.split('.')[0]                # strip the extension
            classNumStr = int(fileStr.split('_')[0])           # the digit before '_' is the label
            hwLabels.append(classNumStr)
            trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)
        testFileList = listdir('testDigits')        # iterate over the test set
        errorCount = 0.0
        mTest = len(testFileList)
        for i in range(mTest):
            fileNameStr = testFileList[i]
            fileStr = fileNameStr.split('.')[0]
            classNumStr = int(fileStr.split('_')[0])
            vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
            classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
            print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
            if (classifierResult != classNumStr): errorCount += 1.0
        print("\nthe total number of errors is: %d" % errorCount)
        print("\nthe total error rate is: %f" % (errorCount/float(mTest)))
    

     Run:

    >>> import kNN
    >>> kNN.handwritingClassTest()
    the classifier came back with: 4, the real answer is: 4
    the classifier came back with: 4, the real answer is: 4
    .
    .
    .
    the classifier came back with: 3, the real answer is: 3
    
    the total number of errors is: 11
    
    the total error rate is: 0.011628
    
  • Original article: https://www.cnblogs.com/wanglinjie/p/11600922.html