  • Machine Learning in Action: KNN

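    KNN, also called the K-nearest neighbors algorithm, classifies a sample by measuring the distance between its feature values and those of labelled samples.

    Pros: high accuracy, insensitive to outliers, no assumptions about the input data.

    Cons: high time and space complexity.

    Works with: numeric and nominal values.

    How it works: we start with a sample data set in which every sample carries a label. Given a new, unlabelled data point, the algorithm compares each of its features with the corresponding features of the samples in the set, then extracts the class labels of the most similar samples (the nearest neighbors). In general we look only at the top K most similar samples, which is where the K in K-nearest neighbors comes from; K is usually an integer no greater than 20. Finally, the class that occurs most often among those K samples becomes the label of the new data point.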
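
    The steps above can be sketched on a toy data set. This is a minimal illustration rather than the code used in the examples that follow: the function name `knn_predict`, the four sample points, and the labels "A" and "B" are all invented for the demonstration.

```python
import numpy as np
from collections import Counter

def knn_predict(query, samples, labels, k):
    """Return the majority label among the k samples closest to `query`."""
    # Euclidean distance from the query to every labelled sample
    distances = np.sqrt(((samples - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, invented for illustration: two small clusters labelled "A" and "B"
samples = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]

print(knn_predict(np.array([0.1, 0.2]), samples, labels, k=3))  # B
print(knn_predict(np.array([0.9, 1.0]), samples, labels, k=3))  # A
```

    With k=3, the point near the origin has two "B" neighbors and one "A" neighbor, so the vote goes to "B"; the second point sits next to the "A" cluster.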

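    Example 1: Using KNN to improve matches on a dating site

    Labels: people I did not like

            people of average charm

            people of great charm

    Features: frequent flyer miles earned per year

              percentage of time spent playing video games

              liters of ice cream consumed per week

    The sample code is as follows: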

    import numpy as np
    import operator
    import matplotlib
    import matplotlib.pyplot as plt
    
    # KNN classifier
    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        # Repeat inX dataSetSize times so it lines up with dataSet row by row
        diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
        sqDiffMat = diffMat ** 2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5  # Euclidean distance to every sample
        sortedDistIndicies = distances.argsort()  # indices ordered by ascending distance
        classCount = {}
        for i in range(k):
            # Count one vote for the label of each of the k nearest samples
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

        # Sort the labels by vote count, highest first
        sortedClassCount = sorted(classCount.items(), key=lambda item: item[1], reverse=True)

        return sortedClassCount[0][0]
    
    # Load the data file into a feature matrix and a label vector
    def file2matrix(fileName):
        with open(fileName) as fr:
            arrayAllLines = fr.readlines()
        numberOfLines = len(arrayAllLines)
        returnMat = np.zeros((numberOfLines, 3))
        classLabelVector = []
        index = 0
        for line in arrayAllLines:
            line = line.strip()  # strip() returns a new string, so keep the result
            listFromLine = line.split()  # split on any whitespace (spaces or tabs)
            returnMat[index, :] = listFromLine[0:3]
            classLabelVector.append(int(listFromLine[-1]))
            index += 1
        return returnMat, classLabelVector

    # Min-max normalization: scale every feature to the range [0, 1]
    def autoNorm(dataSet):
        minVals = dataSet.min(0)
        maxVals = dataSet.max(0)
        ranges = maxVals - minVals
        row = dataSet.shape[0]
        normDataSet = dataSet - np.tile(minVals, (row, 1))
        normDataSet = normDataSet / np.tile(ranges, (row, 1))
        return normDataSet, ranges, minVals

    # Hold out the first 10% of the rows for testing and classify them against the rest
    # (this only approximates a random sample if the file is not sorted)
    def datingClassTest():
        hoRatio = 0.10
        pathName = "./datingTestSet2.txt"
        datingDataMat, datingLabels = file2matrix(pathName)
        normMat, ranges, minVals = autoNorm(datingDataMat)
        row = normMat.shape[0]
        numTestVecs = int(hoRatio * row)
        errorCount = 0.0
        for i in range(numTestVecs):
            classifierResult = classify0(normMat[i, :], normMat[numTestVecs:row, :], datingLabels[numTestVecs:row], 5)
            print("The classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
            if classifierResult != datingLabels[i]:
                errorCount += 1.0
        print("The total error rate is: %f" % (errorCount / float(numTestVecs)))

    datingClassTest()
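
    autoNorm also returns ranges and minVals, which the test above never uses. One plausible use is scaling a new, unlabelled sample with the same statistics before classifying it. The sketch below illustrates that idea with NumPy broadcasting in place of np.tile; the helper name `min_max_scale` and all the numbers are made up for the demonstration, and it assumes no feature column is constant (a zero range would cause a division by zero).

```python
import numpy as np

def min_max_scale(data):
    """Scale every column of `data` to [0, 1]; also return the ranges and minimums."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Broadcasting applies the per-column values to every row, so np.tile is not needed
    return (data - min_vals) / ranges, ranges, min_vals

# Made-up rows in the order: flyer miles, % of time gaming, liters of ice cream
raw = np.array([
    [40000.0, 8.0, 1.0],
    [14000.0, 7.0, 1.5],
    [26000.0, 1.0, 0.5],
    [75000.0, 0.0, 0.2],
])
scaled, ranges, min_vals = min_max_scale(raw)

# A new sample has to be scaled with the training statistics, not its own
new_person = np.array([30000.0, 4.0, 0.85])
new_scaled = (new_person - min_vals) / ranges
print(new_scaled)
```

    Without scaling, the flyer miles column (tens of thousands) would dominate the Euclidean distance and the other two features would barely affect the result.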

    Example 2: A handwritten digit recognition system

    import numpy as np
    import operator
    import matplotlib
    import matplotlib.pyplot as plt
    from os import listdir
    

    # KNN classifier
    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        # Repeat inX dataSetSize times so it lines up with dataSet row by row
        diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
        sqDiffMat = diffMat ** 2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5
        sortedDistIndicies = distances.argsort()  # indices ordered by ascending distance
        classCount = {}
        for i in range(k):
            voteIlabel = labels[sortedDistIndicies[i]]
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
        sortedClassCount = sorted(classCount.items(), key=lambda item: item[1], reverse=True)
        return sortedClassCount[0][0]

    # Convert a 32x32 text image into a 1x1024 vector
    def img2vector(fileName):
        returnVect = np.zeros((1, 32 * 32))
        with open(fileName) as fr:
            for i in range(32):
                lineStr = fr.readline()
                for j in range(32):
                    returnVect[0, i * 32 + j] = int(lineStr[j])
        return returnVect

    # Measure the error rate on the handwritten digit test set
    def handwritingClassTest():
        hwLabels = []
        trainingFileList = listdir('trainingDigits')
        m = len(trainingFileList)
        trainingMat = np.zeros((m, 32 * 32))
        for i in range(m):
            # The digit is the part of the file name before the underscore
            fileNameStr = trainingFileList[i]
            fileStr = fileNameStr.split('.')[0]
            classNum = int(fileStr.split('_')[0])
            hwLabels.append(classNum)
            trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
        testFileList = listdir('testDigits')
        m = len(testFileList)
        errorCount = 0.0
        for i in range(m):
            fileNameStr = testFileList[i]
            fileStr = fileNameStr.split('.')[0]
            realClassNum = int(fileStr.split('_')[0])
            testVect = img2vector('testDigits/%s' % fileNameStr)
            testClassNum = classify0(testVect, trainingMat, hwLabels, 3)
            print("The classifier came back with: %d, the real answer is: %d" % (testClassNum, realClassNum))
            if testClassNum != realClassNum:
                errorCount += 1.0
        print("The total error rate is: %f" % (errorCount / float(m)))

    handwritingClassTest()
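
    To see what img2vector and the file name parsing do without downloading the data set, the sketch below builds one synthetic 32x32 image in a temporary directory. The file name `1_0.txt` and the helper `label_from_name` are assumptions made for illustration; they follow the digit-underscore-index pattern that the parsing code above implies.

```python
import os
import tempfile
import numpy as np

def img2vector(file_name):
    """Flatten a 32x32 text image of '0'/'1' characters into a (1, 1024) vector."""
    vect = np.zeros((1, 32 * 32))
    with open(file_name) as fr:
        for i in range(32):
            line = fr.readline()
            for j in range(32):
                vect[0, i * 32 + j] = int(line[j])
    return vect

def label_from_name(file_name):
    """Parse the digit from a name such as '3_12.txt' (hypothetical helper)."""
    return int(file_name.split('.')[0].split('_')[0])

# Synthetic image: all zeros except a vertical bar in column 16
rows = ["0" * 16 + "1" + "0" * 15 for _ in range(32)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1_0.txt")
    with open(path, "w") as f:
        f.write("\n".join(rows) + "\n")
    vec = img2vector(path)
    label = label_from_name(os.path.basename(path))

print(vec.shape, int(vec.sum()), label)  # (1, 1024) 32 1
```

    Each image becomes a single row of 1024 features, so the training matrix has one row per file and classify0 can be reused unchanged.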

    The data sets and the complete Jupyter notebook can be downloaded from:

    https://github.com/qwqwqw110/machineLearningInactionCode/tree/master/KNN

  • Original post: https://www.cnblogs.com/qiang-wei/p/10711389.html