  • ML: The k-Nearest Neighbors Algorithm

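    In this section

    • The k-nearest neighbors classification algorithm
    • Parsing and importing data from a text file
    • Creating scatter plots with Matplotlib
    • Normalizing numeric values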


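    I. Overview of the k-Nearest Neighbors Algorithm

    Simply put, the k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distances between its feature values and those of known samples.

    The k-nearest neighbors algorithm

    Pros: high accuracy, insensitive to outliers, no assumptions about the input data

    Cons: computationally expensive, high memory requirements

    Works with: numeric and nominal values

    As an example, we use kNN to tell romance films from action films: given the number of fight scenes and kissing scenes in a film, is it a romance or an action film?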

    from IPython.display import Image
    Image(filename="./data/2_1.png",width=500)
    

    output_6_0.png

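    First we need to know how many fight scenes and kissing scenes the unknown film contains. In the figure, the "?" marks the unknown film plotted by its scene counts.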

    Film title                  Fight scenes  Kissing scenes  Film type
    California Man              3             104             Romance
    He’s Not Really into Dudes  2             100             Romance
    Beautiful Woman             1             81              Romance
    Kevin Longblade             101           10              Action
    Robo Slayer 3000            99            5               Action
    Amped II                    98            2               Action
    ?                           18            90              Unknown

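    Even without knowing the type of the unknown film, we can work it out. First, compute the distance between the unknown film and every film in the sample set.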

    Film title                  Distance to the unknown film
    California Man              20.5
    He’s Not Really into Dudes  18.9
    Beautiful Woman             19.2
    Kevin Longblade             115.3
    Robo Slayer 3000            117.4
    Amped II                    118.9
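
    The distance here is the ordinary Euclidean distance on the two scene counts, which is also what the code later in this section computes. As a worked example, for the unknown film (18, 90) and California Man (3, 104):

```latex
d = \sqrt{(18-3)^2 + (90-104)^2} = \sqrt{225 + 196} = \sqrt{421} \approx 20.5
```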

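    Now that we have the distance from every film in the sample set to the unknown film, we sort the films by increasing distance and pick the k nearest ones. With k=3, the three closest films are, in order, He’s Not Really into Dudes, Beautiful Woman, and California Man. kNN decides the type of the unknown film from the types of these three films. All three are romance films, so we classify the unknown film as a romance.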

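    General workflow of the k-nearest neighbors algorithm

    1. Collect data: any method may be used
    2. Prepare data: numeric values are required for the distance calculation
    3. Analyze data: any method may be used
    4. Train the algorithm: this step does not apply to kNN
    5. Test the algorithm: compute the error rate
    6. Use the algorithm: first supply sample data and structured output, then run kNN to determine which class each input belongs to, and finally apply any follow-up processing to the computed classes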


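    1. Preparation: importing data with Python

    import numpy as np
    import operator
    
    def createDataSet():
        # Each row is one film: [fight scenes, kissing scenes]
        dataset=np.array([[3,104],[2,100],[1,81],[101,10],[99,5],[98,2]])
        # One label per row: 爱情片 = romance film, 动作片 = action film
        labels=["爱情片","爱情片","爱情片","动作片","动作片","动作片"]
        return dataset,labels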
    
    dataset,labels=createDataSet()
    
    dataset
    
    array([[  3, 104],
           [  2, 100],
           [  1,  81],
           [101,  10],
           [ 99,   5],
           [ 98,   2]])
    
    labels
    
    ['爱情片', '爱情片', '爱情片', '动作片', '动作片', '动作片']
    

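    The vector labels holds the label of each data point, so the number of elements in labels equals the number of rows in the dataset matrix. In the plot below, red points are romance films and blue points are action films.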

    %matplotlib inline
    import matplotlib
    import matplotlib.pyplot as plt
    
    plt.plot([3,2,1],[104,100,81],"ro",[101,99,98],[10,5,2],"b^")
    
    [<matplotlib.lines.Line2D at 0x2075b9f8358>,
     <matplotlib.lines.Line2D at 0x2075b9f8470>]
    

    output_19_1.png


    2. Implementing the kNN classification algorithm

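    For each point in the dataset whose class is unknown, do the following:

    1. Compute the distance between the current point and every point in the dataset of known classes
    2. Sort the points by increasing distance
    3. Take the k points closest to the current point
    4. Count how often each class occurs among these k points
    5. Return the most frequent class among the k points as the predicted class of the current point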
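
    def classMovieTest(X,dataset,labels,k):
        """
        :param X: input vector to classify
        :param dataset: training samples, one row per sample
        :param labels: label vector, one label per row of dataset
        :param k: number of nearest neighbors that vote
        :return: predicted label; distances from X to every training sample
        """
        
        # Euclidean distance from X to every training sample
        datasetSize=dataset.shape[0]
        datasetMat=np.tile(X,(datasetSize,1))-dataset
        sqdatasetMat=datasetMat**2
        sqDistances=sqdatasetMat.sum(axis=1)
        distances=sqDistances**0.5
        # Indices of the training samples, nearest first
        sortDistIndicies=distances.argsort()
        classcount={}
        # Count the labels of the k nearest samples
        for i in range(k):
            voteLabel=labels[sortDistIndicies[i]]
            classcount[voteLabel]=classcount.get(voteLabel,0)+1
            
        # Sort the labels by vote count, most frequent first
        sortClasscount=sorted(classcount.items(),key=operator.itemgetter(1),reverse=True)
        return sortClasscount[0][0],distances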
    

    Predict the class of the input X=[18,90]. The result should agree with the analysis above.

    classMovieTest([18,90],dataset,labels,3)
    
    ('爱情片', array([ 20.51828453,  18.86796226,  19.23538406, 115.27792503,
            117.41379817, 118.92854998]))
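
    The same steps can also be written more compactly. The sketch below is not the author's code: knn_classify is a hypothetical name, and it swaps np.tile for NumPy broadcasting and the dictionary count for collections.Counter. It assumes dataset is a 2-D array with one row per sample.

```python
from collections import Counter

import numpy as np


def knn_classify(x, dataset, labels, k):
    """Hypothetical helper: majority vote among the k rows of dataset nearest to x."""
    dataset = np.asarray(dataset, dtype=float)
    # Broadcasting subtracts x from every row, so np.tile is not needed.
    distances = np.sqrt(((dataset - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    # Indices of the k nearest training samples.
    nearest = np.argsort(distances)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0], distances


# Same film data as above: [fight scenes, kissing scenes].
movies = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
movie_labels = ["爱情片", "爱情片", "爱情片", "动作片", "动作片", "动作片"]
label, dists = knn_classify([18, 90], movies, movie_labels, 3)
print(label, np.round(dists, 1))
```

    Broadcasting avoids building a tiled copy of the input vector, which matters little for six films but saves memory on larger training sets.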
    


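    II. Improving Matches on a Dating Site with k-Nearest Neighbors

    Three types of people:

    • People who are not liked
    • People of average charm
    • Highly charming people


    1. Preparing the data: parsing data from a text file

    The data is stored in the text file datingTestSet2.txt, one sample per line, 1000 lines in total. Each sample has the following three features:

    1. Frequent flyer miles earned per year
    2. Percentage of time spent playing video games
    3. Liters of ice cream consumed per week

    We create a function named fileTmatrix to handle the input format. It takes a file name string as input and returns the training sample matrix and the class label vector.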

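    def fileTmatrix(filename):
        """
        :param filename: path of the dataset file
        :return: training data matrix; class label vector
        """
        # Read all lines; the with-block closes the file afterwards
        with open(filename) as fr:
            arrayLines=fr.readlines()
        
        # Number of lines in the file
        numberLines=len(arrayLines)
        
        # Matrix to return: one row per sample, three features
        datasetMat=np.zeros((numberLines,3))
        classLabelVector=[]
        index=0
        
        # Parse each tab-separated line into the matrix and the label list
        for line in arrayLines:
            line=line.strip()
            listFromLine=line.split("\t")
            datasetMat[index,:]=listFromLine[0:3]
            classLabelVector.append(int(listFromLine[-1]))
            index+=1
        return datasetMat,classLabelVector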
    
    dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
    
    dataMat
    
    array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
           [1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
           [2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
           ...,
           [2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
           [4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
           [4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])
    
    dataLabels[0:20]
    
    [3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
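
    For a file in this layout, NumPy's loadtxt can do the parsing in one call. The sketch below is an alternative, not the author's function: load_dating_data is a hypothetical name, and it assumes every row holds three numeric features followed by an integer class label, separated by tabs. The two sample rows are copied from the output above.

```python
import os
import tempfile

import numpy as np


def load_dating_data(path):
    """Hypothetical alternative to fileTmatrix built on np.loadtxt."""
    # Assumed row format: feature1 <TAB> feature2 <TAB> feature3 <TAB> integer label
    data = np.loadtxt(path, delimiter="\t", ndmin=2)
    features = data[:, :3]
    labels = data[:, -1].astype(int).tolist()
    return features, labels


# Two rows in the assumed format, written to a temporary file to exercise the function.
sample = "40920\t8.326976\t0.953952\t3\n14488\t7.153469\t1.673904\t2\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
features, labels = load_dating_data(tmp.name)
os.remove(tmp.name)
print(features.shape, labels)
```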
    


    2. Analyzing the data: creating scatter plots with Matplotlib

    %matplotlib inline
    import matplotlib
    import matplotlib.pyplot as plt
    
    plt.plot(dataMat[:,1],dataMat[:,2],"bo")
    
    plt.xlabel("Percentage of Time Spent Playing Video Games")
    plt.ylabel("Liters of ice cream consumed per week")
    
    plt.show()
    

    output_36_0.png

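    Matplotlib's scatter function lets us customize the markers in the scatter plot. Here each point is sized and colored by its class label.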

    fig=plt.figure()
    ax=fig.add_subplot(111)
    ax.scatter(dataMat[:,1],dataMat[:,2],15.0*np.array(dataLabels),15.0*np.array(dataLabels))
    
    <matplotlib.collections.PathCollection at 0x2075c05ea58>
    

    output_38_1.png
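
    Using the first and second columns of dataMat (frequent flyer miles and video game time) gives a better result: the plot clearly shows three distinct class regions, with people of different interests falling into different regions.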

    fig=plt.figure()
    ax=fig.add_subplot(111)
    ax.scatter(dataMat[:,0],dataMat[:,1],15.0*np.array(dataLabels),15.0*np.array(dataLabels))
    
    <matplotlib.collections.PathCollection at 0x2075d1d50b8>
    

    output_40_1.png


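    3. Preparing the data: normalizing numeric values

    Rescale each feature from its original range to the interval from 0 to 1:

    newValue=(oldValue-min)/(max-min)

    The function Norm converts numeric feature values to the range 0 to 1

    def Norm(dataset):
        """
        :param dataset: data matrix, one row per sample
        :return: normalized data matrix; per-column range (max-min); per-column minimum
        """
        
        # Argument 0 takes the minimum/maximum down each column
        minVal=dataset.min(0)
        maxVal=dataset.max(0)
        ranges=maxVal-minVal
        m=dataset.shape[0]
        # Subtract the column minimum from every row
        normDataset=dataset-np.tile(minVal,(m,1))
        
        # Divide element-wise by the column range
        normDataset=normDataset/np.tile(ranges,(m,1))
        return normDataset,ranges,minVal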
    
    normMat,ranges,minVal=Norm(dataMat)
    
    normMat
    
    array([[0.44832535, 0.39805139, 0.56233353],
           [0.15873259, 0.34195467, 0.98724416],
           [0.28542943, 0.06892523, 0.47449629],
           ...,
           [0.29115949, 0.50910294, 0.51079493],
           [0.52711097, 0.43665451, 0.4290048 ],
           [0.47940793, 0.3768091 , 0.78571804]])
    
    ranges
    
    array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00])
    
    minVal
    
    array([0.      , 0.      , 0.001156])
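
    The same min-max formula can be sketched with broadcasting instead of np.tile. In the version below, min_max_normalize is a hypothetical name, the demo values are made up for illustration, and mapping a constant column to 0 (instead of dividing by zero) is an assumption that the original Norm does not make.

```python
import numpy as np


def min_max_normalize(dataset):
    """Hypothetical variant of Norm that relies on broadcasting."""
    dataset = np.asarray(dataset, dtype=float)
    min_val = dataset.min(axis=0)
    ranges = dataset.max(axis=0) - min_val
    # Assumption: a constant column (range 0) is mapped to 0 rather than dividing by zero.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    return (dataset - min_val) / safe_ranges, ranges, min_val


# Illustrative values only, roughly on the scale of the three dating features.
demo = np.array([[40920.0, 8.3, 0.95], [14488.0, 7.2, 1.67], [26052.0, 1.4, 0.81]])
norm_demo, demo_ranges, demo_min = min_max_normalize(demo)
print(norm_demo.min(axis=0), norm_demo.max(axis=0))
```

    Without this rescaling, the frequent flyer miles column, whose values are in the tens of thousands, would dominate the Euclidean distance.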
    


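    4. Testing the algorithm: validating the classifier as a complete program

    def classMovieTest(X,dataset,labels,k):
        """
        Same classifier as before, redefined to return only the predicted label.

        :param X: input vector to classify
        :param dataset: training samples, one row per sample
        :param labels: label vector, one label per row of dataset
        :param k: number of nearest neighbors that vote
        :return: predicted label
        """
        
        # Euclidean distance from X to every training sample
        datasetSize=dataset.shape[0]
        datasetMat=np.tile(X,(datasetSize,1))-dataset
        sqdatasetMat=datasetMat**2
        sqDistances=sqdatasetMat.sum(axis=1)
        distances=sqDistances**0.5
        # Indices of the training samples, nearest first
        sortDistIndicies=distances.argsort()
        classcount={}
        # Count the labels of the k nearest samples
        for i in range(k):
            voteLabel=labels[sortDistIndicies[i]]
            classcount[voteLabel]=classcount.get(voteLabel,0)+1
            
        # Sort the labels by vote count, most frequent first
        sortClasscount=sorted(classcount.items(),key=operator.itemgetter(1),reverse=True)
        return sortClasscount[0][0]
    
    def classTest():
        # Fraction of the data held out for testing
        haRatio=0.10
        dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
        normMat,ranges,minvals=Norm(dataMat)
        m=normMat.shape[0]
        numTestVecs=int(m*haRatio)
        errorcount=0.0
        
        # The first numTestVecs rows are test samples; the remaining rows form the training set
        for i in range(numTestVecs):
            classifierResult=classMovieTest(normMat[i,:],normMat[numTestVecs:m,:],dataLabels[numTestVecs:m],3)
            print("The classifier came back with:%d,The real answer is:%d"%(classifierResult,dataLabels[i]))
            if (classifierResult!=dataLabels[i]):
                errorcount+=1.0
        # Note: this first line prints the number of misclassified samples, not a rate
        print("The total error rate is:%d"%errorcount)
        # Error rate = misclassified samples / test samples
        print("The total error rate is:%f"%(errorcount/numTestVecs))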
    
    classTest()
    
    The classifier came back with:3,The real answer is:3
    The classifier came back with:2,The real answer is:2
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:3,The real answer is:3
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:2,The real answer is:2
    The classifier came back with:1,The real answer is:1
    The classifier came back with:3,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:2,The real answer is:2
    The classifier came back with:1,The real answer is:1
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:3,The real answer is:3
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:2,The real answer is:2
    The classifier came back with:1,The real answer is:1
    The classifier came back with:3,The real answer is:3
    The classifier came back with:2,The real answer is:2
    The classifier came back with:2,The real answer is:2
    The classifier came back with:2,The real answer is:2
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:2,The real answer is:2
    The classifier came back with:2,The real answer is:2
    The classifier came back with:2,The real answer is:2
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:2,The real answer is:2
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:1
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:3,The real answer is:3
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:3,The real answer is:3
    The classifier came back with:3,The real answer is:1
    The classifier came back with:3,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:2,The real answer is:2
    The classifier came back with:1,The real answer is:1
    The classifier came back with:1,The real answer is:1
    The classifier came back with:3,The real answer is:3
    The classifier came back with:2,The real answer is:3
    The classifier came back with:1,The real answer is:1
    The classifier came back with:2,The real answer is:2
    The classifier came back with:1,The real answer is:1
    The classifier came back with:3,The real answer is:3
    The classifier came back with:3,The real answer is:3
    The classifier came back with:2,The real answer is:2
    The classifier came back with:1,The real answer is:1
    The classifier came back with:3,The real answer is:1
    The total error rate is:5
    The total error rate is:0.050000
    

    What if we use the entire dataset as the training set? Does that improve accuracy?

    def classTest2():
        dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
        normMat,ranges,minvals=Norm(dataMat)
        m=normMat.shape[0]
        errorcount=0.0
        
        for i in range(m):
            classifierResult2=classMovieTest(normMat[i,:],normMat[:,:],dataLabels[:],3)
            
            if (classifierResult2!=dataLabels[i]):
                errorcount+=1.0
        print("The total error rate:",(errorcount/m))
    
    classTest2()
    
    The total error rate: 0.027
    

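    The measured error rate drops from 5% to 2.7%. The two numbers are not directly comparable, though: here every test sample is also in the training set, so it is always its own nearest neighbor (distance 0). That makes 2.7% an optimistic estimate rather than evidence of a better classifier.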
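
    One way to use every sample without letting a sample vote for itself is leave-one-out evaluation. The sketch below is an illustration, not part of the original notebook: leave_one_out_error is a hypothetical name, the toy data is made up, and ties are resolved toward the smallest label, which differs from the tie behavior of classMovieTest.

```python
import numpy as np


def leave_one_out_error(features, labels, k=3):
    """Hypothetical helper: kNN error rate where a sample never votes for itself."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (m, m).
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Exclude each sample from its own neighbor list.
    np.fill_diagonal(dists, np.inf)
    errors = 0
    for i in range(len(labels)):
        nearest = np.argsort(dists[i])[:k]
        values, counts = np.unique(labels[nearest], return_counts=True)
        # Majority vote; on a tie np.argmax picks the smallest label.
        if values[np.argmax(counts)] != labels[i]:
            errors += 1
    return errors / len(labels)


# Toy data: two well-separated clusters of three points each.
toy_x = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
toy_y = [1, 1, 1, 2, 2, 2]
print(leave_one_out_error(toy_x, toy_y, k=3))
```

    On the dating data this would be called with normMat and dataLabels. The pairwise distance array has m*m*3 entries, which is manageable for 1000 samples but grows quickly with larger sets.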


    5. Using the algorithm: building a complete, usable system

    def classifyPerson():
        resultList=["not at all","in small doses","in large doses"]
        percentTats=float(input("Percentage of time spent playing video games:"))
        ffMiles=float(input("Frequent flier miles earned per year:"))
        iceCream=float(input("Liters of ice cream consumed per week:"))
        datingDataMat,datingLabels=fileTmatrix("./data/datingTestSet2.txt")
        normMat,ranges,minvals=Norm(datingDataMat)
        # Feature order must match the columns of the data file
        inArr=np.array([ffMiles,percentTats,iceCream])
        # Normalize the input with the training set's minimum and range before classifying
        classifierResult=classMovieTest((inArr-minvals)/ranges,normMat,datingLabels,3)
        print("You will probably like this person:",resultList[classifierResult-1])
    
    classifyPerson()
    
    Percentage of time spent playing video games: 10
    Frequent flier miles earned per year: 10000
    Liters of ice cream consumed per week: 0.5
    
    
    You will probably like this person: in small doses
    


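    III. Handwriting Recognition System

    We build a system that recognizes the digits 0 to 9. The images have already been processed to share the same color and size: black-and-white images 32 pixels wide by 32 pixels high.


    1. Preparing the data: converting images into test vectors

    The directory trainingDigits contains about 2000 examples, roughly 200 per digit. The directory testDigits contains about 900 test samples.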

    from IPython.display import Image
    
    Image(filename="./data/2_2.png",width=500)
    

    output_64_0.png

    Image(filename="./data/2_3.png",width=500)
    

    output_65_0.png

    Image(filename="./data/2_4.png",width=500)
    

    output_66_0.png

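    We convert each 32×32 binary image matrix into a 1×1024 vector. First we write a function, imgTvector, that converts an image into a vector.

    def imgTvector(filename):
        # Flatten a 32x32 text image of 0/1 characters into a 1x1024 vector
        returnVect=np.zeros((1,1024))
        # The with-block closes the file afterwards
        with open(filename) as fr:
            for i in range(32):
                lineStr=fr.readline()
                for j in range(32):
                    # Pixel (row i, column j) goes to position 32*i+j
                    returnVect[0,32*i+j]=int(lineStr[j])
        return returnVect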
    
    testVector=imgTvector("./data/digits/testDigits/0_13.txt")
    
    testVector[0,0:31]
    
    array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
           1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
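
    The row-by-row flattening can also be written without explicit index arithmetic. The sketch below is a variant under assumptions: img_to_vector is a hypothetical name, the size parameter is an addition, and the test image is made up (all zeros except one pixel) so that the 32*i+j mapping can be checked.

```python
import os
import tempfile

import numpy as np


def img_to_vector(path, size=32):
    """Hypothetical variant of imgTvector: read a size x size text image of 0/1 digits."""
    with open(path) as fr:
        rows = [fr.readline()[:size] for _ in range(size)]
    # One digit character per pixel, flattened row by row into shape (1, size*size).
    pixels = [int(ch) for row in rows for ch in row]
    return np.array(pixels, dtype=float).reshape(1, size * size)


# Made-up 32x32 image: all zeros except a single 1 at row 2, column 5.
lines = ["0" * 32 for _ in range(32)]
lines[2] = "0" * 5 + "1" + "0" * 26
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(lines) + "\n")
vec = img_to_vector(tmp.name)
os.remove(tmp.name)
print(vec.shape, int(vec[0, 32 * 2 + 5]))
```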
    


    2. Testing the algorithm: recognizing handwritten digits with kNN

    from os import listdir
    
    def handwritingClassTest():
        hwLabels=[]
        # Build the training matrix: one 1024-element row per training file
        trainingFileList=listdir("./data/digits/trainingDigits/")
        m=len(trainingFileList)
        trainingMat=np.zeros((m,1024))
        for i in range(m):
            fileNameStr=trainingFileList[i]
            # File names look like 0_13.txt; the digit before "_" is the label
            fileStr=fileNameStr.split(".")[0]
            classNumStr=int(fileStr.split("_")[0])
            hwLabels.append(classNumStr)
            trainingMat[i,:]=imgTvector("./data/digits/trainingDigits/%s"%fileNameStr)
        # Classify every test file and count the mistakes
        testFileList=listdir("./data/digits/testDigits/")
        errorCount=0.0
        mTest=len(testFileList)
        for i in range(mTest):
            fileNameStr=testFileList[i]
            fileStr=fileNameStr.split(".")[0]
            classNumStr=int(fileStr.split("_")[0])
            vectorUnderTest=imgTvector("./data/digits/testDigits/%s"%fileNameStr)
            classifierResult=classMovieTest(vectorUnderTest,trainingMat,hwLabels,3)
            print("The classifier came back with:%d,The real answer is:%d"%(classifierResult,classNumStr))
            if (classifierResult!=classNumStr):
                errorCount+=1.0
        print("The total number of errors is:%d"%errorCount)
        print("The total error rate is:%f"%(errorCount/float(mTest)))
    
    handwritingClassTest()
    
    The classifier came back with:0,The real answer is:0
    The classifier came back with:0,The real answer is:0
    The classifier came back with:0,The real answer is:0
    The classifier came back with:0,The real answer is:0
    The classifier came back with:0,The real answer is:0
    The classifier came back with:0,The real answer is:0
    .
    .
    .
    The classifier came back with:9,The real answer is:9
    The classifier came back with:9,The real answer is:9
    The classifier came back with:9,The real answer is:9
    The total number of errors is:10
    The total error rate is:0.010571
    

    On the handwritten digits dataset, the k-nearest neighbors classifier has an error rate of about 1.1%.
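
    The true label of each sample comes from its file name, for example 0_13.txt is an image of the digit 0. A small helper makes that parsing explicit. This is a sketch: label_from_filename is a hypothetical name, and it assumes every file follows the digit_index.txt naming seen above.

```python
import os


def label_from_filename(file_name):
    """Hypothetical helper: '0_13.txt' -> 0 (the digit before the underscore)."""
    # Drop any directory part and the extension, then take the text before "_".
    stem = os.path.splitext(os.path.basename(file_name))[0]
    return int(stem.split("_")[0])


print(label_from_filename("0_13.txt"))
print(label_from_filename("./data/digits/testDigits/0_13.txt"))
```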

  • Original article: https://www.cnblogs.com/LQ6H/p/12940582.html