  • [Data Mining] Classification with kNN (repost)

    1. Algorithm overview

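    The idea behind kNN is simple: compute the distance from the point to be classified to every sample in the training set and take the k nearest samples; count how many of those k samples belong to each class; then, by majority vote, assign the most frequent class to the point. The distance can be measured with Euclidean distance, Manhattan distance, or cosine distance.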
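
    The three steps above (distances, k nearest, majority vote) can be sketched in plain Python as follows. This is only an illustrative sketch, not the code from [1]: the helper names `euclidean`, `manhattan`, `cosine_distance` and `knn_predict`, as well as the toy data, are assumptions made for this example, and ties in the vote are not handled specially.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(k, train_points, train_labels, query, metric=euclidean):
    """Classify one query point by majority vote among its k nearest samples."""
    # Step 1: distance from the query to every training sample.
    distances = [(metric(query, p), i) for i, p in enumerate(train_points)]
    # Step 2: keep the k closest samples.
    nearest = sorted(distances)[:k]
    # Step 3: count the labels of those samples and return the most frequent.
    votes = Counter(train_labels[i] for _, i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in the plane.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
              (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(3, toy_points, toy_labels, (1.1, 1.0)))                    # 0
print(knn_predict(3, toy_points, toy_labels, (4.9, 5.0), metric=manhattan))  # 1
```

    Passing the metric as a parameter keeps the voting logic unchanged when switching between the three distance measures.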

    The code below uses the Iris dataset as a test case and is adapted from [1].

    import numpy as np  
    import scipy.spatial.distance as ssd  
      
    def read_data(fn):  
        """ read dataset and separate into characteristics data 
            and label data 
        """  
       
        # read dataset file  
        with open(fn) as f:  
            raw_data = np.loadtxt(f, delimiter= ',', dtype="float",   
                skiprows=1, usecols=None)  
      
        # initialize
        charac=[]; label=[]
        # obtain input characteristics and label
        for row in raw_data:  
            charac.append(row[:-1])  
            label.append(int (row[-1]))  
        return np.array(charac),np.array(label)  
      
    def knn(k,dtrain,dtest,dtr_label):  
        """k-nearest neighbors algorithm"""  
      
        pred_label=[]  
        # for each instance in the test set, compute its distance
        # to every instance in the training set
        for di in dtest:  
            distances=[]  
            for ij,dj in enumerate(dtrain):  
                distances.append((ssd.euclidean(di,dj),ij))  
      
            #sort the distances to get k-neighbors  
            k_nn=sorted(distances)[:k]  
      
            # classify according to the most frequent label among the k neighbors
            dlabel=[]  
            for dis,idtr in k_nn:  
                dlabel.append(dtr_label[idtr])  
            pred_label.append(np.argmax(np.bincount(dlabel)))  
      
        return pred_label  
      
    def evaluate(result):
        """count correct (difference == 0) and wrong (difference != 0) predictions"""
      
        eval_result=np.zeros(2,int)  
        for x in result:  
            #pred_label==dte_label  
            if x==0:  
                eval_result[0]+=1  
            #pred_label!=dte_label  
            else:  
                eval_result[1]+=1  
      
        return eval_result  
      
      
    dtrain,dtr_label=read_data('iris-train.csv')  
    dtest,dte_label=read_data('iris-test.csv')  
      
    K=[1,3,7,11]  
      
    print("knn classification result for iris data set:\n")
    print("k    | number of correct/wrong classified test records")

    for k in K:
        pred_label=knn(k,dtrain,dtest,dtr_label)
        # the difference is 0 wherever the prediction matches the true label
        eval_result=evaluate(np.array(pred_label)-dte_label)

        # print the evaluation result to the screen
        print(k, "   | ", eval_result[0], "/", eval_result[1])

    print()
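
    The correct/wrong counts printed for each k can also be summarized as a single accuracy value. A minimal sketch, assuming a hypothetical helper named `accuracy`; the labels below are made up to show the call and are not results from the Iris run.

```python
def accuracy(pred_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    if len(pred_labels) != len(true_labels):
        raise ValueError("label sequences must have the same length")
    correct = sum(1 for p, t in zip(pred_labels, true_labels) if p == t)
    return correct / len(true_labels)


# Hypothetical labels, only to illustrate usage.
example_pred = [0, 1, 2, 2, 1]
example_true = [0, 1, 2, 1, 1]
print("accuracy: {:.2f}".format(accuracy(example_pred, example_true)))  # accuracy: 0.80
```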

    2. References

    [1] M. Saad Nurul Ishlah, Python: Simple K Nearest Neighbours Classifier.

  • Original post: https://www.cnblogs.com/Vae1990Silence/p/7326467.html