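Pseudocode of the k-nearest neighbors (kNN) algorithm:
For each point in the dataset whose class is unknown, perform the following steps in turn:
(1) Compute the distance between every point in the labeled dataset and the current point;
(2) Sort the points in order of increasing distance;
(3) Select the k points closest to the current point;
(4) Count how often each class occurs among these k points;
(5) Return the most frequent class among the k points as the predicted class of the current point.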
Python implementation:
'''
Created on Sep 16, 2010
kNN: k Nearest Neighbors

Input:      inX: vector to compare to existing dataset (1xN)
            dataSet: size m data set of known vectors (NxM)
            labels: data set labels (1xM vector)
            k: number of neighbors to use for comparison (should be an odd number)

Output:     the most popular class label

@author: pbharrin
'''
import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                   # number of rows (samples) in the training set dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # repeat inX so its shape matches dataSet, then subtract
    sqDiffMat = diffMat**2                           # element-wise squared differences
    sqDistances = sqDiffMat.sum(axis=1)              # sum the squares within each row (one total per sample)
    distances = sqDistances**0.5                     # square root gives the Euclidean distance
    sortedDistIndicies = distances.argsort()         # sample indices ordered by increasing distance
    classCount = {}
    for i in range(k):                               # tally the labels of the k nearest points
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the classes by vote count, highest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
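To see the five pseudocode steps run end to end on a small example, here is a minimal sketch in plain Python 3 that needs no NumPy. The function name `knn_classify` and the four-point toy dataset are assumptions made for illustration only; they are not part of the original listing.

```python
import math
from collections import Counter


def knn_classify(point, samples, labels, k):
    """Predict the label of `point` by majority vote among its k nearest samples."""
    # (1) Euclidean distance from every known sample to the query point.
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(sample, point)))
        for sample in samples
    ]
    # (2) Indices of the samples, ordered by increasing distance.
    order = sorted(range(len(samples)), key=lambda i: distances[i])
    # (3) + (4) Count the labels of the k closest samples.
    votes = Counter(labels[i] for i in order[:k])
    # (5) The most frequent label is the prediction.
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points near (1, 1) labelled 'A', two near (0, 0) labelled 'B'.
samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']

print(knn_classify((0.0, 0.2), samples, labels, k=3))  # B
print(knn_classify((0.9, 1.0), samples, labels, k=3))  # A
```

With k=3 the query (0.0, 0.2) has two 'B' neighbors and one 'A' neighbor, so 'B' wins the vote. For a two-class problem, an odd k avoids tied votes, which is why the docstring recommends one.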