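ML: k-Nearest Neighbors Algorithm
This section covers:
- The k-nearest neighbors classification algorithm
- Parsing and importing data from text files
- Creating scatter plots with Matplotlib
- Normalizing numeric values
I. Overview of the k-Nearest Neighbors Algorithm
Put simply, the k-nearest neighbors (kNN) algorithm classifies a sample by measuring the distances between its feature values and those of labeled samples.
k-nearest neighbors
Pros: high accuracy, insensitive to outliers, no assumptions about the input data
Cons: computationally expensive, requires a lot of memory
Works with: numeric and nominal values
As an example, we use kNN to tell romance movies from action movies: given the number of fight scenes and kissing scenes in a movie, is it a romance or an action movie?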
from IPython.display import Image
Image(filename="./data/2_1.png",width=500)
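First we need to know how many fight scenes and kissing scenes the unknown movie contains. In the figure, the "?" marks the unknown movie, plotted by its scene counts.
Movie title | Fight scenes | Kissing scenes | Movie type |
---|---|---|---|
California Man | 3 | 104 | Romance |
He’s Not Really into Dudes | 2 | 100 | Romance |
Beautiful Woman | 1 | 81 | Romance |
Kevin Longblade | 101 | 10 | Action |
Robo Slayer 3000 | 99 | 5 | Action |
Amped II | 98 | 2 | Action |
? | 18 | 90 | Unknown |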
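Even without knowing the unknown movie's type, we can work it out. First, compute the distance between the unknown movie and every movie in the sample set.
Movie title | Distance to the unknown movie |
---|---|
California Man | 20.5 |
He’s Not Really into Dudes | 18.9 |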
Beautiful Woman | 19.2 |
Kevin Longblade | 115.3 |
Robo Slayer 3000 | 117.4 |
Amped II | 118.9 |
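
The distances in this table are plain Euclidean distances in the (fight scenes, kissing scenes) plane. A minimal sketch that reproduces them with the standard library only; the names `movies`, `unknown` and `euclidean` are illustrative and not part of the original notebook:

```python
import math

# (fight scenes, kissing scenes) for each labeled movie, taken from the table above
movies = {
    "California Man": (3, 104),
    "He's Not Really into Dudes": (2, 100),
    "Beautiful Woman": (1, 81),
    "Kevin Longblade": (101, 10),
    "Robo Slayer 3000": (99, 5),
    "Amped II": (98, 2),
}
unknown = (18, 90)


def euclidean(p, q):
    """Square root of the summed squared differences between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


distances = {name: euclidean(unknown, point) for name, point in movies.items()}
for name, d in distances.items():
    print(f"{name}: {d:.1f}")
```

For example, the distance to California Man is sqrt((18-3)^2 + (90-104)^2) = sqrt(421), about 20.5.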
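Now that we have the distance from every movie in the sample set to the unknown movie, we sort by increasing distance and take the k nearest movies. With k=3, the three closest movies are, in order, He’s Not Really into Dudes, Beautiful Woman, and California Man.
kNN assigns the unknown movie the majority type among these three movies. All three are romance movies, so we classify the unknown movie as a romance.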
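The sort-then-vote step can be sketched on its own. This is a minimal illustration, assuming the distances are already known (here rounded to one decimal); `neighbors` and `majority_vote` are hypothetical names, and English genre strings are used for readability:

```python
from collections import Counter

# (distance to the unknown movie, movie type), one entry per labeled movie
neighbors = [
    (20.5, "romance"),
    (18.9, "romance"),
    (19.2, "romance"),
    (115.3, "action"),
    (117.4, "action"),
    (118.9, "action"),
]


def majority_vote(neighbors, k):
    """Return the most common type among the k entries with the smallest distance."""
    nearest = sorted(neighbors, key=lambda pair: pair[0])[:k]
    votes = Counter(kind for _, kind in nearest)
    return votes.most_common(1)[0][0]


print(majority_vote(neighbors, 3))  # romance
```

With an even k the vote can tie; an odd k, or some explicit tie-breaking rule, avoids that ambiguity in a two-class problem.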
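General workflow of the k-nearest neighbors algorithm
- Collect data: any method will do
- Prepare data: numeric values are needed for the distance calculation
- Analyze data: any method will do
- Train the algorithm: this step does not apply to kNN
- Test the algorithm: compute the error rate
- Use the algorithm: first supply the sample data and the structured output, then run kNN to decide which class each input belongs to, and finally apply any follow-up processing to the computed class
1. Preparation: importing data with Python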
import numpy as np
import operator
def createDataSet():
    # Each row is [fight scenes, kissing scenes]
    dataset=np.array([[3,104],[2,100],[1,81],[101,10],[99,5],[98,2]])
    # 爱情片 = romance, 动作片 = action
    labels=["爱情片","爱情片","爱情片","动作片","动作片","动作片"]
    return dataset,labels
dataset,labels=createDataSet()
dataset
array([[ 3, 104],
[ 2, 100],
[ 1, 81],
[101, 10],
[ 99, 5],
[ 98, 2]])
labels
['爱情片', '爱情片', '爱情片', '动作片', '动作片', '动作片']
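The vector labels holds the label of each data point; it has as many elements as the dataset matrix has rows. In the plot below, the red circles are romance movies and the blue triangles are action movies.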
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.plot([3,2,1],[104,100,81],"ro",[101,99,98],[10,5,2],"b^")
[<matplotlib.lines.Line2D at 0x2075b9f8358>,
<matplotlib.lines.Line2D at 0x2075b9f8470>]
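2. Implementing the kNN classification algorithm
For each point in the dataset whose class is unknown, do the following:
- Compute the distance between every point in the labeled dataset and the current point
- Sort the distances in increasing order
- Take the k points closest to the current point
- Count how often each class occurs among those k points
- Return the most frequent class among the k points as the predicted class for the current point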
def classMovieTest(X,dataset,labels,k):
    """
    :param X: input vector to classify
    :param dataset: training sample set
    :param labels: label vector
    :param k: number of nearest neighbors to use
    :return: predicted class label; distances to the known samples
    """
    # Euclidean distance from X to every training sample
    datasetSize=dataset.shape[0]
    datasetMat=np.tile(X,(datasetSize,1))-dataset
    sqdatasetMat=datasetMat**2
    sqDistances=sqdatasetMat.sum(axis=1)
    distances=sqDistances**0.5
    # Indices that sort the samples by increasing distance
    sortDistIndicies=distances.argsort()
    classcount={}
    for i in range(k):
        # Label of the i-th nearest sample
        voteLabel=labels[sortDistIndicies[i]]
        classcount[voteLabel]=classcount.get(voteLabel,0)+1
    # Sort the classes by vote count, most frequent first
    sortClasscount=sorted(classcount.items(),key=operator.itemgetter(1),reverse=True)
    return sortClasscount[0][0],distances
Predict the class of the input X=[18,90]; the result should agree with the analysis above.
classMovieTest([18,90],dataset,labels,3)
('爱情片', array([ 20.51828453, 18.86796226, 19.23538406, 115.27792503,
117.41379817, 118.92854998]))
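The `np.tile` call in `classMovieTest` builds an explicit copy of the query for every training row. NumPy broadcasting can do the same subtraction without that copy. The sketch below is one possible variant, not the notebook's own function: `knn_classify` is a hypothetical name, and English label strings stand in for the original ones.

```python
from collections import Counter

import numpy as np


def knn_classify(x, dataset, labels, k):
    """Classify x by majority vote among its k nearest rows of dataset."""
    # Broadcasting: an (n, d) array minus a (d,) vector subtracts row by row
    diffs = dataset - np.asarray(x)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0], distances


dataset = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
labels = ["romance"] * 3 + ["action"] * 3
label, distances = knn_classify([18, 90], dataset, labels, 3)
print(label, distances.round(1))
```

The distances match the output above; only the way the difference matrix is formed changes.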
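II. Using kNN to improve matches on a dating site
Three types of people:
- People I don't like
- People of average charm
- Very charming people
1. Preparing data: parsing data from a text file
The data is stored in the text file datingTestSet2.txt, one sample per line, 1000 lines in total. Each sample has the following 3 features:
- Number of frequent flyer miles earned per year
- Percentage of time spent playing video games
- Liters of ice cream consumed per week
We create a function named fileTmatrix to handle the input format. It takes the file name as a string and returns the training sample matrix and the class label vector.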
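def fileTmatrix(filename):
    """
    :param filename: name of the data file
    :return: training data matrix; class label vector
    """
    # Read all lines and close the file when done
    with open(filename) as fr:
        arrayLines=fr.readlines()
    # Number of lines in the file
    numberLines=len(arrayLines)
    # Matrix to return: one row per sample, three feature columns
    datasetMat=np.zeros((numberLines,3))
    classLabelVector=[]
    index=0
    # Parse each line into the matrix and the label list
    for line in arrayLines:
        line=line.strip()
        # Split on any whitespace, so tab- and space-separated files both work
        listFromLine=line.split()
        datasetMat[index,:]=listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index+=1
    return datasetMat,classLabelVector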
dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
dataMat
array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
[1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
[2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
...,
[2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
[4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
[4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])
dataLabels[0:20]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
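To see the parsing logic without the data file, here is a small sketch that parses three lines held in a string. The values are the first three rows and labels shown in the output above; the tab separator is an assumption about the file layout, and `parse_dating_lines` is a hypothetical helper, not part of the notebook.

```python
# Three sample lines: miles, game-time percentage, ice cream liters, class label.
# The tab separator is an assumption about how datingTestSet2.txt is laid out.
SAMPLE = (
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)


def parse_dating_lines(lines):
    """Split each line into three float features and one integer label."""
    features, labels = [], []
    for line in lines:
        fields = line.split()  # any whitespace: tabs or spaces
        if not fields:
            continue  # skip blank lines
        features.append([float(v) for v in fields[:3]])
        labels.append(int(fields[-1]))
    return features, labels


features, labels = parse_dating_lines(SAMPLE.splitlines())
print(features[0], labels)
```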
2. Analyzing data: creating scatter plots with Matplotlib
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.plot(dataMat[:,1],dataMat[:,2],"bo")
plt.xlabel("Percentage of Time Spent Playing Video Games")
plt.ylabel("Liters of ice cream consumed per week")
plt.show()
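The scatter function provided by Matplotlib supports customizing the markers of a scatter plot. Here the class labels set both the size and the color of each point.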
fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(dataMat[:,1],dataMat[:,2],15.0*np.array(dataLabels),15.0*np.array(dataLabels))
<matplotlib.collections.PathCollection at 0x2075c05ea58>
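Using the first and second columns of the data matrix dataMat instead gives a better result: the plot clearly shows three distinct regions for the sample classes, and people with different interests fall in different class regions.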
fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(dataMat[:,0],dataMat[:,1],15.0*np.array(dataLabels),15.0*np.array(dataLabels))
<matplotlib.collections.PathCollection at 0x2075d1d50b8>
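3. Preparing data: normalizing numeric values
Because the distance sums squared differences, a feature with a large numeric range (such as frequent flyer miles) would otherwise dominate the result. We therefore rescale every feature to values between 0 and 1: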
newValue=(oldValue-min)/(max-min)
The function Norm rescales numeric feature values to the range 0 to 1.
def Norm(dataset):
    """
    :param dataset: data set
    :return: normalized data set; range (max - min) per feature; minimum per feature
    """
    # Argument 0 takes the minimum/maximum down each column
    minVal=dataset.min(0)
    maxVal=dataset.max(0)
    ranges=maxVal-minVal
    m=dataset.shape[0]
    # Subtract the column minimum from every row
    normDataset=dataset-np.tile(minVal,(m,1))
    # Element-wise division by the column range
    normDataset=normDataset/np.tile(ranges,(m,1))
    return normDataset,ranges,minVal
normMat,ranges,minVal=Norm(dataMat)
normMat
array([[0.44832535, 0.39805139, 0.56233353],
[0.15873259, 0.34195467, 0.98724416],
[0.28542943, 0.06892523, 0.47449629],
...,
[0.29115949, 0.50910294, 0.51079493],
[0.52711097, 0.43665451, 0.4290048 ],
[0.47940793, 0.3768091 , 0.78571804]])
ranges
array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00])
minVal
array([0. , 0. , 0.001156])
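As a quick check of the formula, the first row of dataMat can be rescaled by hand with the minVal and ranges values printed above. This is a minimal sketch with the standard library only; `min_max_scale` is a hypothetical helper name.

```python
def min_max_scale(value, min_val, max_val):
    """newValue = (oldValue - min) / (max - min)"""
    return (value - min_val) / (max_val - min_val)


# Per-feature minimum and range, copied from the minVal and ranges output
min_vals = [0.0, 0.0, 0.001156]
ranges = [91273.0, 20.919349, 1.694361]
# First row of dataMat: miles, game-time percentage, ice cream liters
first_row = [40920.0, 8.326976, 0.953952]

scaled = [
    min_max_scale(v, lo, lo + r)
    for v, lo, r in zip(first_row, min_vals, ranges)
]
print([round(s, 6) for s in scaled])
```

The result agrees with the first row of normMat shown above (about 0.448, 0.398 and 0.562).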
4. Testing the algorithm: validating the classifier as a complete program
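def classMovieTest(X,dataset,labels,k):
    """
    :param X: input vector to classify
    :param dataset: training sample set
    :param labels: label vector
    :param k: number of nearest neighbors to use
    :return: predicted class label
    """
    # Same classifier as before, redefined to return only the label
    # Euclidean distance from X to every training sample
    datasetSize=dataset.shape[0]
    datasetMat=np.tile(X,(datasetSize,1))-dataset
    sqdatasetMat=datasetMat**2
    sqDistances=sqdatasetMat.sum(axis=1)
    distances=sqDistances**0.5
    # Indices that sort the samples by increasing distance
    sortDistIndicies=distances.argsort()
    classcount={}
    for i in range(k):
        # Label of the i-th nearest sample
        voteLabel=labels[sortDistIndicies[i]]
        classcount[voteLabel]=classcount.get(voteLabel,0)+1
    # Sort the classes by vote count, most frequent first
    sortClasscount=sorted(classcount.items(),key=operator.itemgetter(1),reverse=True)
    return sortClasscount[0][0]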
def classTest():
    # Fraction of the data held out for testing
    haRatio=0.10
    dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
    normMat,ranges,minvals=Norm(dataMat)
    m=normMat.shape[0]
    numTestVecs=int(m*haRatio)
    errorcount=0.0
    for i in range(numTestVecs):
        # Test on the first numTestVecs rows, train on the remaining rows
        classifierResult=classMovieTest(normMat[i,:],normMat[numTestVecs:m,:],dataLabels[numTestVecs:m],3)
        print("The classifier came back with:%d,The real answer is:%d"%(classifierResult,dataLabels[i]))
        if (classifierResult!=dataLabels[i]):
            errorcount+=1.0
    # The first line prints the number of errors, the second the error rate
    print("The total error rate is:%d"%errorcount)
    print("The total error rate is:%f"%(errorcount/numTestVecs))
classTest()
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:3
The classifier came back with:1,The real answer is:1
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:3
The classifier came back with:3,The real answer is:3
The classifier came back with:2,The real answer is:2
The classifier came back with:1,The real answer is:1
The classifier came back with:3,The real answer is:1
The total error rate is:5
The total error rate is:0.050000
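The hold-out procedure in classTest (test on the first fraction of the rows, train on the rest) can be sketched without the data file. Everything below is illustrative: `knn_predict` and `holdout_error` are hypothetical names, and the ten toy points are made up for the example.

```python
import math
from collections import Counter


def knn_predict(query, samples, labels, k):
    """Majority label among the k samples closest to query."""
    def dist(i):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, samples[i])))
    order = sorted(range(len(samples)), key=dist)
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def holdout_error(samples, labels, ratio, k):
    """Test on the first ratio of the rows, train on the remaining rows."""
    n_test = int(len(samples) * ratio)
    train_x, train_y = samples[n_test:], labels[n_test:]
    errors = sum(
        knn_predict(samples[i], train_x, train_y, k) != labels[i]
        for i in range(n_test)
    )
    return errors / n_test


# Toy data: class 1 clusters near (0, 0), class 2 near (1, 1)
samples = [(0.1, 0.2), (0.9, 0.8), (0.0, 0.1), (1.0, 0.9), (0.2, 0.0),
           (0.8, 1.0), (0.1, 0.1), (0.9, 0.9), (0.2, 0.2), (0.8, 0.8)]
labels = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
print(holdout_error(samples, labels, 0.2, 3))  # 0.0
```

Taking the first rows as the test set only gives a fair estimate if the file is not sorted by class; shuffling the rows first is a common precaution.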
What if we use the entire data set as the training set? Does that improve accuracy?
def classTest2():
    dataMat,dataLabels=fileTmatrix("./data/datingTestSet2.txt")
    normMat,ranges,minvals=Norm(dataMat)
    m=normMat.shape[0]
    errorcount=0.0
    for i in range(m):
        # Every sample is classified against the full data set, itself included
        classifierResult2=classMovieTest(normMat[i,:],normMat[:,:],dataLabels[:],3)
        if (classifierResult2!=dataLabels[i]):
            errorcount+=1.0
    print("The total error rate:",(errorcount/m))
classTest2()
The total error rate: 0.027
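The measured error rate drops from 5% to 2.7%. Note, however, that every test sample is now part of the training set, so it casts one of the three votes for its own label. The 2.7% figure is therefore optimistic and not directly comparable with the 5% hold-out error.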
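One way to reuse all samples without letting a point vote for itself is leave-one-out evaluation: each sample is classified against all the others. The sketch below contrasts the two on made-up toy data, with k=1 to make the effect stark; all function names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, samples, labels, k):
    """Majority label among the k samples closest to query."""
    def dist(i):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, samples[i])))
    order = sorted(range(len(samples)), key=dist)
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def resubstitution_error(samples, labels, k):
    """Classify every sample against the full set, itself included."""
    errors = sum(
        knn_predict(x, samples, labels, k) != y for x, y in zip(samples, labels)
    )
    return errors / len(samples)


def leave_one_out_error(samples, labels, k):
    """Classify every sample against all the other samples."""
    errors = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        errors += knn_predict(x, rest_x, rest_y, k) != y
    return errors / len(samples)


# Toy data: the last point is a "B" sitting inside the "A" cluster
samples = [(0, 0), (10, 0), (20, 0), (100, 0), (110, 0), (15, 0)]
labels = ["A", "A", "A", "B", "B", "B"]
print(resubstitution_error(samples, labels, 1))  # 0.0
print(leave_one_out_error(samples, labels, 1))   # 0.5
```

With k=1 and no duplicate points, the first figure is always zero because each point is its own nearest neighbor, which is why it says little about performance on unseen data.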
5. Using the algorithm: building a complete, usable system
def classifyPerson():
    resultList=["not at all","in small doses","in large doses"]
    percentTats=float(input("Percentage of time spent playing video games:"))
    ffMiles=float(input("Frequent flier miles earned per year:"))
    iceCream=float(input("liters of ice cream consumed per year:"))
    datingDataMat,datingLabels=fileTmatrix("./data/datingTestSet2.txt")
    normMat,ranges,minvals=Norm(datingDataMat)
    # Feature order must match the data file columns: miles, game time, ice cream
    inArr=np.array([ffMiles,percentTats,iceCream])
    # Normalize the input with the training minimum and range before classifying
    classifierResult=classMovieTest((inArr-minvals)/ranges,normMat,datingLabels,3)
    print("You will probably like thie person:",resultList[classifierResult-1])
classifyPerson()
Percentage of time spent playing video games: 10
Frequent flier miles earned per year: 10000
liters of ice cream consumed per year: 0.5
You will probably like thie person: in small doses
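III. Handwriting recognition system
We build a system that recognizes the digits 0 to 9. The images have been processed to the same color and size: black-and-white images 32 pixels wide and 32 pixels high.
1. Preparing data: converting images to test vectors
The images are stored in the directory trainingDigits, which contains about 2000 examples, roughly 200 samples per digit; the directory testDigits contains about 900 test samples.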
from IPython.display import Image
Image(filename="./data/2_2.png",width=500)
Image(filename="./data/2_3.png",width=500)
Image(filename="./data/2_4.png",width=500)
We convert each 32×32 binary image matrix into a 1×1024 vector. First we write a function, imgTvector, that converts an image into a vector.
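def imgTvector(filename):
    # Row vector holding the 32*32 pixels of one image
    returnVect=np.zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr=fr.readline()
            for j in range(32):
                # Pixel (row i, column j) goes to position 32*i+j
                returnVect[0,32*i+j]=int(lineStr[j])
    return returnVect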
testVector=imgTvector("./data/digits/testDigits/0_13.txt")
testVector[0,0:31]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
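The index arithmetic `32*i+j` is a row-major flattening: row i of the image occupies positions 32*i to 32*i+31 of the vector. A scaled-down sketch on a made-up 4×4 image makes this easy to inspect by eye; `image_text_to_vector` and the sample rows are hypothetical.

```python
def image_text_to_vector(lines, size):
    """Flatten size x size rows of '0'/'1' characters into one list, row by row."""
    vector = [0] * (size * size)
    for i in range(size):
        row = lines[i].strip()
        for j in range(size):
            # Pixel (row i, column j) lands at position size*i + j
            vector[size * i + j] = int(row[j])
    return vector


# A made-up 4x4 "image" of a ring
sample = ["0110", "1001", "1001", "0110"]
vector = image_text_to_vector(sample, 4)
print(vector)  # [0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
```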
2. Testing the algorithm: recognizing handwritten digits with kNN
from os import listdir
def handwritingClassTest():
    hwLabels=[]
    # Load the training set: one 1024-element row per image file
    trainingFileList=listdir("./data/digits/trainingDigits/")
    m=len(trainingFileList)
    trainingMat=np.zeros((m,1024))
    for i in range(m):
        fileNameStr=trainingFileList[i]
        # File names look like 0_13.txt; the part before "_" is the digit
        fileStr=fileNameStr.split(".")[0]
        classNumStr=int(fileStr.split("_")[0])
        hwLabels.append(classNumStr)
        trainingMat[i,:]=imgTvector("./data/digits/trainingDigits/%s"%fileNameStr)
    testFileList=listdir("./data/digits/testDigits/")
    errorCount=0.0
    mTest=len(testFileList)
    # Classify every test image and count the mistakes
    for i in range(mTest):
        fileNameStr=testFileList[i]
        fileStr=fileNameStr.split(".")[0]
        classNumStr=int(fileStr.split("_")[0])
        vectorUnderTest=imgTvector("./data/digits/testDigits/%s"%fileNameStr)
        classifierResult=classMovieTest(vectorUnderTest,trainingMat,hwLabels,3)
        print("The classifier came back with:%d,The real answer is:%d"%(classifierResult,classNumStr))
        if (classifierResult!=classNumStr):
            errorCount+=1.0
    print("The total number of errors is:%d"%errorCount)
    print("The total error rate is:%f"%(errorCount/float(mTest)))
handwritingClassTest()
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
The classifier came back with:0,The real answer is:0
.
.
.
The classifier came back with:9,The real answer is:9
The classifier came back with:9,The real answer is:9
The classifier came back with:9,The real answer is:9
The total number of errors is:10
The total error rate is:0.010571
On the handwritten digits data set, kNN reaches an error rate of about 1.1%.
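The true label of each image comes from its file name, as in 0_13.txt. A small sketch of that extraction, assuming every file follows the `<digit>_<index>.txt` pattern; `label_from_filename` is a hypothetical helper and `9_45.txt` is a made-up name used only for illustration.

```python
import os


def label_from_filename(filename):
    """Return the digit encoded before the underscore in a name like 0_13.txt."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    return int(stem.split("_")[0])


print(label_from_filename("./data/digits/testDigits/0_13.txt"))  # 0
print(label_from_filename("9_45.txt"))  # 9 (made-up file name)
```

Using `os.path.basename` lets the helper accept either a bare file name or a full path.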