Given a small amount of training data together with its class labels, we can train a model that classifies future data. This approach is called supervised learning, because the learning process happens under the supervision of a teacher, namely the correct class of each example. When no labels are available for a classifier to learn from, we turn to clustering instead. Clustering puts similar data into the same cluster and dissimilar data into different clusters. When looking for similar posts, however, the thorniest problem is how to measure the similarity between two text documents.
From a machine learning perspective, raw text is of little direct use; only once we convert it into meaningful numbers can it be fed into a machine learning algorithm. The same holds for operations on text such as measuring similarity. One fairly robust approach to measuring text similarity is the bag-of-words method, which is based on simple word-frequency counting: every word that appears in a post is counted, and the counts are collected into a vector representing that post. This step is called vectorization. For the word counting itself, scikit-learn's CountVectorizer does the job very efficiently.
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer instance
vectorizer = CountVectorizer()

# The example texts
content = ['a large typically semicircular in cross section',
           'the it convex in cross a']

x = vectorizer.fit_transform(content)
print(vectorizer.get_feature_names())
print(x.toarray().transpose())
The output is as follows:
['convex', 'cross', 'in', 'it', 'large', 'section', 'semicircular', 'the', 'typically']
[[0 1]
[1 1]
[1 1]
[0 1]
[1 0]
[1 0]
[1 0]
[0 1]
[1 0]]
Calling toarray() converts the sparse matrix x into an ndarray so the feature vectors can be read off; the transpose puts one word per row, so each of the two columns above is the count vector of one document.
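To read the matrix more easily, we can pair each feature name with its row of counts. The following is a small sketch reusing the snippet above; it is not part of the original example:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
content = ['a large typically semicircular in cross section',
           'the it convex in cross a']
x = vectorizer.fit_transform(content)

# Each row of the transposed count matrix belongs to one word, so zipping
# the feature names with those rows prints word/count pairs per document.
for word, counts in zip(vectorizer.get_feature_names(), x.toarray().T):
    print(word, counts)   # e.g. cross [1 1]

Note that single-character tokens such as 'a' are dropped by CountVectorizer's default tokenizer, which is why they never show up as features.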
Let's start simple and experiment on a small dataset first. The following run measures the contents of five toy posts:
from sklearn.feature_extraction.text import CountVectorizer
import os
import scipy as sp

# Directory holding the five toy posts
path = r'C:\Users\TD\Desktop\data\Machine Learning\1400OS_03_Codes\data\toy'

# Read in the posts
posts = [open(os.path.join(path, f), 'r').read() for f in os.listdir(path)]

vectorizer = CountVectorizer(min_df=1)
x_train = vectorizer.fit_transform(posts)
# Number of samples and distinct words: 5 samples, 25 words
num_samples, num_features = x_train.shape

# Vectorize the new post
new_post_vec = vectorizer.transform(['imaging databases'])

def dis_vec(v1, v2):
    delta = v1 - v2
    # Euclidean distance between the raw count vectors
    return sp.linalg.norm(delta.toarray())

for i in range(num_samples):
    post_vec = x_train.getrow(i)
    d = dis_vec(post_vec, new_post_vec)
    print("post {} with distance {:.2f} : {}".format(i + 1, d, posts[i]))
post 1 with distance 4.00 : This is a toy post about machine learning. Actually, it contains not much interesting stuff.
post 2 with distance 1.73 : Imaging databases provide storage capabilities.
post 3 with distance 2.00 : Most imaging databases save images permanently.
post 4 with distance 1.41 : Imaging databases store data.
post 5 with distance 5.10 : Imaging databases store data. Imaging databases store data. Imaging databases store data.
The output shows that post 4 is the closest. But a problem also appears: post 5 is merely post 4 repeated three times, so in principle the two should be at the same distance from the new post, yet their raw distances differ (1.41 vs 5.10). Clearly, raw word counts are too simplistic; repeating a post scales its count vector without changing its content. We need to normalize each count vector to a unit vector of length 1.
def dis_vec(v1, v2):
    # Scale each count vector to unit length before comparing
    v1 = v1.toarray() / sp.linalg.norm(v1.toarray())
    v2 = v2.toarray() / sp.linalg.norm(v2.toarray())
    # Euclidean distance between the normalized vectors
    return sp.linalg.norm(v1 - v2)
The results are now:
post 1 with distance 1.41 : This is a toy post about machine learning. Actually, it contains not much interesting stuff.
post 2 with distance 0.86 : Imaging databases provide storage capabilities.
post 3 with distance 0.92 : Most imaging databases save images permanently.
post 4 with distance 0.77 : Imaging databases store data.
post 5 with distance 0.77 : Imaging databases store data. Imaging databases store data. Imaging databases store data.
This looks much more sensible: post 4 and post 5 are now at exactly the same distance from the new post.
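As a quick sanity check (a minimal sketch, not from the original code), we can confirm that a post and its threefold repetition now map to the very same unit vector, so their distance collapses to zero:

import scipy as sp
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
x = vectorizer.fit_transform(['Imaging databases store data.',
                              'Imaging databases store data. ' * 3])

v1 = x.getrow(0).toarray()
v2 = x.getrow(1).toarray()

# v2 is just 3 * v1; dividing each vector by its norm makes them identical
v1 = v1 / sp.linalg.norm(v1)
v2 = v2 / sp.linalg.norm(v2)
print(sp.linalg.norm(v1 - v2))   # prints 0.0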