  • Clustering: Finding Related Posts

From a small amount of training data and its corresponding labels, we trained a model that can classify future data. This approach is called supervised learning, because the learning happens under the supervision of a teacher, and that teacher is the correct label of each data point. When there are no labels for a classifier to learn from, we use clustering to reach the same goal: clustering puts similar items in the same cluster and dissimilar items in different clusters. When looking for related posts, however, the thorniest problem is how to measure the similarity between two text documents.
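
To make the clustering idea concrete, here is a minimal sketch (not from the original post) that groups a few made-up texts with scikit-learn's KMeans; the toy posts and the cluster count are assumptions chosen purely for illustration:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import CountVectorizer
    
    # three made-up posts: two about imaging databases, one about machine learning
    texts = ['imaging databases store data',
             'machine learning is fun',
             'imaging databases provide storage']
    x = CountVectorizer().fit_transform(texts)
    # KMeans works directly on the sparse count matrix produced above
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    print(km.fit_predict(x))  # e.g. [0 1 0]: the two database posts share a cluster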

From a machine learning point of view, raw text by itself is of little use; only once we convert it into meaningful numbers can we feed it to a machine learning algorithm. The same goes for operations on text such as measuring similarity. A fairly robust way to measure the similarity of texts is the bag-of-words approach, which relies on simple word-frequency counts: for every word that appears in a post, the count is recorded and the counts are assembled into a vector. This step is called vectorization. For the counting itself, CountVectorizer from the scikit-learn package does the job very efficiently.

    from sklearn.feature_extraction.text import CountVectorizer
    
    # create a CountVectorizer instance
    vectorizer = CountVectorizer()
    # two short example texts
    content = ['a large typically semicircular in cross section', 'the it convex in cross a']
    x = vectorizer.fit_transform(content)
    # note: the default tokenizer drops single-character tokens such as 'a'
    print(vectorizer.get_feature_names())
    print(x.toarray().transpose())
    

The output is:

    ['convex', 'cross', 'in', 'it', 'large', 'section', 'semicircular', 'the', 'typically']
    [[0 1]
     [1 1]
     [1 1]
     [0 1]
     [1 0]
     [1 0]
     [1 0]
     [0 1]
     [1 0]]

Here x is converted to an ndarray object and the feature vectors are read off from it; after the transpose, each row corresponds to a word and each column to one of the texts.

Let's start simple and experiment with simple data first. The test below uses the contents of five files:

    from sklearn.feature_extraction.text import CountVectorizer
    import os
    import scipy as sp
    import scipy.linalg  # make sp.linalg available under modern SciPy
    
    path = r'C:\Users\TD\Desktop\data\Machine Learning\1400OS_03_Codes\data\toy'
    # read every post in the directory
    posts = [open(os.path.join(path, f), 'r').read() for f in os.listdir(path)]
    vectorizer = CountVectorizer(min_df=1)
    x_train = vectorizer.fit_transform(posts)
    # number of samples and of distinct words: 5 samples, 25 words
    num_samples, num_features = x_train.shape
    # vectorize the new post
    new_post_vec = vectorizer.transform(['imaging databases'])
    
    def dis_vec(v1, v2):
        delta = v1 - v2
        # Euclidean distance between the two count vectors
        return sp.linalg.norm(delta.toarray())
    
    for i in range(num_samples):
        post_vec = x_train.getrow(i)
        d = dis_vec(post_vec, new_post_vec)
        print("post {} with distance {:.2f} :  {}".format(i + 1, d, posts[i]))
    

    post 1 with distance 4.00 :  This is a toy post about machine learning. Actually, it contains not much interesting stuff.
    post 2 with distance 1.73 :  Imaging databases provide storage capabilities.
    post 3 with distance 2.00 :  Most imaging databases safe images permanently.
    post 4 with distance 1.41 :  Imaging databases store data.
    post 5 with distance 5.10 :  Imaging databases store data. Imaging databases store data. Imaging databases store data.

The output shows that post 4 is the closest post. But a problem also appears: post 5 merely repeats post 4 three times, so in theory the two distances should be identical. Clearly, using raw word counts is too simplistic; we need to normalize each vector to a unit vector of length 1.

    def dis_vec(v1, v2):
        # scale both vectors to unit length before comparing
        v1 = v1.toarray() / sp.linalg.norm(v1.toarray())
        v2 = v2.toarray() / sp.linalg.norm(v2.toarray())
        # Euclidean distance between the normalized vectors
        return sp.linalg.norm(v1 - v2)
    

Re-running the loop with the new dis_vec gives the following results:

    post 1 with distance 1.41 :  This is a toy post about machine learning. Actually, it contains not much interesting stuff.
    post 2 with distance 0.86 :  Imaging databases provide storage capabilities.
    post 3 with distance 0.92 :  Most imaging databases safe images permanently.
    post 4 with distance 0.77 :  Imaging databases store data.
    post 5 with distance 0.77 :  Imaging databases store data. Imaging databases store data. Imaging databases store data.
    

This looks much more reasonable: the distances of post 4 and post 5 are now exactly the same.
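
To see why the two distances must coincide now, note that post 5's count vector is exactly three times post 4's, and scaling a vector does not change its direction, so both posts normalize to the same unit vector. A quick check with made-up counts (a sketch, not part of the original script):

    import numpy as np
    import scipy as sp
    import scipy.linalg  # as in the earlier example
    
    v4 = np.array([1, 1, 1, 1])   # made-up counts for the four words of post 4
    v5 = 3 * v4                   # post 4 repeated three times
    u4 = v4 / sp.linalg.norm(v4)
    u5 = v5 / sp.linalg.norm(v5)
    print(np.allclose(u4, u5))    # True: both normalize to the same unit vector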

  • Original post: https://www.cnblogs.com/td15980891505/p/6024290.html