<tf-idf + cosine similarity> Computing the similarity of articles

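Background:

(1) tf-idf

The guiding idea behind using a word's TF-IDF value to measure its importance in a document: if a word is rare across the corpus but occurs many times in this article, it very likely reflects what is distinctive about the article, and is exactly the kind of keyword we want.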

tf–idf is the product of two statistics, term frequency and inverse document frequency.

     //Various ways for determining the exact values of both statistics exist.

tf–idf = tf × idf

In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d.

Other possibilities include:

- Boolean "frequencies": tf(t,d) = 1 if t occurs in d and 0 otherwise;

- logarithmically scaled frequency: tf(t,d) = 1 + log f_{t,d}, or zero if f_{t,d} is zero;

- augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the maximum raw frequency of any term in the document:

        tf(t,d) = 0.5 + 0.5 × f_{t,d} / max{ f_{t',d} : t' ∈ d }
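As a rough illustration of the weightings listed above, the sketch below computes all four for one tokenized document. The function name `term_frequencies` and the sample tokens are made up for this example, and the natural logarithm is an assumption, since the text does not fix a base.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """Return four tf weightings for every distinct term of one tokenized document."""
    counts = Counter(tokens)                 # raw frequency f_{t,d}
    if not counts:
        return {}
    max_count = max(counts.values())         # count of the most frequent term
    return {
        term: {
            "raw": f,
            "boolean": 1,                    # the term occurs in the document
            "log": 1 + math.log(f),          # natural log (assumed); f >= 1 here
            "augmented": 0.5 + 0.5 * f / max_count,
        }
        for term, f in counts.items()
    }

# Hypothetical sample document
weights = term_frequencies(["a", "b", "a", "c", "a", "b"])
```

Terms that do not occur in the document are simply absent from the result, which corresponds to a weight of zero under the raw, Boolean, and logarithmic definitions.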

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.
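The passage above gives no formula for idf. One common choice, and the one the worked example later in this article appears to use, is idf(t) = log(N / n_t), where N is the number of documents and n_t the number of documents containing t. The sketch below assumes that form with a base-2 logarithm; the function name and sample data are hypothetical.

```python
import math

def inverse_document_frequency(documents):
    """Map each term to log2(N / n_t) over a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):             # count each document at most once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n_docs / n) for term, n in doc_freq.items()}

# Hypothetical two-document corpus
idf = inverse_document_frequency([["x", "y"], ["x", "z", "z"]])
```

A term that appears in every document gets idf 0, so it contributes nothing to tf-idf no matter how often it occurs.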

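(2) Cosine similarity

The cosine lies in [-1, 1]. The closer the value is to 1, the closer the two vectors are in direction; the closer it is to -1, the more opposite their directions; a value near 0 means the two vectors are nearly orthogonal.

Similarity is usually normalized to the interval [0, 1], so the cosine similarity is expressed as cosineSIM = 0.5cosθ + 0.5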
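A minimal sketch of the two quantities just described, using plain Python lists as vectors. The function names `cosine` and `cosine_sim` are chosen for this example only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def cosine_sim(u, v):
    """Rescale the cosine from [-1, 1] to [0, 1]."""
    return 0.5 * cosine(u, v) + 0.5
```

Note that term-frequency vectors have no negative entries, so their raw cosine already falls in [0, 1]; after rescaling, such vectors land in [0.5, 1].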

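Procedure:

(1) Use the TF-IDF algorithm to find the keywords of the two articles;

(2) Take a number of keywords from each article (for fairness, usually the same number from each), merge them into one set, and compute each article's term frequency for every word in that set

(Note 1: to offset differences in article length, relative term frequencies can be used; Note 2: the number of distinct words selected in this step determines the length of the term-frequency vectors);

(3) Generate the term-frequency vector of each article (note: the vectors of all articles have the same length, and the element at a given position corresponds to the same word);

(4) Compute the cosine similarity of the two vectors; the larger the value, the more similar the articles.

Note: the tf-idf values are used only in step (1).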
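Steps (2) to (4) could be sketched as follows, assuming the articles are already tokenized and the keywords from step (1) are given. All names and the sample data here are hypothetical.

```python
import math
from collections import Counter

def frequency_vectors(tokens_a, tokens_b, keywords_a, keywords_b, relative=True):
    """Steps (2)-(3): merge the keyword lists, then count each keyword in both articles."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    vocabulary = list(dict.fromkeys(list(keywords_a) + list(keywords_b)))
    vectors = []
    for tokens in (tokens_a, tokens_b):
        counts = Counter(tokens)
        scale = len(tokens) if relative and tokens else 1   # relative term frequency
        vectors.append([counts[word] / scale for word in vocabulary])
    return vocabulary, vectors[0], vectors[1]

def cosine(u, v):
    """Step (4): cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical tokenized articles and keyword lists
vocab, vec_a, vec_b = frequency_vectors(
    ["red", "fox", "runs"], ["red", "fox", "fox", "sleeps"],
    keywords_a=["fox", "runs"], keywords_b=["fox", "sleeps"],
)
similarity = cosine(vec_a, vec_b)
```

Because the cosine depends only on direction, dividing a vector by the article length does not change the result beyond rounding; relative frequencies matter only if the vectors are later compared with a length-sensitive measure.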

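Example:

Article A: 我喜欢看小说。 ("I like reading novels.")

Article B: 我不喜欢看电视，也不喜欢看电影。 ("I don't like watching TV, and I don't like watching movies either.")

Step 1: word segmentation.

    Article A: 我/喜欢/看/小说。

    Article B: 我/不/喜欢/看/电视，也/不/喜欢/看/电影。

Step 2: list all the words.

    我 (I), 喜欢 (like), 看 (watch/read), 小说 (novel), 电视 (TV), 电影 (movie), 不 (not), 也 (also).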

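Step 3: compute the term frequency tf of every word in each document.

    Article A: 我 1, 喜欢 1, 看 1, 小说 1, 电视 0, 电影 0, 不 0, 也 0.

    Article B: 我 1, 喜欢 2, 看 2, 小说 0, 电视 1, 电影 1, 不 2, 也 1.

Step 4: compute the inverse document frequency idf of every word, as log(N/n), where N = 2 is the number of articles, n is the number of articles containing the word, and log is base 2.

    我 log(2/2)=0, 喜欢 log(2/2)=0, 看 log(2/2)=0, 小说 log(2/1)=1, 电视 log(2/1)=1, 电影 log(2/1)=1, 不 log(2/1)=1, 也 log(2/1)=1.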

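Step 5: compute the tf-idf value of every word in each document.

    Article A: 我 0, 喜欢 0, 看 0, 小说 1, 电视 0, 电影 0, 不 0, 也 0.

    Article B: 我 0, 喜欢 0, 看 0, 小说 0, 电视 1, 电影 1, 不 2, 也 1.

Step 6: select the keywords of each article (here the 3 words with the highest tf-idf; ties are broken at random).

    Article A: 我 0, 喜欢 0, 小说 1

    Article B: 电视 1, 电影 1, 不 2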
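Steps 3 to 6 can be reproduced with a short script. This is only a sketch: the tokens are typed in by hand rather than produced by a segmenter, idf is taken as log2(N/n), and ties are broken by vocabulary order instead of at random so that the output is repeatable.

```python
import math
from collections import Counter

doc_a = ["我", "喜欢", "看", "小说"]
doc_b = ["我", "不", "喜欢", "看", "电视", "也", "不", "喜欢", "看", "电影"]
vocabulary = ["我", "喜欢", "看", "小说", "电视", "电影", "不", "也"]
docs = [doc_a, doc_b]

# Step 3: raw term frequency per article
tf = [Counter(doc) for doc in docs]
# Step 4: idf = log2(N / number of articles containing the word)
idf = {w: math.log2(len(docs) / sum(w in doc for doc in docs)) for w in vocabulary}
# Step 5: tf-idf per article
tfidf = [{w: counts[w] * idf[w] for w in vocabulary} for counts in tf]

def top_keywords(scores, k=3):
    """Step 6: the k highest-scoring words; ties keep vocabulary order (stable sort)."""
    return sorted(scores, key=lambda w: -scores[w])[:k]

keywords_a = top_keywords(tfidf[0])
keywords_b = top_keywords(tfidf[1])
```

With this tie-breaking rule the script selects 小说, 我, 喜欢 for article A and 不, 电视, 电影 for article B, the same keyword sets as in step 6.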

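Step 7: build the term-frequency vectors used to compute the similarity (over the words selected in the previous step: 我, 喜欢, 小说, 电视, 电影, 不).

    Article A: [1 1 1 0 0 0]

    Article B: [1 2 0 1 1 2]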

Step 8: compute the cosine similarity.

    cosθ = 3/sqrt(33) = 0.5222329678670935

    cosineSIM(A, B) = 0.5222329678670935 × 0.5 + 0.5 = 0.7611164839335467
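For readers who want the intermediate arithmetic, the value 3/sqrt(33) follows from the two vectors of step 7, A = (1, 1, 1, 0, 0, 0) and B = (1, 2, 0, 1, 1, 2):

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{1\cdot 1 + 1\cdot 2 + 1\cdot 0 + 0\cdot 1 + 0\cdot 1 + 0\cdot 2}
         {\sqrt{1^2 + 1^2 + 1^2}\;\sqrt{1^2 + 2^2 + 0^2 + 1^2 + 1^2 + 2^2}}
  = \frac{3}{\sqrt{3}\,\sqrt{11}}
  = \frac{3}{\sqrt{33}}
  \approx 0.5222
```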

References:

(1) https://en.wikipedia.org/wiki/Tf%E2%80%93idf

(2) http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html

Original article: https://www.cnblogs.com/wxiaoli/p/6940702.html