TF-IDF stands for term frequency-inverse document frequency. It is a numerical statistic that reveals how important a word is to a document.
From wikipedia.org
The tf*idf weight (term frequency–inverse document frequency, a.k.a. TF-IDF) is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
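tf-idf reveals how important a query term is to a document: tf rises as the term occurs more often in the document, while idf falls as the term becomes more frequent across the whole corpus. Taken together, the two reflect the importance of the term.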
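idf = log(N/n(t)), where N is the total number of documents and n(t) is the number of documents that contain the term. Since n(t) <= N, we have N/n(t) >= 1, so idf >= 0. Because log is an increasing function, a larger n(t) gives a smaller N/n(t): the more documents contain the term, the more common the term is, and the smaller its idf.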
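To make the behaviour of the formula concrete, here is a minimal sketch that evaluates idf for a fixed N and a growing n(t). The function name `idf`, the natural-log base, and the sample numbers are assumptions made for illustration; the text above does not fix the base of the logarithm.

```python
import math


def idf(total_docs, docs_with_term):
    """idf = log(N / n(t)), using the natural log (an assumed base)."""
    if docs_with_term <= 0 or docs_with_term > total_docs:
        raise ValueError("n(t) must satisfy 0 < n(t) <= N")
    return math.log(total_docs / docs_with_term)


if __name__ == "__main__":
    N = 1000  # hypothetical corpus size
    for n_t in (1, 10, 100, 1000):
        # idf shrinks as n(t) grows and reaches 0 when every document has the term
        print(n_t, idf(N, n_t))
```

A term that appears in every document gets idf = log(1) = 0, which is why very common words contribute nothing to the final score.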
Finally, tf*idf is used to express the importance of a term.
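The whole calculation can be sketched on a toy corpus as follows. This is only an illustration under stated assumptions: tf is taken to be the raw count of the term in the document, the logarithm is natural, and the function name `tf_idf` and the sample documents are hypothetical. Real libraries often apply smoothing or normalization that is not shown here.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """Score `term` for `doc` as tf * log(N / n(t)).

    Assumptions: tf is the raw count of the term in the document,
    and the log is the natural log.
    """
    tf = Counter(doc)[term]
    n_t = sum(1 for d in corpus if term in d)  # documents containing the term
    if n_t == 0:
        return 0.0  # term absent from the corpus: idf is undefined, score it as 0
    return tf * math.log(len(corpus) / n_t)


# Hypothetical, already tokenized documents.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "bird", "sang"],
]

if __name__ == "__main__":
    for term in ("the", "cat", "mat"):
        print(term, tf_idf(term, corpus[0], corpus))
```

In the first document, "the" occurs twice but appears in all three documents, so its score is 0. "mat" occurs once but only in that document, so it scores highest of the three terms.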