I first ran into TF-IDF back in freshman year and always thought it was dead simple. Too young, too simple, as it turns out:
even now I'm not sure I completely understand it.
Core idea:
A word's weight in a document depends on two things: how often the word occurs and how much the word is worth.
Occurrence count: repetition is emphasis, so the more often a word appears, the more important it is.
Word value: the more documents a word appears in, the more indiscriminate and the cheaper it becomes.
Formulas:
Term frequency: TF = (occurrences of the word in the document) / (total number of words in the document)
Inverse document frequency: IDF = log( total number of documents / (number of documents containing the word + 1) )
TF-IDF = TF * IDF
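To make the formulas concrete, here is a tiny worked example in plain Python. The three "documents" are made up, and it uses log base 10, which is what my own code below uses:

import math

# Made-up toy corpus: three tiny "documents", already tokenized
docs = [["cat", "sat", "mat"],
        ["cat", "ate", "fish"],
        ["dog", "sat", "mat"]]

N = len(docs)      # total number of documents
doc = docs[1]      # score the words of the second document

for word in set(doc):
    tf = doc.count(word) / len(doc)          # TF = occurrences / total words in the document
    df = sum(1 for d in docs if word in d)   # number of documents containing the word
    idf = math.log10(N / (df + 1))           # IDF = log( total docs / (docs containing it + 1) )
    print(word, round(tf * idf, 4))

"cat" shows up in two of the three documents, so its IDF (and therefore its score) drops to zero, while the rarer "ate" and "fish" come out on top. That is exactly the "the more documents it appears in, the cheaper it is" idea from above.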
The actual computation:
1. My own code:
# Since the point of computing this is to get feature values, I used jieba, a lightweight and handy Chinese word segmentation package; see its GitHub: https://github.com/hosiet/jieba
# The final results are stored in a file as JSON
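For reference, jieba.lcut just turns a Chinese string into a list of tokens; a minimal sketch (the sentence is made up, and the exact segmentation can differ between jieba versions and dictionaries):

import jieba

sentence = "我从大一开始就接触TF-IDF"   # made-up sample sentence
print(jieba.lcut(sentence))
# something like ['我', '从', '大一', '开始', '就', '接触', 'TF', '-', 'IDF']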
At first I wrote some code of my own to do the computation:
# coding=utf-8
import jieba
import re
import math
import json

with open('stop_words.txt', 'r', encoding='utf-8') as f:
    stopwords = [x[:-1] for x in f]

data = []        # every non-stopword occurrence in the corpus
tf = {}          # term frequency (raw count)
doc_num = {}     # number of documents each word appears in
tfidf = {}


def calcu_tf():
    '''Compute TF values'''
    with open('exercise.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()
    global TOTAL
    TOTAL = 0
    for l in lines:
        # Strip non-word characters and segment with jieba
        lx = re.sub(r'\W', '', l)
        words = jieba.lcut(lx)
        # A word may occur several times in one line,
        # but it should only count once toward document frequency
        tmp = {}
        for i in words:
            if i in stopwords:
                continue
            data.append(i)
            if i not in tmp:
                tmp[i] = 1
                doc_num[i] = doc_num.get(i, 0) + 1
        # Each line counts as one document
        TOTAL += 1
    dataset = set(data)
    for i in dataset:
        tf[i] = data.count(i)


def calcu_tfidf():
    '''Compute TF-IDF values'''
    for i in tf:
        tfidf[i] = tf[i] * math.log10(TOTAL / (doc_num[i] + 1))


if __name__ == '__main__':
    calcu_tf()
    calcu_tfidf()
    print(tfidf)
    with open('tfidf.json', 'w', encoding="utf-8") as file:
        # ensure_ascii=False, otherwise the file is full of \uXXXX escapes
        file.write(json.dumps(tfidf, ensure_ascii=False, indent=2))
[Screenshots omitted: exercise.txt, a test document I put together myself, and part of the computed results]
Total running time: 1.54041444018928 seconds.
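Since the whole point (see the comment above) is to get feature values, here is a small sketch of how the saved scores might be turned into a top-N keyword list. It reads the tfidf.json written by the script above; the choice of N = 10 is arbitrary:

import json

with open('tfidf.json', 'r', encoding='utf-8') as f:
    scores = json.load(f)

# Ten highest-scoring words; 10 is an arbitrary cut-off
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]
for word, score in top:
    print(word, score)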
2. Using the sklearn package
But later I figured that if something ready-made will do the job, I might as well use it; after all, it means a lot less code.
So the scikit-learn version of the TF-IDF computation was born.
# Installing the sklearn package is covered in another post of mine: http://www.cnblogs.com/rucwxb/p/7297733.html
Computation steps (a minimal sketch of how the two chain together follows this list):
CountVectorizer computes TF (the term counts)
TfidfTransformer computes IDF
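The minimal sketch, on a made-up toy corpus of space-joined tokens (the same shape of input the core code below prepares with jieba):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["我 喜欢 机器 学习", "我 喜欢 自然 语言 处理"]   # made-up, already segmented and space-joined

counts = CountVectorizer(token_pattern=r'\b\w+\b', lowercase=False).fit_transform(corpus)
tfidf_mat = TfidfTransformer().fit_transform(counts)   # one row per document, one column per word
print(tfidf_mat.toarray())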
Core code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import time
import jieba
import re
import json


def calcu_tfidf():
    '''Compute TF-IDF values with scikit-learn'''
    corpus = []
    idfDic = {}
    tf = {}
    tfidf = {}
    with open('exercise.txt', 'r', encoding='utf-8') as f:
        for x in f:
            # Strip non-word characters, segment with jieba, drop stopwords
            lx = re.sub(r'\W', '', x)
            jb = jieba.lcut(lx)
            tokens = [i for i in jb if i not in stopwords]
            # CountVectorizer expects space-separated "words"
            corpus.append(" ".join(tokens))
    # Turn the corpus into a term-count matrix (one row per document)
    vectorizer = CountVectorizer(ngram_range=(1, 1), lowercase=False,
                                 token_pattern=r'\b\w+\b', min_df=1)
    transformer = TfidfTransformer()
    # Count occurrences of each word, then fit the IDF weights
    tf_mat = vectorizer.fit_transform(corpus)
    transformer.fit(tf_mat)
    # All keywords in the bag of words
    words = vectorizer.get_feature_names()   # get_feature_names_out() on newer scikit-learn
    # Total count of each word across the corpus, and its IDF
    tfs = tf_mat.sum(axis=0).tolist()[0]
    for i, word in enumerate(words):
        idfDic[word] = transformer.idf_[i]
        tf[word] = tfs[i]
    # TF-IDF = TF * IDF
    for i in words:
        tfidf[i] = idfDic[i] * tf[i]
    return tfidf


if __name__ == '__main__':
    startT = time.perf_counter()   # time.clock() was removed in Python 3.8
    with open('stop_words.txt', 'r', encoding='utf-8') as f:
        stopwords = [x[:-1] for x in f]
    tfidf = calcu_tfidf()
    with open('tfidf2.json', 'w', encoding="utf-8") as file:
        # ensure_ascii=False, otherwise the file is full of \uXXXX escapes
        file.write(json.dumps(tfidf, ensure_ascii=False, indent=2))
    endT = time.perf_counter()
    print(endT - startT)
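One thing to keep in mind when comparing the two scripts: scikit-learn's IDF is not the log10( total docs / (docs containing it + 1) ) from the formula above. With the default smooth_idf=True, transformer.idf_ is ln((1+N)/(1+df)) + 1, so the two versions will not produce identical numbers. Also, CountVectorizer plus TfidfTransformer can be collapsed into a single TfidfVectorizer; a minimal sketch with a made-up corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["我 喜欢 机器 学习", "我 喜欢 自然 语言 处理"]   # made-up, space-joined tokens

# One-step equivalent of CountVectorizer + TfidfTransformer
vec = TfidfVectorizer(token_pattern=r'\b\w+\b', lowercase=False)
mat = vec.fit_transform(corpus)
print(vec.get_feature_names_out())   # the vocabulary (get_feature_names() on older scikit-learn)
print(mat.toarray())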