Feature Engineering · Extracting Features with TF-IDF

This article introduces TF-IDF, a commonly used and effective feature extraction method for text processing.

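1. Extracting TF features

TF (term frequency) is one way to quantify the information in a text. Put simply, it is each word's number of occurrences in the text divided by the total number of words in that text.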
    def computeTF(wordDict, bow):
        # Term frequency: each word's count divided by the document length.
        tfDict = {}
        bowCount = len(bow)
        for word, count in wordDict.items():
            tfDict[word] = count / float(bowCount)
        return tfDict
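- The argument wordDict is a dictionary mapping each word to its number of occurrences in the document; bow is the list of all words (tokens) in the document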
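The function expects wordDict and bow to be prepared beforehand. A minimal sketch of one way to build them, assuming lowercasing and whitespace tokenization (the sample sentence and the name doc are illustrative assumptions, not from the original article):

```python
from collections import Counter

def computeTF(wordDict, bow):
    # Same definition as above: count divided by document length.
    bowCount = len(bow)
    return {word: count / bowCount for word, count in wordDict.items()}

# Hypothetical sample document; whitespace tokenization is an assumption.
doc = "the cat sat on the mat"
bow = doc.lower().split()        # list of all tokens in the document
wordDict = dict(Counter(bow))    # word -> number of occurrences

tf = computeTF(wordDict, bow)
print(tf["the"])  # 2 occurrences out of 6 tokens
```

Because every token is counted exactly once, the TF values of a document sum to 1.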
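2. Extracting IDF features

IDF (inverse document frequency) measures how common a word is across the whole collection, and therefore how important it is in general. It is usually computed by dividing the total number of documents by the number of documents that contain the word, then taking the logarithm.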
    import math

    def computeIDF(docList):
        N = len(docList)
        # Document frequency: number of documents in which each word occurs.
        # Words that occur in no document are skipped, avoiding a division by zero.
        idfDict = {}
        for doc in docList:
            for word, val in doc.items():
                if val > 0:
                    idfDict[word] = idfDict.get(word, 0) + 1
        # IDF = log10(total documents / documents containing the word)
        for word, val in idfDict.items():
            idfDict[word] = math.log10(N / float(val))
        return idfDict
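- The argument is a list of word dictionaries, one per document; the keys of each dictionary are words and the values are their counts in that document. The function counts how many documents contain each word and returns a dictionary mapping each word to its IDF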
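To see the effect on a small corpus, here is a sketch with two hypothetical documents over a shared vocabulary. The function is restated in compact form so that the snippet runs on its own; the documents and variable names are illustrative assumptions.

```python
import math

def computeIDF(docList):
    # Document frequency: in how many documents does each word occur?
    N = len(docList)
    df = {}
    for doc in docList:
        for word, count in doc.items():
            if count > 0:
                df[word] = df.get(word, 0) + 1
    # IDF = log10(total documents / documents containing the word)
    return {word: math.log10(N / n_docs) for word, n_docs in df.items()}

# Hypothetical per-document word counts over a shared vocabulary.
docA = {"the": 2, "cat": 1, "dog": 0}
docB = {"the": 1, "cat": 0, "dog": 1}

idfs = computeIDF([docA, docB])
print(idfs)
```

Here "the" occurs in both documents, so its IDF is log10(2 / 2) = 0, while "cat" and "dog" each occur in one document and get log10(2), roughly 0.301.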
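3. Extracting TF-IDF features

TF-IDF is the product tf * idf, and it measures how informative a word is for one document relative to the whole collection. Suppose document A contains n words, a given word occurs t times in it, that word appears in w documents, and there are x documents in total
- Then tf = t / n; the larger tf is, the more information the word carries within the document
- And idf = log(x / w); the smaller idf is, the more common the word is across all documents and the less it discriminates between them
- So tf-idf = (t / n) * log(x / w); the smaller w is, the larger tf-idf becomes, meaning the word carries more information in the document and discriminates better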
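A short worked example may make the formulas concrete. The numbers below are hypothetical, and base-10 logarithms are assumed to match the computeIDF function in this article.

```python
import math

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
t, n = 3, 100      # the word occurs 3 times in a 100-word document
x, w = 1000, 10    # 1000 documents in total, 10 of them contain the word

tf = t / n                  # 3 / 100
idf = math.log10(x / w)     # log10(100)
tfidf = tf * idf

# A word found in every document gets idf = log10(1) = 0, hence tf-idf = 0.
idf_common = math.log10(x / x)
print(tf, idf, tfidf, idf_common)
```

With these values tf is 0.03, idf is 2 and tf-idf is 0.06, whereas a word present in all 1000 documents would score 0 no matter how often it occurs.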
    def computeTFIDF(tfBow, idfs):
        # Multiply each word's TF by its IDF.
        tfidf = {}
        for word, val in tfBow.items():
            tfidf[word] = val * idfs[word]
        return tfidf
- The arguments are the dictionaries of TF values (for one document) and IDF values computed earlier
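The three steps can also be combined into a single pass over a corpus. The following is a compact sketch rather than the article's own code: tfidf_corpus is a hypothetical helper name, whitespace tokenization is an assumption, and unlike the functions above it only scores words that actually occur in each document.

```python
import math
from collections import Counter

def tfidf_corpus(texts):
    """Return one {word: tf-idf} dict per text, using tf = count / length
    and idf = log10(N / document frequency), as in the functions above."""
    bows = [text.lower().split() for text in texts]   # assumed tokenizer
    N = len(bows)
    # Each document contributes at most 1 to a word's document frequency.
    df = Counter(word for bow in bows for word in set(bow))
    result = []
    for bow in bows:
        counts = Counter(bow)
        result.append({
            word: (count / len(bow)) * math.log10(N / df[word])
            for word, count in counts.items()
        })
    return result

# Hypothetical two-document corpus.
texts = ["the cat sat", "the dog barked"]
for scores in tfidf_corpus(texts):
    print(scores)
```

In this corpus "the" scores 0 in both documents because it appears in each of them, while every other word scores (1 / 3) * log10(2).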
4. Generating TF-IDF vectors directly with the sklearn API
    from sklearn.feature_extraction.text import TfidfVectorizer

    count_vec = TfidfVectorizer(binary=False, decode_error='ignore', stop_words='english')
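- The parameters above configure the TF-IDF vectorizer instance count_vec. If binary is True, every non-zero term count is set to 1, so only the tf term becomes binary; the output values themselves are not restricted to 0 and 1
Fit the vectorizer on the data and transform the documents into TF-IDF vectors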
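    s1 = 'I love you so much'
    s2 = 'I hate you! shit!'
    s3 = 'I like you, but just like you'
    # fit_transform expects an iterable of strings, one per document
    response = count_vec.fit_transform([s1, s2, s3])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(count_vec.get_feature_names_out().tolist())
    print(response.toarray())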
The output, with English stop words removed, is as follows

    ['hate', 'just', 'like', 'love', 'shit']
    [[0.         0.         0.         1.         0.        ]
     [0.70710678 0.         0.         0.         0.70710678]
     [0.         0.4472136  0.89442719 0.         0.        ]]
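These numbers are not what the hand-written functions in the earlier sections would give. With its default settings, TfidfVectorizer uses raw counts as tf, a smoothed natural-log idf of ln((1 + N) / (1 + df)) + 1, and then scales each row to unit Euclidean length. A minimal sketch that reproduces the row for s3 by hand, assuming those defaults (smooth_idf=True, norm='l2') and that "just" and "like" are the only tokens of s3 left after stop-word removal:

```python
import math

N = 3                            # number of documents
counts = {"just": 1, "like": 2}  # term counts in s3 after stop-word removal
df = {"just": 1, "like": 1}      # each word occurs only in s3

# Smoothed idf, as documented for sklearn's default smooth_idf=True.
idf = {w: math.log((1 + N) / (1 + df[w])) + 1 for w in counts}
raw = {w: counts[w] * idf[w] for w in counts}

# L2 normalization: divide by the Euclidean length of the row.
length = math.sqrt(sum(v * v for v in raw.values()))
row = {w: v / length for w, v in raw.items()}
print(row)
```

Both words have the same idf, so it cancels during normalization and the row reduces to 1/sqrt(5) and 2/sqrt(5), which are the 0.4472136 and 0.89442719 shown in the matrix.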

Original article: https://www.cnblogs.com/yunwangjun-python-520/p/13551687.html
