An analysis of the CountVectorizer() class

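The following links are the main references:

1. sklearn text feature extraction

2. Computing term weights with scikit-learn tf-idf

3. The official sklearn documentation in Chinese

4. sklearn.feature_extraction.text.CountVectorizer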

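A supplementary note: how to use the transform() method of the CountVectorizer() class

transform() converts new text into a feature matrix, but only over features that have already been fixed. That feature set is determined by the corpus passed to an earlier fit_transform() call. See the example:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vec = CountVectorizer()
>>> vec.transform(['Something completely new.']).toarray()
This returns an error, sklearn.exceptions.NotFittedError: CountVectorizer - Vocabulary wasn't fitted. It means there is no vocabulary yet, so the text cannot be transformed. In other words, no vocabulary table has been built, so there are no column indices against which to count the words in the text.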
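The failure described above can be reproduced and handled explicitly. The following is only a minimal sketch: the variable name `fitted` is introduced here for illustration, and the exact wording of the error message varies between scikit-learn versions, so the code prints whatever message it receives rather than assuming one.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()

# transform() needs a vocabulary, which only fit() or fit_transform() builds
# (unless one is passed to the constructor).
try:
    vec.transform(['Something completely new.'])
    fitted = True  # illustrative flag, not part of the library
except NotFittedError as err:
    fitted = False
    print('vectorizer is not fitted yet:', err)

# A fitted vectorizer stores its term-to-column mapping in vocabulary_;
# before fitting, that attribute does not exist.
print(hasattr(vec, 'vocabulary_'))
```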
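>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vec.fit_transform(corpus)
The vocabulary list (get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0 and removed in 1.2; the newer method returns a NumPy array instead of a list):
>>> vec.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
The resulting sparse matrix, converted to a dense array with toarray(), is:
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)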
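Once the vocabulary has been built, transform() can be used to turn new text into a matrix:
>>> vec.transform(['this is']).toarray()
array([[0, 0, 0, 1, 0, 0, 0, 0, 1]], dtype=int64)
>>> vec.transform(['too bad']).toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
A brief analysis: both words of 'this is' are in the vocabulary, so each is counted in its corresponding column. Neither word of 'too bad' is in the vocabulary, so every entry in its row is 0.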
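To make the column mapping concrete, here is a small sketch that uses the same corpus and reads the fitted vocabulary_ attribute, which maps each term to its column index. The third test sentence and the names `new_docs`, `counts`, and `known` are additions for illustration only; the sentence mixes known and unknown words to show that unknown words are simply dropped while the number of columns stays fixed.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vec = CountVectorizer()
vec.fit_transform(corpus)  # builds the vocabulary from the corpus

# Hypothetical new documents; the last one mixes known and unknown words.
new_docs = ['this is', 'too bad', 'this document is too bad']
counts = vec.transform(new_docs).toarray()

# The column count always equals the vocabulary size fixed at fit time.
print(counts.shape)

for doc, row in zip(new_docs, counts):
    # Keep only the vocabulary terms that actually occur in this document.
    known = {term: int(row[col])
             for term, col in vec.vocabulary_.items() if row[col] > 0}
    print(repr(doc), '->', known)
```

If words such as 'too' and 'bad' need to be counted, the vectorizer has to be fitted again on a corpus that contains them, since transform() never extends the vocabulary.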

Original article: https://www.cnblogs.com/zz22--/p/9454234.html