
Text Vectorization

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

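# Requires the NLTK stopwords corpus: run nltk.download('stopwords') once beforehand.
stop_list = list(set(stopwords.words('english')))  # set() removes duplicate entries

corpus = ['This is the first document.',    # the corpus: one string per document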
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# -----------------------------------
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)    # vectorize: builds the bag-of-words count matrix

print(X.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

print()
# -----------------------------------

bigram_vectorizer = CountVectorizer(ngram_range=(1,3),  # n-gram features: unigrams through trigrams
                                    stop_words = stop_list) # stop words to drop before building n-grams
X = bigram_vectorizer.fit_transform(corpus)

print(X.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(bigram_vectorizer.get_feature_names_out().tolist())

print()
# ------------------------------------

analyze = vectorizer.build_analyzer()
print(analyze('This is a text document to analyze.'))

print(vectorizer.transform(['something completely new.',
                            'and this has something old.']).toarray())

Output:

[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

[[1 1 1 0 0 0 0 0 0 0]
 [1 0 0 0 2 1 1 1 0 0]
 [0 0 0 1 0 0 0 0 1 1]
 [1 1 1 0 0 0 0 0 0 0]]
['document', 'first', 'first document', 'one', 'second', 'second document', 'second second', 'second second document', 'third', 'third one']

['this', 'is', 'text', 'document', 'to', 'analyze']
[[0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 1]]
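
To make the matrices above easier to read, here is a minimal pure-Python sketch of the counting that a bag-of-words vectorizer performs. It is an illustration under stated assumptions, not scikit-learn's actual implementation: the helper names (`tokenize`, `ngrams`, `bag_of_words`) are hypothetical, the token pattern is the default documented for CountVectorizer (lowercased words of two or more characters), and the small `demo_stop_words` set stands in for the much larger NLTK list.

```python
import re
from collections import Counter

# Default token pattern documented for CountVectorizer: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text, stop_words=()):
    """Lowercase the text, split it into tokens, and drop stop words."""
    return [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in stop_words]


def ngrams(tokens, min_n=1, max_n=1):
    """Return every space-joined n-gram for n between min_n and max_n."""
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def bag_of_words(corpus, ngram_range=(1, 1), stop_words=()):
    """Return (sorted vocabulary, count matrix) for a list of documents."""
    counts = [Counter(ngrams(tokenize(doc, stop_words), *ngram_range))
              for doc in corpus]
    vocabulary = sorted(set().union(*counts))
    # One row per document, one column per vocabulary term.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

vocab, X = bag_of_words(corpus)
print(vocab)
print(X)

# Hypothetical stand-in for the NLTK English stop word list.
demo_stop_words = {'this', 'is', 'the', 'and'}
ngram_vocab, ngram_X = bag_of_words(corpus, ngram_range=(1, 3),
                                    stop_words=demo_stop_words)
print(ngram_vocab)
print(ngram_X)
```

Stop words are removed before the n-grams are built, which is why 'first document' appears as a feature even though no stop word sits between the two tokens in the filtered text. Each column index corresponds to the term at the same position in the sorted vocabulary.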

Parameters of CountVectorizer and TfidfVectorizer: https://blog.csdn.net/du_qi/article/details/51564303
Stop words: https://www.cnblogs.com/webRobot/p/6079919.html


Original article: https://www.cnblogs.com/holaworld/p/12510477.html