NLP in Practice

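Bag of Words

Represent text data as a bag of words, so that strings become numeric vectors. There are three steps:
(1) Tokenize the text, for example with nltk.
(2) Build the vocabulary by collecting the tokens from the previous step into a list.
(3) Encode. sklearn.feature_extraction.text.CountVectorizer turns each document into a vector of word counts; compare this with word2vec, which learns dense word embeddings instead of counts.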
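
The three steps above can be sketched with the standard library alone. This is only an illustration: the regex tokenizer stands in for nltk, the two sentences are made up, and the helper names `tokenize` and `encode` are hypothetical. In practice CountVectorizer covers the vocabulary and encoding steps.

```python
import re
from collections import Counter

# Made-up example documents.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Step 2: build the vocabulary as a sorted list of unique tokens.
tokenized = [tokenize(d) for d in docs]
vocab = sorted({tok for toks in tokenized for tok in toks})

def encode(tokens):
    """Step 3: count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [encode(toks) for toks in tokenized]
print(vocab)    # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1, 1, 2], [1, 1, 1, 0, 0, 0, 2]]
```

Each vector has one slot per vocabulary word, so word order inside a document is lost; only the counts remain.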

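Reducing Text Features

(1) Remove stop words.
(2) Compute tf-idf and discard the features it rates as unimportant.
(3) Reduce features through stemming and lemmatization.
Stemming
PorterStemmer
LancasterStemmer
SnowballStemmer
Lemmatization
WordNetLemmatizer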
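
A minimal sketch of stop-word removal followed by stemming, assuming nltk is installed. The stop-word set and the token list are made up for illustration. WordNetLemmatizer is left out because it needs the WordNet corpus to be downloaded first (`nltk.download("wordnet")`).

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

# Tiny hand-written stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "a", "of", "and", "are"}

tokens = ["the", "cats", "are", "running", "and", "jumping"]

# (1) Drop stop words.
content_tokens = [t for t in tokens if t not in STOP_WORDS]

# (3) Map each remaining token to its stem with three different stemmers.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}
stems = {
    name: [stemmer.stem(t) for t in content_tokens]
    for name, stemmer in stemmers.items()
}
for name, result in stems.items():
    print(name, result)
```

Inflected forms such as "cats" and "cat" collapse to the same stem, which shrinks the vocabulary. The three stemmers differ in how aggressively they strip suffixes, so it is worth comparing their output on your own data.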

Document Topic Modeling

The output is a set of weighted keywords for each topic, for example:
```
Topic 0 ==> 0.037*"cryptography" + 0.037*"lot" + 0.037*"spent" + 0.037*"studying"
Topic 1 ==> 0.075*"need" + 0.031*"order" + 0.031*"promoting" + 0.031*"talent"
```
Latent Dirichlet Allocation (LDA)
sklearn.decomposition.LatentDirichletAllocation
gensim.models.ldamodel.LdaModel

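Text Sentiment Analysis

sklearn
Naive Bayes classifier (in sklearn these live in sklearn.naive_bayes, e.g. MultinomialNB; the class named NaiveBayesClassifier belongs to nltk)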
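
A minimal sketch of naive Bayes sentiment classification. It assumes sklearn's MultinomialNB on top of bag-of-words counts, since the source does not say which naive Bayes variant was used. The training sentences and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: two positive and two negative reviews.
train_texts = [
    "I love this movie",
    "great film, wonderful acting",
    "terrible plot and bad acting",
    "I hate this boring film",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Bag-of-words counts feed directly into the naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["wonderful great movie", "boring terrible plot"])
print(list(predictions))
```

The pipeline keeps vectorization and classification together, so new text is encoded with the same vocabulary that was learned during training.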

keras
Fully connected network
CNN
LSTM

gensim

models.ldamodel.LdaModel: Latent Dirichlet Allocation
doc2bow(): converts a document into the bag-of-words (BoW) format, a list of (token_id, token_count) tuples

Original article: https://www.cnblogs.com/sunzhuli/p/9696735.html
