
042 Example 10: Text Word Frequency Statistics

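1. Problem Analysis of "Text Word Frequency Statistics"
- 1.1 Problem analysis
2. Worked Example: "Hamlet English Word Frequency Statistics"
3. Worked Example: "Character Appearance Statistics in Romance of the Three Kingdoms" (Part 1)
4. Worked Example: "Character Appearance Statistics in Romance of the Three Kingdoms" (Part 2)
- 4.1 Character appearance statistics in Romance of the Three Kingdoms
5. Extending "Text Word Frequency Statistics" to Other Cases
- 5.1 Extending the application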

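1. Problem Analysis of "Text Word Frequency Statistics"

1.1 Problem analysis

Text word frequency statistics

Requirement: given an article, which words appear in it, and which words appear most often?
How should we go about it?

English text --> Chinese text

English text: analyze word frequencies in Hamlet

If you would like the Hamlet text, you can add me on WeChat: nickchen121

Chinese text: analyze the characters in Romance of the Three Kingdoms (《三国演义》)

If you would like the Romance of the Three Kingdoms text, you can add me on WeChat: nickchen121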

2. Worked Example: "Hamlet English Word Frequency Statistics"


Denoise and normalize the text
Use a dictionary to represent word frequencies

# CalHamletV1.py


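def getText():
    # Read the whole file; the with-block closes it afterwards
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()  # normalize case so "The" and "the" count as one word
    # Replace punctuation with spaces so that words split cleanly.
    # The backslash is escaped, and the backtick replaces a curly quote
    # that had crept into the character list.
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")
    return txt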


hamletTxt = getText()
words = hamletTxt.split()  # split on whitespace
counts = {}
for word in words:
    # The dict maps each word to the number of times it has been seen
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

the         948
and         855
to          650
of          581
you         494
a           468
my          447
i           443
in          373
hamlet      361

The results are sorted from most to least frequent
Observe how many times each word occurs
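
The same denoise, normalize, and count steps can also be written with the standard library's `string.punctuation`, `str.translate`, and `collections.Counter`. The following is only a minimal sketch of that alternative, not the course's own code: the helper names `normalize` and `top_words` and the sample sentence are assumptions made for illustration. Note that `string.punctuation` also contains the apostrophe, which the listing above leaves alone, so counts on a real text may differ slightly.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase the text and turn every ASCII punctuation mark into a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)


def top_words(text, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(normalize(text).split()).most_common(n)


# Hypothetical sample sentence standing in for the contents of hamlet.txt
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

`Counter.most_common` does the sorting that the listing performs by hand with `items.sort(...)`, and it simply returns fewer pairs when the text contains fewer than `n` distinct words.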

3. Worked Example: "Character Appearance Statistics in Romance of the Three Kingdoms" (Part 1)


Segment the Chinese text into words
Use a dictionary to represent word frequencies

# CalThreeKingdomsV1.py

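import jieba

# Read the whole novel; the with-block closes the file afterwards
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the Chinese text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens such as punctuation
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))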

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/mh/krrg51957cqgl0rhgnwyylvc0000gn/T/jieba.cache
Loading model cost 1.030 seconds.
Prefix dict has been built succesfully.


曹操          953
孔明          836
将军          772
却说          656
玄德          585
关公          510
丞相          491
二人          469
不可          440
荆州          425
玄德曰         390
孔明曰         390
不能          384
如此          378
张飞          358

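4. Worked Example: "Character Appearance Statistics in Romance of the Three Kingdoms" (Part 2)

4.1 Character appearance statistics in Romance of the Three Kingdoms

Tie the word frequencies to characters, so that the program answers the actual question

Word frequency statistics --> character statistics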

# CalThreeKingdomsV2.py
import jieba

# Read the whole novel; the with-block closes the file afterwards
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Frequent words from the V1 output that are not character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Merge aliases and "<name>曰" ("<name> said") tokens into a single name
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # unlike del, no KeyError if the word never occurred
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

曹操         1451
孔明         1383
刘备         1252
关羽          784
张飞          358
商议          344
如何          338
主公          331
军士          317
吕布          300

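Segment the Chinese text into words
Use a dictionary to represent word frequencies
Extend the program to solve the problem
Refine further based on the results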
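
As the alias and exclusion lists grow with each round of refinement, the `elif` chain in CalThreeKingdomsV2.py gets long. One possible way to keep it manageable is to hold the aliases in a dictionary. The sketch below is an illustration under stated assumptions, not the course's code: the names `ALIASES`, `EXCLUDES`, and `count_characters` are hypothetical, and the short token list stands in for the output of `jieba.lcut(txt)`. Excluded words are skipped while counting instead of being deleted afterwards, which gives the same result here because no excluded word is also an alias.

```python
from collections import Counter

# Alias table copied from the elif chain in CalThreeKingdomsV2.py
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
# Exclusion set copied from CalThreeKingdomsV2.py
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}


def count_characters(words, aliases=ALIASES, excludes=EXCLUDES):
    """Count multi-character tokens, merging aliases and skipping excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1 or word in excludes:
            continue
        counts[aliases.get(word, word)] += 1
    return counts


# Hypothetical token list standing in for jieba.lcut(txt)
tokens = ["却说", "玄德", "曰", "孔明曰", "诸葛亮", "丞相", "曹操", "玄德曰", "将军"]
print(count_characters(tokens).most_common(3))
```

Adding another alias or another word to exclude then means editing a data structure, not the counting loop.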

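Proudly presenting the top 20 characters of Romance of the Three Kingdoms, ranked by number of appearances: Cao Cao (曹操), Kongming (孔明), Liu Bei (刘备), Guan Yu (关羽), Zhang Fei (张飞), Lü Bu (吕布), Zhao Yun (赵云), Sun Quan (孙权), Sima Yi (司马懿), Zhou Yu (周瑜), Yuan Shao (袁绍), Ma Chao (马超), Wei Yan (魏延), Huang Zhong (黄忠), Jiang Wei (姜维), Ma Dai (马岱), Pang De (庞德), Meng Huo (孟获), Liu Biao (刘表), Xiahou Dun (夏侯惇)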

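5. Extending "Text Word Frequency Statistics" to Other Cases

5.1 Extending the application

Dream of the Red Chamber (《红楼梦》), Journey to the West (《西游记》), Water Margin (《水浒传》) ...
Government work reports, research papers, news reports ...
And beyond that? Word clouds are still to come ...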
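
To reuse the approach on other texts, the counting step can be separated from the tokenizer. The following is a minimal sketch under assumptions: the function name `word_frequencies`, its parameters, and the sample sentence are hypothetical and only illustrate the idea.

```python
from collections import Counter


def word_frequencies(text, tokenize=str.split, min_len=1, excludes=frozenset()):
    """Count the tokens produced by `tokenize`, skipping short or excluded ones."""
    counts = Counter()
    for token in tokenize(text):
        if len(token) >= min_len and token not in excludes:
            counts[token] += 1
    return counts


# Hypothetical English sample; for Chinese text one could pass
# tokenize=jieba.lcut and min_len=2, as CalThreeKingdomsV1.py does.
report = "growth and jobs and growth and reform"
freq = word_frequencies(report, excludes={"and"})
print(freq.most_common(2))
```

Only the tokenizer and the exclusion set change between an English play, a Chinese novel, and a news report; the counting loop stays the same.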


Original article: https://www.cnblogs.com/nickchen121/p/11200520.html