Comprehensive Exercise: Word Frequency Statistics


Preprocessing for word frequency statistics

Download the lyrics of an English song or an English article.

Replace all separators such as , . ? ! ' : with spaces
sep=''',.?!'":'''
for a in sep:
    news = news.replace(a,' ')

print(news)
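The same replacement can also be done in one pass with `str.translate`. This is a minimal sketch, not the original author's method; the `news` string is a made-up sample standing in for the downloaded text.

```python
# Made-up sample text standing in for the downloaded lyrics (an assumption).
news = 'Hello, world! "Is this" a test? Yes: it is.'

sep = ''',.?!'":'''

# Build a table that maps every separator character to a single space.
table = str.maketrans(sep, ' ' * len(sep))
cleaned = news.translate(table)

print(cleaned)
```

Because each separator becomes exactly one space, the cleaned string has the same length as the input, and a later `split()` absorbs the runs of spaces.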

Convert all uppercase letters to lowercase

sep=''',.?!'":'''
news = news.lower()  # lowercase once, before replacing separators
for a in sep:
    news = news.replace(a,' ')

print(news)

Build the word list

sep=''',.?!'":'''
for a in sep:
    news = news.replace(a,' ')
wordList=news.lower().split()
for w in wordList:
    print(w)
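As an alternative to replacing separators one by one, the word list can be extracted with a regular expression. This sketch is an assumption-laden variant rather than the approach used above: it keeps apostrophes inside words (so "it's" stays one token), whereas the code above turns every apostrophe into a space. The sample sentence is made up.

```python
import re

# Made-up sample text (an assumption for illustration).
news = "It's a test, isn't it? Yes: it's a TEST!"

# Match runs of letters, optionally joined by apostrophes inside a word.
wordList = re.findall(r"[a-z]+(?:'[a-z]+)*", news.lower())

for w in wordList:
    print(w)
```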

Generate the word frequency counts
```
sep=''',.?!'":'''
# news holds the text loaded earlier
for a in sep:
    news = news.replace(a,' ')
wordList=news.lower().split()

# Count how often each distinct word occurs in the list
wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)

for w in wordDict:
    print(w, wordDict[w])
```
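Calling `wordList.count(w)` for every distinct word rescans the whole list each time. A minimal sketch of the same count done in a single pass with `collections.Counter` from the standard library; the sample sentence is made up for illustration.

```python
from collections import Counter

# Made-up sample word list (an assumption for illustration).
wordList = 'the cat and the hat and the bat'.split()

# Counter tallies every word in one pass over the list.
wordDict = Counter(wordList)

for w, n in wordDict.items():
    print(w, n)
```

`Counter` is a `dict` subclass, so the later steps that call `wordDict.items()` keep working unchanged.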
Sorting
```
sep=''',.?!'":'''
# news holds the text loaded earlier
for a in sep:
    news = news.replace(a,' ')
wordList=news.lower().split()

wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)

# Sort (word, count) pairs by count, highest first
dictList = list(wordDict.items())
dictList.sort(key=lambda x:x[1],reverse=True)
print(dictList)
```
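Because the dictionary is filled by iterating over a set, words with equal counts can come out in a different order from one run to the next. A small sketch of a sort key that breaks ties alphabetically so the output is reproducible; the counts in `wordDict` are made-up sample values.

```python
# Made-up sample counts (an assumption for illustration).
wordDict = {'love': 3, 'you': 5, 'baby': 3, 'oh': 1}

# Sort by count descending, then by word ascending to break ties.
dictList = sorted(wordDict.items(), key=lambda x: (-x[1], x[0]))

print(dictList)
```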
Exclude grammatical words: pronouns, articles, conjunctions
```
exclude = {'the','and','of','to'}

sep=''',.?!'":'''
# news holds the text loaded earlier
for a in sep:
    news = news.replace(a,' ')
wordList=news.lower().split()

# Drop the excluded words before counting
wordDict = {}
wordSet = set(wordList)-exclude
for w in wordSet:
    wordDict[w] = wordList.count(w)

dictList = list(wordDict.items())
dictList.sort(key=lambda x:x[1],reverse=True)
print(dictList)
```
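The exclusion can also be applied while building the word list, so the excluded words never reach the counting step. A minimal sketch under the same `exclude` set; the sample sentence and the name `filtered` are introduced here for illustration only.

```python
exclude = {'the', 'and', 'of', 'to'}

# Made-up sample word list (an assumption for illustration).
wordList = 'the king of the hill and the king of the road'.split()

# Keep only the words that are not in the exclusion set.
filtered = [w for w in wordList if w not in exclude]

# Tally the remaining words in one pass.
wordDict = {}
for w in filtered:
    wordDict[w] = wordDict.get(w, 0) + 1

print(wordDict)
```

Both approaches end up counting the same set of words: `set(filtered)` equals `set(wordList) - exclude`.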
Output the top 20 most frequent words
```
sep=''',.?!'":'''
# news holds the text loaded earlier
for a in sep:
    news = news.replace(a,' ')
wordList=news.lower().split()

wordDict = {}
wordSet = set(wordList)
for w in wordSet:
    wordDict[w] = wordList.count(w)

dictList = list(wordDict.items())
dictList.sort(key=lambda x:x[1],reverse=True)

# Slicing is safe even when there are fewer than 20 distinct words
for item in dictList[:20]:
    print(item)
```
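If the counts are kept in a `collections.Counter`, sorting and truncating collapse into one call. A short sketch, assuming a made-up sample word list; `Counter.most_common(n)` returns at most `n` pairs, so short texts need no extra length check.

```python
from collections import Counter

# Made-up sample word list (an assumption for illustration).
wordList = 'a b a c b a'.split()

# most_common(20) returns up to 20 (word, count) pairs, highest count first.
top = Counter(wordList).most_common(20)

for word, n in top:
    print(word, n)
```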
Save the text to be analyzed as a UTF-8 encoded file, and load the content for word frequency analysis by reading that file.
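The save-and-read-back step might look like the sketch below. The file name `news.txt`, the temporary directory, and the sample text are all assumptions made so the example runs on its own; in practice the file would hold the downloaded lyrics or article.

```python
import os
import tempfile

# Made-up sample text (an assumption for illustration).
sample = 'Hello, world! Hello again.'

# Write the text to a UTF-8 file inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'news.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(sample)

# Read it back; the with-block closes the file automatically.
with open(path, 'r', encoding='utf-8') as f:
    news = f.read()

print(news)
```

Passing `encoding='utf-8'` explicitly on both sides avoids depending on the platform's default encoding.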

2. Chinese word frequency statistics

Download a long Chinese article.

Read the text to be analyzed from a file.

f = open('hongluomeng.txt','r', encoding='utf-8')

Install jieba and use it for Chinese word segmentation.

for i in g:
    text = text.replace(i, '')
b = jieba.lcut(text)  # lcut returns the tokens as a list
print(b)

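Generate the word frequency counts

Sorting

Exclude grammatical words: pronouns, articles, conjunctions

Output the top 20 most frequent words (or store the result in a file)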

import jieba

# Read the whole text as UTF-8
with open('hongluomeng.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Punctuation to strip before segmentation
g = '''，。‘’“”：；（）！？、'''
# Words to exclude from the statistics
a = {
    '的', ' ',
    '曰', '之', '不', '人', '一', '大', '马', '来', '有', '于', '下', '此',
}
for i in g:
    text = text.replace(i, '')
b = jieba.lcut(text)  # lcut returns the tokens as a list
print(b)

# Count each distinct token in the segmented list, not as a substring of the raw text
count = {}
q = list(set(b) - a)
print(q)

for w in q:
    count[w] = b.count(w)

r = list(count.items())
r.sort(key=lambda x: x[1], reverse=True)
print(r)

# Append the top 20 (or fewer) entries to the result file
with open('hlmCount.txt', 'a', encoding='utf-8') as f:
    for word, n in r[:20]:
        f.write(word + ':' + str(n) + ' ')
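The counting and output steps can be sketched without a segmenter installed by starting from a token list. The `tokens` below are hand-written to resemble what a segmenter such as `jieba.lcut` might return; they are not real jieba output. The `stopwords` set and the temporary output path are likewise assumptions for illustration.

```python
import os
import tempfile
from collections import Counter

# Hand-written tokens (an assumption, not real segmenter output).
tokens = ['宝玉', '笑', '道', '，', '黛玉', '也', '笑', '了', '。',
          '宝玉', '道', '：', '宝玉', '来', '了']

punctuation = set('，。‘’“”：；（）！？、')
stopwords = {'的', '了', '也', '来', '道'}  # hypothetical exclusion list

# Drop whitespace-only tokens, punctuation, and excluded words.
words = [t for t in tokens
         if t.strip() and t not in punctuation and t not in stopwords]

# Count tokens and keep up to 20 of the most frequent.
top = Counter(words).most_common(20)

# Write one "word:count" pair per line to a UTF-8 file.
path = os.path.join(tempfile.mkdtemp(), 'hlmCount.txt')
with open(path, 'w', encoding='utf-8') as f:
    for word, n in top:
        f.write(f'{word}:{n}\n')

print(top)
```

Counting tokens rather than substrings keeps a short word from being credited with every longer word that happens to contain it.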

Original article: https://www.cnblogs.com/Brilliance-pan/p/8666659.html