Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/2822
Tasks:
1. Download a full-length Chinese novel.
2. Read the text to be analyzed from a file.
with open(r'E:\三体.txt', 'r', encoding='utf8') as f:
    novel = f.read()
3. Install jieba and use it for Chinese word segmentation.
pip install jieba
import jieba
jieba.lcut(text)
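4. Update the dictionary with vocabulary specific to the work being analyzed.
jieba.load_userdict(word_dict)  # word_dict: path to the dictionary text file
jieba.load_userdict(r'E:\三体词库.txt')
Reference dictionaries can be downloaded from: https://pinyin.sogou.com/dict/
Sogou dictionaries come in .scel format and must be converted to plain text first. Conversion code: scel_to_text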
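For reference, jieba expects the user dictionary to be a UTF-8 text file with one entry per line: the word, then an optional frequency and an optional part-of-speech tag, separated by spaces. The sketch below builds such lines; the helper name format_userdict_line and the sample entries are hypothetical and only illustrate the layout.

```python
def format_userdict_line(word, freq=None, tag=None):
    """Build one dictionary line: the word, then optional frequency and tag."""
    parts = [word]
    if freq is not None:
        parts.append(str(freq))
    if tag is not None:
        parts.append(tag)
    return " ".join(parts)

# Hypothetical sample entries; real ones would come from the converted dictionary.
entries = [("三体人", 50, "n"), ("面壁者", 30, "n"), ("智子", None, None)]
userdict_text = "\n".join(format_userdict_line(*e) for e in entries)
print(userdict_text)
```

Saving userdict_text to a file with encoding='utf8' would give a file that can be passed to jieba.load_userdict.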
5. Generate word frequency counts and sort them.
wordSet = set(tokens)

# Count every distinct token longer than one character
wordDict = {}
for w in wordSet:
    if len(w) > 1:
        wordDict[w] = tokens.count(w)
# Sort the (word, count) pairs by count, highest first
wordList = list(wordDict.items())
wordList.sort(key=lambda x: x[1], reverse=True)
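Calling tokens.count(w) once per distinct word rescans the whole token list each time, which gets slow on a full novel. A possible alternative, shown here only as a sketch, is collections.Counter from the standard library, which counts in a single pass. The function name count_words and the sample tokens are hypothetical stand-ins for the real jieba.lcut output.

```python
from collections import Counter

def count_words(tokens, min_len=2):
    """Count tokens of at least min_len characters in one pass."""
    return Counter(t for t in tokens if len(t) >= min_len)

# Hypothetical sample tokens standing in for the output of jieba.lcut().
sample_tokens = ["三体", "的", "文明", "三体", "地球", "文明", "三体"]
word_counts = count_words(sample_tokens)

# most_common() returns (word, count) pairs sorted by count, highest first.
word_list = word_counts.most_common()
print(word_list)
```

The resulting word_list has the same shape as wordList above, so the later steps would work with it unchanged.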
6. Exclude grammatical words (pronouns, articles, conjunctions, and other stop words).
stops
tokens=[token for token in wordsls if token not in stops]
with open(r'E:\stops_chinese.txt', 'r', encoding='utf8') as f:
    # split() with no argument handles stop words separated by spaces or newlines
    stops = set(f.read().split())
tokens = [token for token in cutText if token not in stops]
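The filtering step can be tried out without any files. The following is a minimal sketch under the assumption that the stop-word file is whitespace-separated; the helper names and sample data are hypothetical. It stores the stop words in a set so that each membership test is fast, and it also drops tokens that consist only of whitespace.

```python
def parse_stop_words(text):
    """Split the stop-word file contents on any whitespace and return a set."""
    return set(text.split())

def remove_stop_words(tokens, stops):
    """Keep tokens that are neither stop words nor pure whitespace."""
    return [t for t in tokens if t not in stops and t.strip()]

# Hypothetical file contents and tokens for illustration.
stop_text = "的\n了 是\n"
sample_tokens = ["三体", "的", "文明", "是", "\n", "地球"]

stops = parse_stop_words(stop_text)
filtered = remove_stop_words(sample_tokens, stops)
print(filtered)
```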
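7. Output the 25 most frequent words and save the results to a file.
# Slicing avoids an IndexError if fewer than 25 words were found
for item in wordList[:25]:
    print(item)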
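The loop above only prints the top 25; the task also asks for the result to be stored in a file. One way to do that with just the standard library is sketched below. The function name save_top_words, the column headers, and the temporary output path are assumptions made for this example.

```python
import csv
import os
import tempfile

def save_top_words(word_list, path, top_n=25):
    """Write the first top_n (word, count) pairs to a CSV file and return them."""
    top = word_list[:top_n]
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(top)
    return top

# Hypothetical sorted list standing in for wordList.
sample = [("三体", 3), ("文明", 2), ("地球", 1)]
out_path = os.path.join(tempfile.gettempdir(), "top_words_demo.csv")
saved = save_top_words(sample, out_path, top_n=2)
print(saved)
```

With the real data, the call would be save_top_words(wordList, some_path) with the default top_n=25.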
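8. Generate the word cloud.
Install wordcloud: pip install wordcloud
Relevant code:
pd.DataFrame(data=wordList).to_csv(r'E:\三体词频统计.csv', encoding='utf8')
# WordCloud tokenizes the input text itself, so the tokens must be separated by spaces
cut_text = " ".join(tokens)
# Mask image; plt.imread replaces scipy.misc.imread, which was removed from SciPy
im = plt.imread(r'E:\tree.jpg')
# Note: rendering Chinese glyphs generally requires passing font_path=<a CJK font file>
mywc = WordCloud(background_color='white', mask=im, margin=2).generate(cut_text)
plt.imshow(mywc)
plt.axis("off")
plt.show()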
9. Complete code:
import jieba
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Read the novel and segment it with the custom dictionary loaded
with open(r'E:\三体.txt', 'r', encoding='utf8') as f:
    novel = f.read()
jieba.load_userdict(r'E:\三体词库.txt')
cutText = jieba.lcut(novel)

# Remove stop words (split() handles both space- and newline-separated files)
with open(r'E:\stops_chinese.txt', 'r', encoding='utf8') as f:
    stops = set(f.read().split())
tokens = [token for token in cutText if token not in stops]
wordSet = set(tokens)

# Count words longer than one character and sort by frequency
wordDict = {}
for w in wordSet:
    if len(w) > 1:
        wordDict[w] = tokens.count(w)
wordList = list(wordDict.items())
wordList.sort(key=lambda x: x[1], reverse=True)

# Print the top 25
for item in wordList[:25]:
    print(item)

# Save the counts and draw the word cloud
pd.DataFrame(data=wordList).to_csv(r'E:\三体词频统计.csv', encoding='utf8')
cut_text = " ".join(tokens)
im = plt.imread(r'E:\tree.jpg')
mywc = WordCloud(background_color='white', mask=im, margin=2).generate(cut_text)
plt.imshow(mywc)
plt.axis("off")
plt.show()
Results: