Download a long Chinese article.
Read the text to be analyzed from a file.
news = open('gzccnews.txt', 'r', encoding='utf-8').read()  # read() returns the text as a str; open() alone only returns a file object
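As a minimal sketch of the file-reading step, the same read can be wrapped in a context manager so the file is closed even if an error occurs. The helper name `read_text` is hypothetical and introduced only for illustration. UTF-8 is assumed because that is the encoding used for gzccnews.txt above.

```python
def read_text(path, encoding="utf-8"):
    """Return the whole file as one string.

    Hypothetical helper for illustration; the `with` block closes
    the file automatically when the read finishes or fails.
    """
    with open(path, "r", encoding=encoding) as f:
        return f.read()

# Example call (assumes the file exists in the working directory):
# news = read_text("gzccnews.txt")
```

The returned value is a plain string, which is the input type that jieba's cut functions expect.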
Install jieba and use it for Chinese word segmentation.
pip install jieba
import jieba
words = jieba.lcut(news)  # lcut already returns a list, so no list() wrapper is needed
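Generate word-frequency statistics.
Sort by frequency.
Exclude grammatical words such as pronouns, articles, and conjunctions.
Output the top 20 words by frequency.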
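The four steps above can be sketched on an already-segmented token list with `collections.Counter`, which counts every token in a single pass. The names `STOPWORDS` and `top_words` and the sample tokens are hypothetical and serve only as illustration. A real stop list would need to be much longer.

```python
from collections import Counter

# Hypothetical stop list for illustration; extend it for real text.
STOPWORDS = {"的", "了", "和", "是", "我们", "他们", "但是", "因为"}

def top_words(tokens, n=20, stopwords=STOPWORDS):
    """Count tokens, skipping stopwords and single-character tokens,
    and return the n most frequent (word, count) pairs."""
    counts = Counter(
        t for t in tokens
        if len(t) > 1 and t not in stopwords
    )
    return counts.most_common(n)

# Sample tokens standing in for the output of jieba.lcut (an assumption).
tokens = ["大数据", "的", "发展", "和", "大数据", "应用", "我们", "发展", "大数据"]
for word, freq in top_words(tokens, n=20):
    print(word, freq)
```

`most_common` returns the pairs already sorted by descending count, so no separate sort is needed. If fewer than `n` distinct words exist, it simply returns all of them.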
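import jieba

# Read the whole article as a string (UTF-8 assumed, as for gzccnews.txt)
article = open('test.txt', 'r', encoding='utf-8').read()
# Punctuation and function words to exclude from the statistics
dele = {'。','!','?','的','“','”','(',')',' ','》','《',','}
# Register a custom word so it is kept as a single token
jieba.add_word('大数据')
words = list(jieba.cut(article))

articleDict = {}
articleSet = set(words) - dele
for w in articleSet:
    if len(w) > 1:  # skip single-character tokens
        articleDict[w] = words.count(w)

# Sort by frequency, highest first
articlelist = sorted(articleDict.items(), key=lambda x: x[1], reverse=True)
# Print the top 20; slicing avoids an IndexError when there are fewer words
for item in articlelist[:20]:
    print(item)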
Screenshot of the run: