  • Chinese Word Frequency Statistics and Word Cloud Generation

    Assignment source: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/2822

    Tasks:

    1. Download a full-length Chinese novel

    2. Read the text to be analyzed from a file

    novel = open(r'E:\三体.txt', 'r', encoding='utf8').read()

    3. Install and use jieba for Chinese word segmentation

    pip install jieba

    import jieba

    jieba.lcut(text)
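
    Before segmenting the whole novel, it can help to try jieba on a single sentence. A minimal sketch (the sample sentence is arbitrary, not taken from the novel):

    import jieba

    # Precise mode returns a list of tokens; full mode lists every word jieba can find
    sample = "三体人即将入侵地球"
    print(jieba.lcut(sample))
    print(jieba.lcut(sample, cut_all=True))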

    4. Update the dictionary: add the domain-specific terms of the text being analyzed

    jieba.load_userdict(word_dict)  # word_dict is a plain-text dictionary file

    jieba.load_userdict(r'E:\三体词库.txt')

    Reference dictionaries can be downloaded from: https://pinyin.sogou.com/dict/

    Conversion script: scel_to_text
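
    jieba.load_userdict expects a plain-text file with one entry per line, in the form "word [frequency] [POS tag]", where frequency and POS tag are optional. A minimal sketch of building and loading such a file; the entries and the file name are illustrative, not the contents of the author's actual 三体词库.txt:

    import jieba

    # Hypothetical user-dictionary entries: character names and terms from the novel
    entries = "罗辑 10 nr\n程心 10 nr\n面壁者 5 n\n智子 5 n\n"
    with open('userdict_demo.txt', 'w', encoding='utf8') as f:
        f.write(entries)

    jieba.load_userdict('userdict_demo.txt')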

    5. Generate word frequency counts and sort them

    wordSet = set(tokens)

    wordDict = {}
    for w in wordSet:
        if len(w) > 1:
            wordDict[w] = tokens.count(w)
    wordList = list(wordDict.items())
    wordList.sort(key=lambda x: x[1], reverse=True)
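
    The loop above calls tokens.count(w) once per distinct word, so it rescans the whole token list each time. collections.Counter builds the same ranking in a single pass; a sketch using the same tokens list:

    from collections import Counter

    # One pass over tokens, keeping only words longer than one character
    wordCount = Counter(t for t in tokens if len(t) > 1)
    wordList = wordCount.most_common()   # (word, count) pairs sorted by count, descending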

    6. Remove grammatical words: stopwords such as pronouns, articles, and conjunctions

    Load a stopword list into stops, then keep only the segmented tokens that are not in it:

    with open(r'E:\stops_chinese.txt', 'r', encoding='utf8') as f:
        stops = f.read().split('\n')
    tokens = [token for token in cutText if token not in stops]
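
    splitlines() gives the same result as split('\n') but also copes with Windows-style \r\n line endings, and a set makes the membership test constant-time; a variant sketch, assuming the same stops_chinese.txt file:

    with open(r'E:\stops_chinese.txt', 'r', encoding='utf8') as f:
        stops = set(f.read().splitlines())
    # Also drop pure-whitespace tokens left over from the segmentation
    tokens = [t for t in cutText if t not in stops and t.strip()]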

    7. Print the TOP 25 most frequent words and save the results to a file

    for i in range(25):
        print(wordList[i])
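
    The step also asks for the results to be saved to a file. The complete code below writes the full frequency table with pandas; the top 25 alone can be written the same way (the output path and column names here are just an example):

    import pandas as pd

    # Keep only the 25 most frequent words
    pd.DataFrame(wordList[:25], columns=['word', 'count']).to_csv(
        r'E:\三体TOP25.csv', index=False, encoding='utf8')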

    8. Generate the word cloud

    Install wordcloud:

    pip install wordcloud

    Related code:

    pd.DataFrame(data=wordList).to_csv(r'E:\三体词频统计.csv', encoding='utf8')
    cut_text = " ".join(tokens)   # join with spaces so WordCloud can split the tokens back apart
    im = imread(r'E:\tree.jpg')   # mask image that gives the cloud its shape
    mywc = WordCloud(background_color='white', mask=im, margin=2).generate(cut_text)
    plt.imshow(mywc)
    plt.axis("off")
    plt.show()
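
    One caveat: the font bundled with wordcloud cannot draw CJK characters, so a Chinese word cloud usually needs font_path pointing at a Chinese font, otherwise the words render as empty boxes. The path below is an assumption (SimHei as shipped with Chinese Windows); point it at whatever font exists on the machine:

    # font_path is an assumed location for a Chinese-capable font
    mywc = WordCloud(font_path=r'C:\Windows\Fonts\simhei.ttf',
                     background_color='white', mask=im, margin=2).generate(cut_text)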

    9. Complete code:

    import jieba
    import pandas as pd
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    from scipy.misc import imread

    # Read the novel and segment it with the custom dictionary loaded
    novel = open(r'E:\三体.txt', 'r', encoding='utf8').read()
    jieba.load_userdict(r'E:\三体词库.txt')
    cutText = jieba.lcut(novel)

    # Drop stopwords
    with open(r'E:\stops_chinese.txt', 'r', encoding='utf8') as f:
        stops = f.read().split('\n')
    tokens = [token for token in cutText if token not in stops]
    wordSet = set(tokens)

    # Count each distinct word longer than one character and sort by frequency
    wordDict = {}
    for w in wordSet:
        if len(w) > 1:
            wordDict[w] = tokens.count(w)
    wordList = list(wordDict.items())
    wordList.sort(key=lambda x: x[1], reverse=True)

    # Print the top 25
    for i in range(25):
        print(wordList[i])

    # Save the full frequency table and draw the word cloud
    pd.DataFrame(data=wordList).to_csv(r'E:\三体词频统计.csv', encoding='utf8')
    cut_text = " ".join(tokens)   # space-separated so WordCloud can tokenize
    im = imread(r'E:\tree.jpg')
    mywc = WordCloud(background_color='white', mask=im, margin=2).generate(cut_text)
    plt.imshow(mywc)
    plt.axis("off")
    plt.show()
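
    Note that scipy.misc.imread was deprecated and later removed from SciPy; if the import at the top fails, the mask image can be loaded with Pillow instead, for example:

    import numpy as np
    from PIL import Image

    # Drop-in replacement for scipy.misc.imread when loading the mask
    im = np.array(Image.open(r'E:\tree.jpg'))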

    Output:

  • Original post: https://www.cnblogs.com/Aliuyu/p/10595474.html