zoukankan      html  css  js  c++  java
  • Python 中文文件统计词频 + 中文词云

    1. 词频统计:

     1 import jieba
     2 txt = open("threekingdoms3.txt", "r", encoding='utf-8').read()
     3 words  = jieba.lcut(txt)
     4 counts = {}
     5 for word in words:
     6     if len(word) == 1:
     7         continue
     8     else:
     9         counts[word] = counts.get(word,0) + 1
    10 items = list(counts.items())
    11 items.sort(key=lambda x:x[1], reverse=True)
    12 for i in range(15):
    13     word, count = items[i]
    14     print ("{0:<10}{1:>5}".format(word, count))

    结果是:

    曹操 946
    孔明 737
    将军 622
    玄德 585
    却说 534
    关公 509
    荆州 413
    二人 410
    丞相 405
    玄德曰 390
    不可 387
    孔明曰 374
    张飞 358
    如此 320
    不能 318

    进一步改进, 我想只知道人物出场统计,代码如下:

     1 import jieba
     2 txt = open("threekingdoms3.txt", "r", encoding='utf-8').read()
     3 names = {'曹操','孔明','刘备','关羽','张飞','吕布','赵云','孙权','周瑜','袁绍','黄忠','魏延'}
     4 words  = jieba.lcut(txt)
     5 counts = {}
     6 for word in words:
     7     if len(word) == 1:
     8         continue
     9     elif word == "诸葛亮" or word == "孔明曰":
    10         rword = "孔明"
    11     elif word == "关公" or word == "云长":
    12         rword = "关羽"
    13     elif word == "玄德" or word == "玄德曰":
    14         rword = "刘备"
    15     elif word == "孟德" or word == "丞相":
    16         rword = "曹操"
    17     else:
    18         rword = word
    19     counts[rword] = counts.get(rword,0) + 1
    20 # for word in excludes:
    21 #     del counts[word]
    22 items = list(counts.items())
    23 items.sort(key=lambda x:x[1], reverse=True)
    24 for i in range(40):
    25     word, count = items[i]
    26     if word in names:
    27         print ("{0:<10}{1:>5}".format(word, count))

    运行结果为:

    曹操 1358
    孔明 1265
    刘备 1251
    关羽 783
    张飞 358
    吕布 300
    赵云 278
    孙权 257
    周瑜 217
    袁绍 191

    进一步的做词云图:

     1 import jieba
     2 import os
     3 import wordcloud
     4  
     5 def getText(file):
     6     with open(file, 'r', encoding= 'UTF-8') as txt:
     7         txt = txt.read()
     8         jieba.lcut(txt)
     9     return txt
    10  
    11  
    12 directoryname =  os.getcwd()
    13 filename = input()
    14 txt = getText(filename + '.txt')
    15 wordclouds = wordcloud.WordCloud(width=1000, height= 800, margin=2).generate(txt)
    16 wordclouds.to_file('{}.png'.format(filename))
    17  
    18 os.system('{}.png'.format(filename))

    名称是可以进一步优化的,参见第二部分代码。

    中文wordcloud库默认会出现乱码,解决方法参考 https://blog.csdn.net/Dick633/article/details/80261233

    参考:https://blog.csdn.net/weixin_44521703/article/details/93058003

  • 相关阅读:
    mysql语句-DDL语句
    Web框架本质
    HTTP协议那些事儿(Web开发补充知识点)
    利用random模块生成验证码
    前端小练习
    常用模块collections
    强大的图片展示插件,JQuery图片预览展示插件
    笔记本电脑清除BIOS密码
    js中的new Option默认选中
    使用PHPMailer发送邮件
  • 原文地址:https://www.cnblogs.com/116970u/p/11611821.html
Copyright © 2011-2022 走看看