  • Python: Word Frequency Statistics for a Chinese Text File + Chinese Word Cloud

    1. Word frequency statistics:

        import jieba

        # Read the whole novel as UTF-8 text
        with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
            txt = f.read()

        words = jieba.lcut(txt)  # segment the text into a list of words
        counts = {}
        for word in words:
            if len(word) == 1:
                continue  # skip single characters and punctuation
            counts[word] = counts.get(word, 0) + 1

        # Sort by count, highest first, and print the top 15
        items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
        for word, count in items[:15]:
            print("{0:<10}{1:>5}".format(word, count))

    Output:

    曹操 946
    孔明 737
    将军 622
    玄德 585
    却说 534
    关公 509
    荆州 413
    二人 410
    丞相 405
    玄德曰 390
    不可 387
    孔明曰 374
    张飞 358
    如此 320
    不能 318
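
    The list above mixes character names with ordinary words such as 却说 and 二人. As a rough sketch (not part of the original code), the same count can be written with collections.Counter plus an exclusion set. The function name top_words, its excludes parameter, and the hand-made sample list are assumptions for illustration; in practice the tokens would come from jieba.lcut(txt).

```python
from collections import Counter


def top_words(tokens, n=15, excludes=frozenset()):
    """Return the n most frequent multi-character tokens as (word, count) pairs."""
    counts = Counter(
        tok for tok in tokens
        if len(tok) > 1 and tok not in excludes  # drop single characters and excluded words
    )
    return counts.most_common(n)


# Hand-made token list standing in for the output of jieba.lcut(txt)
sample = ["曹操", "曰", "孔明", "曹操", "却说", "孔明", "曹操", ","]
for word, count in top_words(sample, n=2, excludes={"却说"}):
    print("{0:<10}{1:>5}".format(word, count))
```

    With the real text, the call would look like top_words(jieba.lcut(txt), 15, excludes={"却说", "二人"}), where the exclusion set is whatever words you decide are noise.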

    As a refinement, I only want to count how often each character appears. The code is as follows:

        import jieba

        with open("threekingdoms3.txt", "r", encoding="utf-8") as f:
            txt = f.read()

        # Characters whose appearances should be reported
        names = {'曹操','孔明','刘备','关羽','张飞','吕布','赵云','孙权','周瑜','袁绍','黄忠','魏延'}

        words = jieba.lcut(txt)
        counts = {}
        for word in words:
            if len(word) == 1:
                continue  # skip single characters and punctuation
            # Merge alternate names and titles into one canonical name
            elif word in ("诸葛亮", "孔明曰"):
                rword = "孔明"
            elif word in ("关公", "云长"):
                rword = "关羽"
            elif word in ("玄德", "玄德曰"):
                rword = "刘备"
            elif word in ("孟德", "丞相"):
                rword = "曹操"
            else:
                rword = word
            counts[rword] = counts.get(rword, 0) + 1

        items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
        # Look at the 40 most frequent words and print only the character names
        for word, count in items[:40]:
            if word in names:
                print("{0:<10}{1:>5}".format(word, count))

    Output:

    曹操 1358
    孔明 1265
    刘备 1251
    关羽 783
    张飞 358
    吕布 300
    赵云 278
    孙权 257
    周瑜 217
    袁绍 191
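
    Only ten of the twelve names are printed because the loop inspects just the 40 most frequent words, so names ranked lower never show up. One possible variation, sketched here under my own assumptions, keeps the alias pairs from the listing above in a lookup table and counts only tokens that resolve to a wanted name. The names ALIASES and count_names and the sample token list are hypothetical, not from the original post.

```python
from collections import Counter

# Alias table: alternate names and titles mapped to one canonical name.
# These are the same pairs merged in the listing above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_names(tokens, names, aliases=ALIASES):
    """Count appearances of each name in `names` after merging aliases."""
    counts = Counter()
    for tok in tokens:
        canonical = aliases.get(tok, tok)  # fall back to the token itself
        if canonical in names:
            counts[canonical] += 1
    return counts.most_common()  # highest count first


# Hand-made token list standing in for the output of jieba.lcut(txt)
sample = ["玄德", "曰", "丞相", "刘备", "孟德", "玄德曰", "将军", "张飞"]
print(count_names(sample, {"曹操", "刘备", "张飞", "赵云"}))
```

    Unlike the listing above, this reports every wanted name that occurs at all, regardless of its overall rank, so the output for the full novel would not be limited to the ten names shown.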

    Going one step further, generate a word cloud image:

        import os

        import jieba
        import wordcloud


        def getText(file):
            """Read a UTF-8 text file and return its words joined by spaces."""
            with open(file, "r", encoding="utf-8") as f:
                txt = f.read()
            # WordCloud does not segment Chinese itself, so segment with
            # jieba and rejoin the words with spaces.
            return " ".join(jieba.lcut(txt))


        filename = input()  # file name without the .txt extension
        txt = getText(filename + ".txt")
        wordclouds = wordcloud.WordCloud(width=1000, height=800, margin=2).generate(txt)
        wordclouds.to_file("{}.png".format(filename))

        # Open the image with the default viewer (this form works on Windows)
        os.system("{}.png".format(filename))

    The character names can be consolidated further before building the cloud; see the second code listing.

    By default the wordcloud library renders Chinese text as garbled characters. For a fix, see https://blog.csdn.net/Dick633/article/details/80261233
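
    The usual cause of the garbling is that the default font has no Chinese glyphs, and the usual remedy is to pass a font that does via WordCloud's font_path parameter. A minimal sketch follows; the candidate font paths are assumptions about common installations and may not exist on your machine, and the helper names find_font and make_cloud are hypothetical.

```python
import os

# Candidate font files that contain Chinese glyphs. These paths are
# assumptions about common installs; adjust them for your machine.
CANDIDATE_FONTS = [
    r"C:\Windows\Fonts\msyh.ttc",          # Microsoft YaHei on Windows
    r"C:\Windows\Fonts\simhei.ttf",        # SimHei on Windows
    "/System/Library/Fonts/PingFang.ttc",  # macOS
]


def find_font(candidates=CANDIDATE_FONTS):
    """Return the first candidate path that exists, or None if none do."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None


def make_cloud(text, out_png, font_path):
    """Render a word cloud from space-separated text with the given font."""
    import wordcloud  # imported here so find_font() works without the library

    wc = wordcloud.WordCloud(font_path=font_path, width=1000, height=800, margin=2)
    wc.generate(text)
    wc.to_file(out_png)
```

    A call would then look like make_cloud(txt, "out.png", find_font()), where txt is the space-joined jieba output. If find_font() returns None, supply the path of a Chinese-capable font by hand.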

    Reference: https://blog.csdn.net/weixin_44521703/article/details/93058003

  • Original post: https://www.cnblogs.com/116970u/p/11611821.html