zoukankan      html  css  js  c++  java
  • Python 中文文件统计词频 + 中文词云

    1. 词频统计:

     1 import jieba
     2 txt = open("threekingdoms3.txt", "r", encoding='utf-8').read()
     3 words  = jieba.lcut(txt)
     4 counts = {}
     5 for word in words:
     6     if len(word) == 1:
     7         continue
     8     else:
     9         counts[word] = counts.get(word,0) + 1
    10 items = list(counts.items())
    11 items.sort(key=lambda x:x[1], reverse=True)
    12 for i in range(15):
    13     word, count = items[i]
    14     print ("{0:<10}{1:>5}".format(word, count))

    结果是:

    曹操 946
    孔明 737
    将军 622
    玄德 585
    却说 534
    关公 509
    荆州 413
    二人 410
    丞相 405
    玄德曰 390
    不可 387
    孔明曰 374
    张飞 358
    如此 320
    不能 318

    进一步改进, 我想只知道人物出场统计,代码如下:

     1 import jieba
     2 txt = open("threekingdoms3.txt", "r", encoding='utf-8').read()
     3 names = {'曹操','孔明','刘备','关羽','张飞','吕布','赵云','孙权','周瑜','袁绍','黄忠','魏延'}
     4 words  = jieba.lcut(txt)
     5 counts = {}
     6 for word in words:
     7     if len(word) == 1:
     8         continue
     9     elif word == "诸葛亮" or word == "孔明曰":
    10         rword = "孔明"
    11     elif word == "关公" or word == "云长":
    12         rword = "关羽"
    13     elif word == "玄德" or word == "玄德曰":
    14         rword = "刘备"
    15     elif word == "孟德" or word == "丞相":
    16         rword = "曹操"
    17     else:
    18         rword = word
    19     counts[rword] = counts.get(rword,0) + 1
    20 # for word in excludes:
    21 #     del counts[word]
    22 items = list(counts.items())
    23 items.sort(key=lambda x:x[1], reverse=True)
    24 for i in range(40):
    25     word, count = items[i]
    26     if word in names:
    27         print ("{0:<10}{1:>5}".format(word, count))

    运行结果为:

    曹操 1358
    孔明 1265
    刘备 1251
    关羽 783
    张飞 358
    吕布 300
    赵云 278
    孙权 257
    周瑜 217
    袁绍 191

    进一步的做词云图:

     1 import jieba
     2 import os
     3 import wordcloud
     4  
     5 def getText(file):
     6     with open(file, 'r', encoding= 'UTF-8') as txt:
     7         txt = txt.read()
     8         jieba.lcut(txt)
     9     return txt
    10  
    11  
    12 directoryname =  os.getcwd()
    13 filename = input()
    14 txt = getText(filename + '.txt')
    15 wordclouds = wordcloud.WordCloud(width=1000, height= 800, margin=2).generate(txt)
    16 wordclouds.to_file('{}.png'.format(filename))
    17  
    18 os.system('{}.png'.format(filename))

    名称是可以进一步优化的,参见第二部分代码。

    中文wordcloud库默认会出现乱码,解决方法参考 https://blog.csdn.net/Dick633/article/details/80261233

    参考:https://blog.csdn.net/weixin_44521703/article/details/93058003

  • 相关阅读:
    Kubernetes 1.5 配置dns
    详细图解,一眼就能看懂!卷帘快门(Rolling Shutter)与全局快门(Global Shutter)的区别
    把C#程序(含多个Dll)合并成一个Exe的超简单方法
    TortoiseSVN 合并操作简明教程
    简单说说.Net中的弱引用
    漫谈并发
    可靠UDP设计
    自动内存管理算法 —— 标记和复制法
    Unity防破解 —— 加密Dll与Key保护
    Unity防破解 —— 重新编译mono
  • 原文地址:https://www.cnblogs.com/116970u/p/11611821.html
Copyright © 2011-2022 走看看