  • Chinese Word Frequency Statistics and Word Cloud Generation

    Assignment requirements: https://edu.cnblogs.com/campus/gzcc/GZCC-16SE1/homework/2822

    1. Download a full-length Chinese novel.

    2. Read the text to be analyzed from a file.

    # The novel file is GB18030-encoded; the with-block closes it automatically.
    with open("红楼梦.txt", "r", encoding='gb18030') as f:
        novel = f.read()
    

    3. Install jieba and use it to segment the Chinese text.
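
    The original post shows no code for this step. A minimal sketch of what it might look like, assuming jieba has been installed with pip install jieba. The helper name segment and its cut parameter are hypothetical, introduced only so the function can be tried with a stand-in tokenizer:

```python
def segment(text, cut=None):
    """Split text into a list of words, dropping whitespace-only tokens."""
    if cut is None:
        # Imported lazily so the helper can be exercised without jieba installed.
        import jieba
        cut = jieba.lcut  # precise-mode segmentation, returns a list of str
    return [tok for tok in cut(text) if tok.strip()]
```

    With jieba installed, segment(novel) would return the novel as a list of words ready for counting.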

      

    4. Update the dictionary with vocabulary specific to the text being analyzed.

    import jieba
    jieba.add_word('林姑娘')  # register a single custom word
    jieba.load_userdict(r'红楼梦词库.txt')  # load a user dictionary file

    5. Count word frequencies.

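    wordsCount = {}
    for i in tokens:
        # Increment in one pass; tokens.count(i) would rescan the whole list for every token.
        wordsCount[i] = wordsCount.get(i, 0) + 1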
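
    The same tally can be sketched with collections.Counter from the standard library. The sample token list below is made up for illustration and is not taken from the novel:

```python
from collections import Counter

# Hypothetical sample tokens standing in for the segmented novel.
sample_tokens = ["宝玉", "黛玉", "宝玉", "贾母", "宝玉", "黛玉"]

# Counter makes a single pass over the list and behaves like a dict of counts.
words_count = Counter(sample_tokens)
```

    Counter also returns 0 for a word it has never seen, so lookups need no special casing.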
    

    6. Sort by frequency.

    top = list(wordsCount.items())
    top.sort(key=lambda x: x[1], reverse=True)
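
    When only the first few entries are needed, sorting the full list is optional. A sketch using heapq.nlargest from the standard library, with made-up counts; top_n is a hypothetical helper:

```python
import heapq

def top_n(counts, n=20):
    """Return the n (word, count) pairs with the highest counts, largest first."""
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])

# Hypothetical counts, not taken from the novel.
sample_counts = {"宝玉": 3, "黛玉": 2, "贾母": 1}
top_two = top_n(sample_counts, n=2)
```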
    

    7. Exclude grammatical words such as pronouns, articles, and conjunctions (stop words).

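    with open("stop_chinese.txt", "r", encoding='utf-8') as f:
        stops = set(f.read().split())  # a set makes membership tests fast
    # Segment first: iterating over the raw string novel would yield single characters.
    tokens = [token for token in jieba.lcut(novel) if token not in stops]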
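
    The filtering step can be tried in isolation. In this sketch the stop words and tokens are small made-up samples, and remove_stops is a hypothetical helper. It also drops whitespace-only tokens, which goes beyond the original filter:

```python
def remove_stops(tokens, stops):
    """Keep tokens that are neither stop words nor pure whitespace."""
    stop_set = set(stops)  # set lookup is O(1) on average
    return [tok for tok in tokens if tok.strip() and tok not in stop_set]

# Hypothetical samples, not the contents of stop_chinese.txt.
sample_stops = ["的", "了", "和"]
sample_tokens = ["宝玉", "的", "玉", "\n", "和", "黛玉"]
filtered = remove_stops(sample_tokens, sample_stops)
```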
    

    8. Output the 20 most frequent words and save the result to a file.

    import pandas as pd
    top.sort(key=lambda x: x[1], reverse=True)
    pd.DataFrame(data=top[0:20]).to_csv('top_chinese20.csv', encoding='utf-8')
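
    If pandas is not available, a similar file can be sketched with the standard csv module. The column names word and count and the helper save_top are assumptions of this example, and the layout differs from the pandas output above, which adds an index column:

```python
import csv

def save_top(pairs, path, n=20):
    """Write the first n (word, count) pairs to a UTF-8 CSV file with a header row."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(pairs[:n])
```

    Calling save_top(top, 'top_chinese20.csv') on the sorted list from step 6 would write the header plus the first 20 rows.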
    

    9. Generate the word cloud.

     

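    import jieba
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    with open('top_chinese20.csv', 'r', encoding='utf-8') as f:
        txt = f.read()
    wordlist = jieba.lcut(txt)

    # Join with spaces so WordCloud's tokenizer sees each segmented word separately.
    wl_split = ' '.join(wordlist)
    background_image = plt.imread('background.jpg')

    # For Chinese text, also pass font_path pointing to a CJK-capable font;
    # the default font lacks those glyphs.
    mymc = WordCloud(background_color='white', mask=background_image,
                     margin=2, max_words=20, max_font_size=150, random_state=30).generate(wl_split)

    plt.imshow(mymc)
    plt.axis("off")
    plt.show()
    mymc.to_file(r'WordCloud.png')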
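
    Re-segmenting the CSV text discards the counts saved in step 8. One alternative, sketched below, is to read the counts back and pass them to WordCloud.generate_from_frequencies. The helper names are hypothetical, and the sketch assumes the file has one header row and that each data row ends with a word followed by its integer count. The font_path argument must point to a font with Chinese glyphs on your system:

```python
import csv

def load_frequencies(csv_path):
    """Read {word: count} from a CSV whose data rows end with word, count."""
    freqs = {}
    with open(csv_path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        for row in reader:
            # Ignore rows that do not end in an integer count.
            if len(row) >= 2 and row[-1].isdigit():
                freqs[row[-2]] = int(row[-1])
    return freqs

def build_cloud(freqs, font_path):
    """Render a word cloud sized by the given counts (needs the wordcloud package)."""
    from wordcloud import WordCloud  # imported lazily; third-party dependency
    cloud = WordCloud(font_path=font_path, background_color="white", max_words=20)
    return cloud.generate_from_frequencies(freqs)
```

    Because the sizes come straight from the counts, the most frequent word is drawn largest without any second segmentation pass.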
    

     

     

     

  • Original article: https://www.cnblogs.com/liangqiuhua/p/10593246.html