  • Chinese Word Frequency Statistics and Word Cloud Creation

    1. Download a full-length Chinese novel and convert it to UTF-8 encoding
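    The conversion step itself can be sketched as follows. This is a minimal example, not the author's code: the function name `convert_to_utf8` and the default source encoding `gbk` are assumptions. Many Chinese `.txt` novels are distributed in GBK or GB18030, but the actual encoding of a given file should be checked first.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding='gbk'):
    """Re-encode a text file as UTF-8 and return the number of characters.

    src_encoding is an assumption; pass the real encoding of your file.
    """
    with open(src_path, 'r', encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(text)
    return len(text)


if __name__ == '__main__':
    # Build a small GBK sample in a temporary directory so the sketch
    # runs on its own; the file names are hypothetical.
    folder = tempfile.mkdtemp()
    src = os.path.join(folder, 'novel_gbk.txt')
    dst = os.path.join(folder, 'novel_utf8.txt')
    with open(src, 'w', encoding='gbk') as f:
        f.write('中文词频统计')
    convert_to_utf8(src, dst)
    with open(dst, 'r', encoding='utf-8') as f:
        print(f.read())
```

    If the source encoding is wrong, `open(...).read()` typically raises `UnicodeDecodeError`, which is a useful signal to try another encoding such as `gb18030`. The code that follows practises the counting logic on a small English sample file.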
    fo=open('test.txt','w')
    fo.write('''spend all your time waiting for that second chance
    for the break that will make it ok
    there's always some reason to feel not good enough
    and it's hard at the end of the day
    i need some distraction or a beautiful release
    memories seep from my veins
    let me be empty or weightless and maybe
    i'll find some peace tonight
    in the arms of the angel far away from here
    from this dark cold hotel room and the endlessness that you feel
    you are pulled from the wreckage of your silent reverie
    you are in the arms of the angel, may you find some comfort here''')
    fo.close()
    fo=open('test.txt','r')
    news=fo.read()
    fo.close()
    news=news.lower()
    # replace punctuation with spaces so it does not stick to words
    for i in '.,"':
        news=news.replace(i,' ')
    # split() with no argument splits on any whitespace, including newlines
    word=news.split()
    dic={}
    # stop words to leave out of the count
    exp={'','the','and','to','on','of','s','a','me','is'}
    keys=set(word)-exp

    for i in keys:
        dic[i]=word.count(i)

    # sort by count, highest first
    a=list(dic.items())
    a.sort(key=lambda x:x[1],reverse=True)

    # print the 10 most frequent words
    for i in a[:10]:
        print(i)
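    Calling `word.count(i)` once per distinct word rescans the whole list each time. As an alternative sketch (not the method used above), the standard library's `collections.Counter` does the counting in one pass, and a regular expression handles the tokenising. The function name `top_words` and the pattern `[a-z']+` are illustrative choices; the pattern keeps contractions such as "there's" as a single word.

```python
import re
from collections import Counter

# Same stop words as the code above
STOP_WORDS = {'the', 'and', 'to', 'on', 'of', 's', 'a', 'me', 'is'}


def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase first, then pull out runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


if __name__ == '__main__':
    sample = ("in the arms of the angel far away from here\n"
              "you are in the arms of the angel")
    for word, count in top_words(sample, 3):
        print(word, count)
```

    `most_common(n)` returns fewer than `n` pairs when the text has fewer distinct words, so there is no risk of an index error on short inputs.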

    2. Use the jieba library to count Chinese word frequencies and output the top 20 words with their counts.

    import jieba
    txt=open('jianai.txt','r',encoding='utf-8')
    jianai=txt.read()
    txt.close()
    # replace ASCII and full-width punctuation with spaces
    for i in ',."!?,。“”!?':
        jianai=jianai.replace(i,' ')
    jianai=list(jieba.cut(jianai))
    # tokens to exclude: whitespace and a few common words
    ll={'',' ','\n','离开','认为','这儿','即使','这样','等等'}
    dic={}
    keys=set(jianai)-ll
    for i in keys:
        dic[i]=jianai.count(i)
    # sort by count, highest first, and print the top 20
    items=list(dic.items())
    items.sort(key=lambda x:x[1],reverse=True)
    for i in items[:20]:
        print(i)
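    A hand-maintained exclusion set tends to miss punctuation and single-character particles. One possible refinement, sketched here under stated assumptions, is to keep only tokens that are at least two characters long and consist entirely of CJK ideographs before counting. The helper names `is_chinese_word` and `top_chinese_words` are hypothetical, the range `\u4e00` to `\u9fff` covers only the basic CJK Unified Ideographs block, and the token list in the demo is hand-written to stand in for the output of `jieba.lcut`, not real jieba output.

```python
from collections import Counter


def is_chinese_word(token, min_len=2):
    """True if token has at least min_len characters, all CJK ideographs."""
    return len(token) >= min_len and all(
        '\u4e00' <= ch <= '\u9fff' for ch in token
    )


def top_chinese_words(tokens, n=20, stop_words=frozenset()):
    """Count the tokens that pass the filter and return the top n pairs."""
    counts = Counter(
        t for t in tokens if is_chinese_word(t) and t not in stop_words
    )
    return counts.most_common(n)


if __name__ == '__main__':
    # Made-up tokens standing in for jieba.lcut(text)
    tokens = ['简爱', '离开', '了', '桑菲尔德', ',', '简爱', '认为', '\n', '简爱']
    print(top_chinese_words(tokens, n=2, stop_words={'离开', '认为'}))
```

    Dropping single-character tokens also removes some meaningful one-character words, so `min_len=1` may suit some texts better.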
  • Original article: https://www.cnblogs.com/liulingyuan/p/7590848.html