  • Chinese word frequency statistics

    Download a long Chinese article.

    Read the text to be analyzed from the file.

    news = open('gzccnews.txt', 'r', encoding='utf-8').read()

    Install and use jieba for Chinese word segmentation.

    pip install jieba

    import jieba

    words = jieba.lcut(news)
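
    A quick sanity check of the segmenter (a minimal illustration; the sample sentence is not from the analyzed text):

    import jieba

    # jieba.lcut returns the segmented tokens as a plain Python list
    print(jieba.lcut('我们在学习中文分词'))
    # prints something like: ['我们', '在', '学习', '中文', '分词']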

    Generate the word frequency counts.

    Sort the results.

    Exclude grammatical function words such as pronouns, articles, and conjunctions.

    Output the 20 most frequent words (TOP 20). The full script below implements these steps:

    import jieba
    
    fo = open('douluo.txt','r',encoding='utf-8').read()
    
    wordsls = jieba.lcut(fo)
    wcdict = {}
    # for word in wordsls:
    #   if len(word)==1:
    #    continue
    #   else:
    #    wcdict[word]=wcdict.get(word,0)+1
    for i in set(wordsls):
        wcdict[i] = wordsls.count(i)
    # Stopwords to drop: function words, pronouns, punctuation and whitespace
    delete = {'自己', '已经', '没有', '他们', '我们', '什么', '一个',
              ' ', '-', '\n', '.'}
    for i in delete:
        if i in wcdict:
            del wcdict[i]
    sort_word = sorted(wcdict.items(), key=lambda d: d[1], reverse=True)  # sort by count, descending
    for i in range(20):  # print the top 20
        print(sort_word[i])
    
    # fo = open("douluo1.txt", "r", encoding='utf-8')
    # print("File name: ", fo.name)
    # for index in range(5):
    #     line = next(fo)
    #     print("Line %d - %s" % (index, line))
    #
    # # close the file
    # fo.close()
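
    An equivalent and somewhat more idiomatic way to build and rank the counts (a minimal sketch, assuming the same douluo.txt input and a similar stopword set as above):

    import jieba
    from collections import Counter

    text = open('douluo.txt', 'r', encoding='utf-8').read()
    # Keep tokens longer than one character that are not in the stopword set
    stopwords = {'自己', '已经', '没有', '他们', '我们', '什么', '一个'}
    words = [w for w in jieba.lcut(text) if len(w) > 1 and w not in stopwords]
    # Counter.most_common returns the top-N (word, count) pairs directly
    for word, count in Counter(words).most_common(20):
        print(word, count)

    Filtering out single-character tokens also removes most punctuation and particles, which is what the commented-out length check in the script above was doing.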
