  • Chinese Word Frequency Statistics

    Download a long Chinese article.

    Read the text to be analyzed from a file.

    with open('gzccnews.txt', 'r', encoding='utf-8') as f:
        news = f.read()
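
    The read step can be wrapped in a small helper so the file is always closed, even if reading fails. This is only a minimal sketch; the name `read_text` is a hypothetical helper introduced here for illustration, not part of the original post.

```python
def read_text(path, encoding="utf-8"):
    """Return the full contents of a text file as one string."""
    # The with-block closes the file automatically, even on errors.
    with open(path, "r", encoding=encoding) as f:
        return f.read()
```

    With this helper, `news = read_text('gzccnews.txt')` yields a plain string, which is the type that `jieba.lcut` expects.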

    Install jieba and use it for Chinese word segmentation.

    pip install jieba

    import jieba

    words = jieba.lcut(news)  # lcut already returns a list of tokens
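
    Segmentation output usually still contains spaces and line breaks as tokens. The sketch below drops them right after cutting. The helper name `tokenize` and its `cut` parameter are assumptions made for illustration; when `cut` is omitted it falls back to `jieba.lcut`, so jieba must be installed for that path.

```python
def tokenize(text, cut=None):
    """Split text into tokens and drop whitespace-only tokens.

    cut: a callable mapping a string to a list of strings. When omitted,
    jieba.lcut is used (requires `pip install jieba`).
    """
    if cut is None:
        import jieba  # imported lazily so other segmenters can be used without it
        cut = jieba.lcut
    # str.strip() also removes the full-width space '\u3000' and newlines,
    # so whitespace-only tokens become empty strings and are filtered out.
    return [tok for tok in cut(text) if tok.strip()]
```

    Calling `tokenize(news)` uses jieba; passing another callable as `cut` makes it possible to try the filtering step without jieba.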

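    Generate the word frequency counts.

    Sort the words by frequency.

    Exclude grammatical words: pronouns, articles, and conjunctions.

    Output the 20 most frequent words.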

    import jieba
    
    f = open('novel.txt','r',encoding='utf-8')
    novel = f.read()
    f.close()
    
    # Tokens to exclude: whitespace, punctuation, and function words
    exclude = {'\n', '\u3000', '-', ' ', '', '便', '一个'}
    
    sep = ''',。“”‘’’、?!:'''
    for c in sep:
        novel = novel.replace(c,' ')
    
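    novels = jieba.lcut(novel)
    
    # Count each distinct token, skipping the excluded ones
    Dict = {}
    Set = set(novels) - exclude
    for w in Set:
        Dict[w] = novels.count(w)  # count tokens, not substrings of the raw text
    
    # Sort by frequency, highest first
    List = list(Dict.items())
    List.sort(key=lambda x: x[1], reverse=True)
    
    # Print the 20 most frequent words (fewer if there are fewer distinct words)
    for item in List[:20]:
        print(item)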
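
    The count, sort, and top-20 steps can also be done in one pass with `collections.Counter` from the standard library, which avoids rescanning the text once per distinct word. This is a sketch of an alternative, not the method used above; the function name `top_words` and its parameters are assumptions for illustration.

```python
from collections import Counter

def top_words(tokens, exclude=(), n=20):
    """Return the n most frequent tokens as (token, count) pairs."""
    excluded = set(exclude)
    # One pass over the tokens; excluded tokens are never counted.
    counts = Counter(tok for tok in tokens if tok not in excluded)
    # most_common sorts by count, highest first, and returns at most n pairs.
    return counts.most_common(n)
```

    With the variables from the script above, `top_words(novels, exclude)` returns a list in the same shape as `List[:20]`.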

  • Original article: https://www.cnblogs.com/wumeiying/p/8664830.html