  • Comprehensive exercise: word frequency statistics

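    1. English word frequency count

    Download the lyrics of an English song or an English article, and replace all separators such as , . ? ! ’ : with spaces.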

      

    news='''
    Guo Shuqing, head of the newly established China banking and insurance regulatory commission, was appointed Party secretary and vice-governor of the central bank on Monday, according to an announcement published on the People's Bank of China website.
    
    Guo, 61, former chairman of the China Banking Regulatory Commission, became Party secretary as well as chairman last week of the new banking and insurance regulatory commission, which combines the role of CBRC and the China Insurance Regulatory Commission.
    
    Yi Gang, 60, the newly elected central bank governor, was also appointed the Party's deputy chief of the central bank.
    
    Experts said former governors of the central bank also have held the title of Party chief, but the unusual arrangement will improve coordination between regulators of different sectors.
    
    Experts said the PBOC leadership adjustment could be in line with the country's newly restructured financial regulatory framework, on top of which is the cabinet-level financial stability and development committee established in November.
    
    It coordinates with the PBOC and two specialized supervision bodies-the newly merged banking and insurance regulatory commission, and the China Securities Regulatory Commission.
    
    As part of the State institutional reform plan approved by the first session of the 13th National People's Congress last week, the new watchdog for banking and insurance will be directly led by the State Council, China's Cabinet, which aims to strengthen regulation and prevent systemic financial risks, experts have said.
    
    Under the reform plan, functions and duties, including drafting key financial regulations and supervision of the basic financial system, will belong to the PBOC.
    
    Ming Ming, an analyst with CITIC Securities, said Guo's appointment is expected to solve existing problems with the goal of forestalling and defusing major risks.
    '''
    
    sep = ''',.?!'":;()'''
    for c in sep:
        news = news.replace(c,' ')
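
The replacement loop above calls `replace` once per separator. As a minimal sketch of an alternative, `str.translate` can map every separator to a space in a single pass. The names `SEPARATORS`, `TABLE`, and `strip_separators` are illustrative, and adding the hyphen to the separator list is an assumption, not something the exercise requires.

```python
# Illustrative separator list; the hyphen is an added assumption.
SEPARATORS = ",.?!':;()\"-"

# Translation table that maps each separator character to a space.
TABLE = str.maketrans({c: " " for c in SEPARATORS})


def strip_separators(text):
    """Return text with every separator character replaced by a space."""
    return text.translate(TABLE)


sample = "Guo, 61, said: the People's Bank (PBOC) agreed."
print(strip_separators(sample).split())
```

Splitting on whitespace afterwards collapses the runs of spaces left where two separators were adjacent.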
    

      

    Convert all uppercase letters to lowercase and generate the word list.

    wordList = news.lower().split()
    for w in wordList:
        print(w)

    Generate the word frequency counts.

    wordDist = {}
    wordSet = set(wordList)
    for w in wordSet:
        wordDist[w] = wordList.count(w)
    
    for w in wordDist:
        print(w, wordDist[w])
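
Calling `wordList.count(w)` for every distinct word rescans the whole list each time. A sketch of the same count done in one pass with the standard library's `collections.Counter` follows; the helper name `count_words` and the sample sentence are illustrative only.

```python
from collections import Counter


def count_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


counts = count_words("the bank and the central bank")
print(counts["the"], counts["bank"], counts["central"])
```

A `Counter` behaves like a dictionary, but looking up a word that never occurred returns 0 instead of raising `KeyError`.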
    

      

    Sort.

    dictList = list(wordDist.items())
    dictList.sort(key = lambda x: x[1], reverse=True)
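
Sorting on the count alone leaves words with equal counts in whatever order the dictionary produced them. If a reproducible order is wanted, one option is to break ties alphabetically, as in this sketch (the function name `sort_by_frequency` is hypothetical):

```python
def sort_by_frequency(word_counts):
    """Sort (word, count) pairs by count descending, then word ascending."""
    return sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))


print(sort_by_frequency({"the": 2, "central": 1, "bank": 2}))
```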
    

      

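    Exclude grammatical words: pronouns, articles, and conjunctions.

    exclude = {'the','of','and','s','to','which','will','as','on','is','by'}
    wordSet = set(wordList) - exclude
    # Rebuild the counts from scratch so the excluded words are really gone,
    # then sort again, because the earlier sort still contains them.
    wordDist = {}
    for w in wordSet:
        wordDist[w] = wordList.count(w)
    dictList = sorted(wordDist.items(), key=lambda x: x[1], reverse=True)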
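
The exclusion can also be applied while counting, so that excluded words never enter the dictionary in the first place. This is only a sketch; `count_without` is a hypothetical helper name.

```python
def count_without(words, exclude):
    """Count each word in one pass, skipping anything in the exclude set."""
    counts = {}
    for w in words:
        if w in exclude:
            continue
        counts[w] = counts.get(w, 0) + 1
    return counts


words = ["the", "bank", "of", "the", "central", "bank", "s"]
print(count_without(words, {"the", "of", "s"}))
```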

    Output the 20 most frequent words.

    for i in range(20):
        print(dictList[i])
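
Indexing with `range(20)` raises `IndexError` when the text has fewer than 20 distinct words. As one possible alternative, `heapq.nlargest` returns at most `n` items and does not need the whole list sorted first. The function name `top_n` and the sample counts are illustrative.

```python
import heapq


def top_n(word_counts, n=20):
    """Return up to n (word, count) pairs with the largest counts."""
    return heapq.nlargest(n, word_counts.items(), key=lambda item: item[1])


counts = {"bank": 5, "central": 3, "guo": 2, "reform": 1}
print(top_n(counts, 2))
print(len(top_n(counts, 10)))  # fewer than 10 words exist, so no error
```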
    

      

    Save the text to be analyzed as a UTF-8 encoded file, and obtain the content for word frequency analysis by reading the file.

              Read the news.txt file:

    f=open('news.txt','r',encoding='utf-8')
    news=f.read()
    f.close()
    print(news)
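
The explicit `open` / `close` pair above leaves the file open if `read` raises. A `with` block closes it in every case. The sketch below writes a throwaway file first so that it runs on its own; the helper name `read_text` and the sample sentence are illustrative.

```python
import os
import tempfile


def read_text(path):
    """Read a whole UTF-8 text file; the file is closed automatically."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Demo on a temporary file so the example does not depend on news.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "news.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Guo Shuqing was appointed on Monday.")
    demo_text = read_text(demo_path)

print(demo_text)
```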
    

        Write the sorted results to the newscount.txt file:

    f=open('newscount.txt','a')
    for i in range(25):
        f.write(dictList[i][0]+' '+str(dictList[i][1])+'\n')
    f.close()
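
Because the file is opened in `'a'` (append) mode, every run adds another copy of the results to the end of newscount.txt. If a fresh file per run is preferred, `'w'` mode can be used instead, as in this sketch (the helper name `save_counts` and the sample pairs are illustrative):

```python
import os
import tempfile


def save_counts(pairs, path, limit=20):
    """Write up to `limit` 'word count' lines, replacing any existing file."""
    with open(path, "w", encoding="utf-8") as f:
        for word, count in pairs[:limit]:
            f.write(f"{word} {count}\n")


pairs = [("bank", 5), ("central", 3), ("guo", 2)]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "newscount.txt")
    save_counts(pairs, out_path, limit=2)
    save_counts(pairs, out_path, limit=2)  # second run overwrites the first
    with open(out_path, encoding="utf-8") as f:
        saved_lines = f.read().splitlines()

print(saved_lines)
```

Slicing with `pairs[:limit]` also avoids an `IndexError` when there are fewer results than the limit.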
    

      

    2. Chinese word frequency count

    Download a long Chinese article. Read the text to be analyzed from a file.

    news = open('gzccnews.txt','r',encoding = 'utf-8').read()

    Install jieba and use it for Chinese word segmentation.

    pip install jieba

    import jieba

    list(jieba.cut(news))
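
`jieba.cut` returns a generator, which is why it is wrapped in `list(...)` above; `jieba.lcut` and `jieba.lcut_for_search` return lists directly. The sketch below wraps both modes in a hypothetical `segment` helper and imports jieba lazily, so the rest of the code loads even where jieba is not installed. `drop_blank_tokens` is likewise an illustrative name.

```python
def segment(text, search_mode=False):
    """Segment Chinese text with jieba (third-party: pip install jieba)."""
    import jieba  # imported here so jieba is only needed when segmenting

    if search_mode:
        # Search-engine mode additionally splits long words into shorter ones.
        return jieba.lcut_for_search(text)
    return jieba.lcut(text)


def drop_blank_tokens(tokens):
    """Remove tokens made only of whitespace (spaces, newlines, U+3000)."""
    return [t for t in tokens if t.strip()]


# Typical use, assuming jieba is installed and `news` holds the article text:
#     tokens = drop_blank_tokens(segment(news))
print(drop_blank_tokens(["我", " ", "\n", "\u3000", "你"]))
```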

    import jieba
    
    file=open('hong.txt','r',encoding='utf-8')
    word=file.read()
    file.close()
    

      

    Generate the word frequency counts.

    wordList=list(jieba.cut_for_search(word))
    
    wordDist={}
    for w in set(wordList):
        wordDist[w] = wordList.count(w)
    
    for w in wordDist:
        print(w, wordDist[w])
    

      

    Sort.

    dictList = list(wordDist.items())
    dictList.sort(key = lambda x: x[1], reverse=True)

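    Exclude grammatical words: pronouns, articles, and conjunctions.

    sep=''',。?“”:、?;!!'''
    
    exclude ={' ','\n','了','的','\u3000','他','我','也','又','是','你','着','这','就','都','呢','只'}
    
    # Replace the separators, then segment again so the punctuation is gone
    for c in sep:
        word = word.replace(c,' ')
    wordList=list(jieba.cut_for_search(word))
    
    # Rebuild the counts without the excluded words and sort again
    wordSet=set(wordList)-exclude
    wordDist={w: wordList.count(w) for w in wordSet}
    dictList=sorted(wordDist.items(), key=lambda x: x[1], reverse=True)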
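
A hand-written separator list will miss some punctuation marks. One possible complement, sketched here, is to keep only tokens made entirely of characters in the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) and then apply the exclusion set. Restricting to that range is an assumption of this sketch (it discards digits and Latin words too), and the function names are illustrative.

```python
def is_chinese_word(token):
    """True if the token is non-empty and every character is a CJK ideograph."""
    return bool(token) and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def filter_tokens(tokens, exclude):
    """Keep Chinese-only tokens that are not in the exclude set."""
    return [t for t in tokens if is_chinese_word(t) and t not in exclude]


tokens = ["宝玉", "，", "的", "\n", "笑道", "\u3000", "。"]
print(filter_tokens(tokens, {"的"}))
```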
    

     

    Output the 20 most frequent words (or save the results to a file).

    f=open('hongcount.txt','a',encoding='utf-8')
    for i in range(20):
        f.write(dictList[i][0]+' '+str(dictList[i][1])+'\n')
    f.close()
    

      

  • Original article: https://www.cnblogs.com/zhiling123/p/8661222.html