  • Comprehensive exercise: word frequency statistics

    1. English word frequency count

    Download the lyrics of an English song or an English article, and replace all separators such as , . ? ! ' : with spaces.

      

    news='''
    Guo Shuqing, head of the newly established China banking and insurance regulatory commission, was appointed Party secretary and vice-governor of the central bank on Monday, according to an announcement published on the People's Bank of China website.
    
    Guo, 61, former chairman of the China Banking Regulatory Commission, became Party secretary as well as chairman last week of the new banking and insurance regulatory commission, which combines the role of CBRC and the China Insurance Regulatory Commission.
    
    Yi Gang, 60, the newly elected central bank governor, was also appointed the Party's deputy chief of the central bank.
    
    Experts said former governors of the central bank also have held the title of Party chief, but the unusual arrangement will improve coordination between regulators of different sectors.
    
    Experts said the PBOC leadership adjustment could be in line with the country's newly restructured financial regulatory framework, on top of which is the cabinet-level financial stability and development committee established in November.
    
    It coordinates with the PBOC and two specialized supervision bodies-the newly merged banking and insurance regulatory commission, and the China Securities Regulatory Commission.
    
    As part of the State institutional reform plan approved by the first session of the 13th National People's Congress last week, the new watchdog for banking and insurance will be directly led by the State Council, China's Cabinet, which aims to strengthen regulation and prevent systemic financial risks, experts have said.
    
    Under the reform plan, functions and duties, including drafting key financial regulations and supervision of the basic financial system, will belong to the PBOC.
    
    Ming Ming, an analyst with CITIC Securities, said Guo's appointment is expected to solve existing problems with the goal of forestalling and defusing major risks.
    '''
    
    # Separators named in the task; the apostrophe splits "People's" into "People" and "s"
    sep = ",.?!':;()\""
    for c in sep:
        news = news.replace(c,' ')
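The loop above calls `replace` once per separator. As an alternative sketch (not the original author's code), `str.maketrans` and `str.translate` can perform the same substitution in a single pass. The helper name `strip_separators` and the sample sentence are made up for illustration.

```python
# Hypothetical helper: replace every separator character with a space in one pass.
SEPARATORS = ",.?!':;()\""

def strip_separators(text, separators=SEPARATORS):
    # str.maketrans needs two strings of equal length: each separator maps to a space
    table = str.maketrans(separators, " " * len(separators))
    return text.translate(table)

sample = "Guo, 61, said: \"It's fine!\""
cleaned = strip_separators(sample)
print(cleaned.split())
```

The result has the same length as the input, because each separator is replaced by exactly one space.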
    

      

    Convert all uppercase letters to lowercase and build the word list

    wordList = news.lower().split()
    for w in wordList:
        print(w)

    Generate the word frequency counts

    wordDist = {}
    wordSet = set(wordList)
    for w in wordSet:
        wordDist[w] = wordList.count(w)
    
    for w in wordDist:
        print(w, wordDist[w])
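`wordList.count(w)` rescans the whole list once for every distinct word. A minimal alternative sketch, assuming the standard-library `collections.Counter` is acceptable for this exercise; the short `words` list is illustrative data, not taken from the article.

```python
from collections import Counter

# Illustrative word list; in the exercise this would be wordList
words = "the bank said the new bank will lead the reform".split()

# Counter tallies every word in a single pass over the list
word_counts = Counter(words)

for w, n in word_counts.items():
    print(w, n)
```

A `Counter` is a `dict` subclass, so code that reads `wordDist[w]` keeps working unchanged.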
    

      

    Sort

    dictList = list(wordDist.items())
    dictList.sort(key = lambda x: x[1], reverse=True)
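Sorting by count alone leaves words with equal counts in whatever order the dictionary happened to produce. A sketch of a deterministic ordering (count descending, then word ascending); the `counts` dictionary is made-up sample data standing in for `wordDist`.

```python
# Made-up sample counts, standing in for wordDist
counts = {"bank": 4, "the": 9, "central": 4, "of": 7}

# Negating the count sorts it descending while the word itself sorts ascending
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for word, n in ranked:
    print(word, n)
```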
    

      

    Exclude function words: pronouns, articles, conjunctions

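    exclude = {'the','of','and','s','to','which','will','as','on','is','by'}
    wordSet = set(wordList) - exclude
    # Rebuild the counts and the sorted list; otherwise the excluded words
    # counted earlier would still be in wordDist and dictList
    wordDist = {}
    for w in wordSet:
        wordDist[w] = wordList.count(w)
    dictList = sorted(wordDist.items(), key=lambda x: x[1], reverse=True)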
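One way to keep excluded words out of the final ranking is to filter the token list before counting, so nothing needs to be rebuilt afterwards. A sketch under that assumption; `count_without` is a hypothetical helper name and the stop-word set is a shortened copy of the one above.

```python
from collections import Counter

def count_without(tokens, stop_words):
    # Skip stop words while counting so they never enter the result
    return Counter(t for t in tokens if t not in stop_words)

stop_words = {"the", "of", "and", "s", "to"}
tokens = "the head of the central bank and the people s bank".split()

filtered = count_without(tokens, stop_words)
print(filtered.most_common(3))
```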

    Print the 20 most frequent words

    # Slicing avoids an IndexError when there are fewer than 20 distinct words
    for item in dictList[:20]:
        print(item)
    

      

    Save the text to be analyzed as a UTF-8 encoded file, and load the content for the frequency analysis by reading that file.

              Read the news.txt file:

    f=open('news.txt','r',encoding='utf-8')
    news=f.read()
    f.close()
    print(news)
    

        Write the sorted results to the newscount.txt file:

    f=open('newscount.txt','a')
    for i in range(25):
        f.write(dictList[i][0]+' '+str(dictList[i][1])+'\n')
    f.close()
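The open/close pairs above can also be written with `with`, which closes the file even if an error occurs partway through. A sketch of the full write-then-read round trip; it writes into a temporary directory so it can run anywhere, and the helper name `save_counts` and the sample pairs are illustrative.

```python
import os
import tempfile

def save_counts(pairs, path):
    # Mode 'w' overwrites, so rerunning does not append duplicate rows
    with open(path, "w", encoding="utf-8") as f:
        for word, count in pairs:
            f.write(f"{word} {count}\n")

pairs = [("bank", 4), ("central", 3)]
out_path = os.path.join(tempfile.mkdtemp(), "newscount.txt")
save_counts(pairs, out_path)

# Read the file back to check what was written
with open(out_path, "r", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Mode `'a'`, as used above, would instead add a new copy of the results on every run.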
    

      

    2. Chinese word frequency count

    Download a long Chinese article. Read the text to be analyzed from a file.

    news = open('gzccnews.txt','r',encoding = 'utf-8').read()

    Install jieba and use it for Chinese word segmentation.

    pip install jieba

    import jieba

    list(jieba.cut(news))
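`jieba.cut` returns a generator, which is why the line above wraps it in `list`; `jieba.lcut` returns a list directly. A hedged sketch of a small wrapper that also drops whitespace-only tokens. The `cut` parameter and the name `segment` are assumptions added here so the example can run with a stand-in tokenizer when jieba is not installed.

```python
def segment(text, cut=None):
    """Split text into tokens, dropping whitespace-only ones."""
    if cut is None:
        import jieba  # third-party package, installed with: pip install jieba
        cut = jieba.lcut  # list-returning variant of jieba.cut
    return [tok for tok in cut(text) if tok.strip()]

# Stand-in tokenizer for demonstration only: one token per character
tokens = segment("我爱 北京\n", cut=list)
print(tokens)
```

With jieba installed, calling `segment(news)` without the `cut` argument would use `jieba.lcut` for the actual word segmentation.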

    import jieba
    
    file=open('hong.txt','r',encoding='utf-8')
    word=file.read()
    file.close()
    

      

    Generate the word frequency counts

    wordList=list(jieba.cut_for_search(word))
    
    wordDist={}
    # Iterate over distinct words only; counting once per occurrence is very slow on a long text
    for w in set(wordList):
        wordDist[w] = wordList.count(w)
    
    for w in wordDist:
        print(w, wordDist[w])
    

      

    Sort

    dictList = list(wordDist.items())
    dictList.sort(key = lambda x: x[1], reverse=True)

    Exclude function words: pronouns, articles, conjunctions

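    sep=''',。?“”:、?;!!'''
    
    exclude ={' ','\n','了','的','\u3000','他','我','也','又','是','你','着','这','就','都','呢','只'}
    
    # Strip the punctuation first, then segment again so the change takes effect
    for c in sep:
        word = word.replace(c,' ')
    wordList=list(jieba.cut_for_search(word))
    
    # Count only the words that are not excluded, then sort again
    wordSet=set(wordList)-exclude
    wordDist={w: wordList.count(w) for w in wordSet}
    dictList=sorted(wordDist.items(), key = lambda x: x[1], reverse=True)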
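Listing every punctuation mark by hand is easy to get wrong for Chinese text, which mixes full-width and half-width marks. One possible alternative, sketched here as an assumption rather than the author's method, is to drop any token made up entirely of punctuation or whitespace using `unicodedata.category`. `is_noise` is a hypothetical helper, and the token list is made-up sample data.

```python
import unicodedata

def is_noise(token):
    # True when every character is whitespace or punctuation (Unicode category P*)
    return all(
        ch.isspace() or unicodedata.category(ch).startswith("P")
        for ch in token
    )

tokens = ["宝玉", ",", "笑", "道", ":", "“", "\u3000", "好", "!", "\n"]
kept = [t for t in tokens if not is_noise(t)]
print(kept)
```

Function words such as '的' or '了' are ordinary characters, so they still need an explicit exclusion set like the one above.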
    

     

    Output the 20 most frequent words (or save the results to a file)

    f=open('hongcount.txt','a',encoding='utf-8')
    for i in range(20):
        f.write(dictList[i][0]+' '+str(dictList[i][1])+'\n')
    f.close()
    

      

  • Original article: https://www.cnblogs.com/zhiling123/p/8661222.html