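  • Comprehensive exercise: word frequency counting

    Download the lyrics of an English song or an English article.

    Replace all separators such as , . ? ! ' : with spaces.

    Convert all uppercase letters to lowercase.

    Generate the word list.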

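    # Read the text, replace every separator with a space,
    # then lowercase and split into words (once, after the loop).
    with open('news.txt','r',encoding='utf-8') as f:
        news=f.read()
    sep=''',.'!"?:'''
    for c in sep:
        news=news.replace(c,' ')
    wordList=news.lower().split()

    for w in wordList:
        print(w)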
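
    The separator replacement above can also be done in a single pass with a translation table. The following is a minimal sketch, not part of the original exercise; the function name `tokenize` is hypothetical, and the separator string is the same one used in the code above.

```python
def tokenize(text, separators=''',.'!"?:'''):
    """Replace each separator with a space, lowercase, and split on whitespace."""
    # Map every separator character to a single space.
    table = str.maketrans({c: ' ' for c in separators})
    return text.translate(table).lower().split()


words = tokenize("Hello, world! It's a new day.")
print(words)
```

    Because the apostrophe is treated as a separator, a contraction such as "It's" is split into two tokens, exactly as in the loop-based version.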

    Generate the word frequency statistics.

    # Build the word list, then count how often each distinct word occurs.
    with open('news.txt','r',encoding='utf-8') as f:
        news=f.read()
    sep=''',.'!"?:'''
    for c in sep:
        news=news.replace(c,' ')
    wordList=news.lower().split()
    wordDict={}
    wordSet=set(wordList)
    for w in wordSet:
        wordDict[w]=wordList.count(w)
    for w in wordDict:
        print(w,wordDict[w])
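
    Calling `wordList.count(w)` for every distinct word rescans the whole list each time. As an alternative sketch (not the approach used in the original), the standard library's `collections.Counter` builds the same word-to-count mapping in one pass. The function name `count_words` is hypothetical.

```python
from collections import Counter


def count_words(word_list):
    """Return a mapping from each word to its number of occurrences."""
    return Counter(word_list)


counts = count_words(['love', 'me', 'love', 'you', 'love'])
for word, n in counts.items():
    print(word, n)
```

    A `Counter` is a `dict` subclass, so it can be used wherever `wordDict` is used in the rest of the exercise.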

    Exclude grammatical (function) words such as pronouns, articles, and conjunctions.

    # Same counting as before, but words in `exclude` are dropped first.
    with open('news.txt','r',encoding='utf-8') as f:
        news=f.read()
    sep=''',.'!"?:'''
    exclude={'be','i','so','over','hearing'}
    for c in sep:
        news=news.replace(c,' ')
    wordList=news.lower().split()
    wordDict={}
    wordSet=set(wordList)-exclude
    for w in wordSet:
        wordDict[w]=wordList.count(w)
    for w in wordDict:
        print(w,wordDict[w])
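
    The `exclude` set above is hand-picked for one particular text. One way to organize it by the categories named in the step is sketched below. The word lists are illustrative assumptions and deliberately incomplete, and the names `PRONOUNS`, `ARTICLES`, `CONJUNCTIONS`, `STOP_WORDS`, and `remove_stop_words` are hypothetical.

```python
# Illustrative, incomplete lists (assumptions); extend them for a real text.
PRONOUNS = {'i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them'}
ARTICLES = {'a', 'an', 'the'}
CONJUNCTIONS = {'and', 'or', 'but', 'so', 'because'}
STOP_WORDS = PRONOUNS | ARTICLES | CONJUNCTIONS


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the words that are not in stop_words, preserving order."""
    return [w for w in words if w not in stop_words]


filtered = remove_stop_words(['the', 'cat', 'and', 'i', 'sat'])
print(filtered)
```

    Filtering the list itself, rather than only the set of distinct words, keeps the excluded words out of any later step that works on the list directly.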

    Sort by frequency and output the 20 most frequent words (TOP 20).

    with open('news.txt','r',encoding='utf-8') as f:
        news=f.read()
    sep=''',.'!"?:'''
    exclude={'be','i','so','over','hearing'}
    for c in sep:
        news=news.replace(c,' ')
    wordList=news.lower().split()
    wordDict={}
    wordSet=set(wordList)-exclude
    for w in wordSet:
        wordDict[w]=wordList.count(w)

    # Sort (word, count) pairs by count, highest first.
    dic=sorted(wordDict.items(),key=lambda d:d[1],reverse=True)
    print(dic)
    # Slicing avoids an IndexError when there are fewer than 20 distinct words.
    for item in dic[:20]:
        print(item)
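
    When only the top entries are needed, sorting the whole dictionary is not required. The sketch below uses `heapq.nlargest` from the standard library as an alternative to `sorted`. It is not the method used in the original, and the function name `top_n` is hypothetical.

```python
import heapq


def top_n(word_dict, n=20):
    """Return up to n (word, count) pairs with the highest counts, highest first."""
    return heapq.nlargest(n, word_dict.items(), key=lambda item: item[1])


sample_counts = {'love': 3, 'you': 1, 'me': 2}
print(top_n(sample_counts, n=2))
```

    If the dictionary has fewer than `n` entries, `heapq.nlargest` simply returns all of them, so no length check is needed.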

    Save the text to be analyzed as a UTF-8 encoded file, and obtain the content for the word frequency analysis by reading that file.

    # Pass the encoding explicitly; the platform default is not always UTF-8.
    with open('news.txt','r',encoding='utf-8') as f:
        text=f.read()
    print(text)
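
    To check that a file really round-trips as UTF-8, a small self-contained sketch follows. It writes a sample string containing non-ASCII characters to a temporary file and reads it back. The sample text and the helper name `read_text` are assumptions made for illustration; the real exercise would read `news.txt` instead.

```python
import os
import tempfile


def read_text(path):
    """Read a whole file as UTF-8 text."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()


# Sample text with non-ASCII characters (e-acute, curly apostrophe).
sample = 'Caf\u00e9 songs don\u2019t lie'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'news.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)
    restored = read_text(path)

print(restored == sample)
```

    Note that the curly apostrophe in the sample is a different character from the straight `'` in `sep`, so it would need to be added to the separator string to be replaced as well.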

    
    

     

  • Original article: https://www.cnblogs.com/qq8675/p/8653829.html