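  • Comprehensive Exercise: Word Frequency Counting

    1. English word frequency count

    Download the lyrics of an English song, or an English article.

    Replace every separator character (, . ? ! ' : and so on) with a space.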

    news = '''
    歌手:Avril Lavigne(艾薇儿)
    歌词出处:http://www.5nd.com
    
    ╰☆╮Avril Lavigne - Smile╰☆╮
    Lyrics by Judy @ LK歌词组 QQ群:43882929
    You know that I'm a crazy bitch
    I do what I want, when I feel like it
    All I wanna do is lose control, oh oh
    But you don't really give a shit
    Ya go with it, go with it, go with it
    'Cause you're fuckin' crazy Rock 'N' Roll
    You-ou said "hey! what's your name?"
    It took one look and now I'm not the same
    Yeah, you said "Hey"
    And since that day
    You stole my heart and you're the one to blame
    Yeahhh and that's why I smile
    It's been a while
    Since everyday and everything has felt this right
    And now, you turn it all around
    And suddenly you're all I need the reason why
    I, I, I, I smile, ile, ile, ile
    Last night I blacked out I think
    What did you, what did you, put in my drink?
    I remember making out and then oh, oh
    I woke up with a new tattoo
    Your name was on me and my name was on you
    I would do it all over again
    You-ou said "hey what's your name?"
    It took one look and now I'm not the same
    Yeah, you said "Hey" (Hey)
    And since that day (and since that day)
    You stole my heart and you're the one to blame
    Yeahhh and that's why I smile
    It's been a while
    Since everyday and everything has felt this right
    And now, you turn it all around
    And suddenly you're all I need the reason why
    I, I, I, I smile, ile, ile, ile
    The reason why I, I, I, I smile, ile, ile, ile
    You know that I'm a crazy bitch
    I do what I want, when I feel like it
    All I wanna do is lose control
    You know that I'm a crazy bitch
    I do what I want, when I feel like it
    All I wanna do is lose control
    And that's why I smile
    It's been a while
    Since everyday and everything has felt this right
    And now, you turn it all around
    And suddenly you're all I need the reason why
    I, I, I, I smile, ile, ile, ile (the reason why)
    The reason why I, I, I, I smile, ile, ile, ile
    The reason why I, I, I, I smile, ile, ile, ile
    【 Avril Lavigne - Smile 】
    Lrc edited by Judy @ LK 歌词组
    '''
    
    sep = ''',.?!'":;,。?!:“”'''
    exclude = {'the','and','of','to'}
    
    for c in sep:
        news = news.replace(c,' ')
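The loop above calls `replace` once per separator. As an alternative sketch, the same substitution can be done in a single pass with `str.translate`. The helper name `strip_separators` and the constant `SEPARATORS` are made up for this illustration; the character set is copied from `sep` above.

```python
# Illustrative sketch only: the names below are not part of the original post.
SEPARATORS = ''',.?!'":;,。?!:“”'''  # same characters as `sep` above


def strip_separators(text, separators=SEPARATORS):
    """Return `text` with every separator character replaced by a space."""
    table = str.maketrans({ch: ' ' for ch in separators})
    return text.translate(table)


sample = "Yeah, you said \"Hey\""
cleaned = strip_separators(sample)
print(cleaned.split())  # ['Yeah', 'you', 'said', 'Hey']
```

Note that the apostrophe is one of the separators, so a contraction such as "don't" is split into "don" and "t". That is why 's' appears in the exclusion set used later in this exercise.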
    

      

    Convert everything to lowercase and build the word list

    wordList = news.lower().split()
    for w in wordList:
        print(w)

    Build the word frequency counts

    wordDist = {}
    wordSet = set(wordList)
    for w in wordSet:
        wordDist[w] = wordList.count(w)
     
    for w in wordDist:
        print(w, wordDist[w])
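Calling `wordList.count(w)` for each distinct word rescans the whole list every time, which gets slow on long texts. A possible alternative, shown here as a sketch rather than the author's method, is `collections.Counter` from the standard library, which tallies in one pass. The function name `count_words` and the sample input are illustrative.

```python
from collections import Counter


def count_words(words):
    """Count occurrences of each word in a single pass over the list."""
    return Counter(words)


# Small hand-made sample standing in for wordList
words = "i i i i smile ile ile ile".split()
counts = count_words(words)
print(counts['i'], counts['ile'], counts['smile'])  # 4 3 1
```

A `Counter` is a `dict` subclass, so the later steps that iterate over `wordDist.items()` would work on it unchanged.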
    

      

    Sort

    dictList = list(wordDist.items())
    dictList.sort(key=lambda x:x[1],reverse=True)
    

      

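    Exclude grammatical words such as pronouns, articles, and conjunctions

    exclude = {'the','of','and','s','to','which','will','as','on','is','by',}
    
    # Rebuild the counts from scratch so excluded words from the earlier pass are dropped
    wordSet=set(wordList)-exclude
    wordDist = {}
    for w in wordSet:
        wordDist[w]=wordList.count(w)

    # Re-sort, because dictList was built before the exclusion
    dictList = sorted(wordDist.items(), key=lambda x:x[1], reverse=True)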
    

      

    Print the 20 most frequent words

    # Slicing avoids an IndexError when there are fewer than 20 distinct words
    for item in dictList[:20]:
       print(item)
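Sorting by count alone leaves words with equal counts in an arbitrary order, because they come out of a set. The sketch below ranks by count and breaks ties alphabetically, so repeated runs give the same top list. The helper `top_n` and the sample counts are invented for illustration.

```python
def top_n(word_counts, n=20):
    """Return up to n (word, count) pairs: highest count first, ties in alphabetical order."""
    ranked = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Made-up counts, not measured from the lyrics above
sample_counts = {'smile': 8, 'i': 30, 'ile': 24, 'why': 8}
for word, count in top_n(sample_counts, n=3):
    print(word, count)
```

Asking for more entries than exist simply returns everything available, so no length check is needed before the loop.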
    

      

    Save the text to be analyzed as a UTF-8 encoded file, then load the content for the word frequency analysis by reading that file.

    f = open('songs.txt','r',encoding='UTF-8')
    news = f.read()
    f.close()
    print(news)
    

    Write the sorted results to songscount.txt:

    f = open('songscount.txt','a')
    for word, count in dictList[:20]:
        f.write(word+' '+str(count)+'\n')
    f.close()
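The read and write steps can also be written with `with` blocks, which close the file even if an error occurs partway through. This is a sketch under a few assumptions: the helper names `save_counts` and `load_counts` are hypothetical, the demo writes to the system temp directory rather than `songscount.txt`, and it opens the file in `'w'` mode (overwrite) instead of the `'a'` mode (append) used above.

```python
import os
import tempfile


def save_counts(pairs, path):
    """Write one 'word count' line per pair, overwriting any existing file."""
    with open(path, 'w', encoding='utf-8') as f:
        for word, count in pairs:
            f.write(f'{word} {count}\n')


def load_counts(path):
    """Read the file back into a list of (word, count) pairs."""
    with open(path, 'r', encoding='utf-8') as f:
        rows = (line.split() for line in f if line.strip())
        return [(word, int(count)) for word, count in rows]


# Demo path in the temp directory; the file name is made up for this example
demo_path = os.path.join(tempfile.gettempdir(), 'songscount_demo.txt')
save_counts([('i', 30), ('smile', 8)], demo_path)
print(load_counts(demo_path))  # [('i', 30), ('smile', 8)]
```

Overwriting keeps the file in step with the latest run. Append mode would stack a new top-20 block under the old one every time the script runs.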
    

      

     

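    2. Chinese word frequency count

    Download a long Chinese article.

    Read the text to be analyzed from a file.

    news = open('gzccnews.txt','r',encoding = 'utf-8').read()

    Install jieba and use it for Chinese word segmentation.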

    pip install jieba

    import jieba

    list(jieba.lcut(news))

    import jieba
    file=open('hong.txt','r',encoding='utf-8')
    word=file.read()
    file.close()
    

      

    Build the word frequency counts

    wordList=list(jieba.cut_for_search(word))
      
    wordDist={}
    for w in wordList:
        wordDist[w] = wordList.count(w)
      
    for w in wordDist:
        print(w, wordDist[w])
    

      

    Sort

    dictList = list(wordDist.items())
    dictList.sort(key = lambda x: x[1], reverse=True)
    

      

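    Exclude grammatical words such as pronouns, articles, and conjunctions

    sep=''',。?“”:、?;!!'''
     
    exclude ={' ','\n','了','的','\u3000','他','我','也','又','是','你','着','这','就','都','呢','只'}
     
    # Strip punctuation, then segment again so the change reaches wordList
    for c in sep:
        word = word.replace(c,' ')
    wordList=list(jieba.cut_for_search(word))
     
    # Recount without the excluded words and re-sort
    wordSet=set(wordList)-exclude
    wordDist={w: wordList.count(w) for w in wordSet}
    dictList = sorted(wordDist.items(), key=lambda x: x[1], reverse=True)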
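The filtering step can also be applied to the token list directly, after segmentation. The sketch below takes an already-segmented list (for example, the output of `jieba.lcut`), drops whitespace, punctuation, and stop words, and counts what remains. The names `STOP_WORDS`, `PUNCTUATION`, and `count_tokens` and the sample tokens are made up for illustration. The stop-word list is copied from `exclude` above, and the punctuation set adds full-width variants as an assumption.

```python
from collections import Counter

# Stop words copied from `exclude` above
STOP_WORDS = {'了', '的', '他', '我', '也', '又', '是', '你', '着', '这', '就', '都', '呢', '只'}
# Punctuation from `sep`, plus full-width variants (an assumption for this sketch)
PUNCTUATION = set(',。?“”:、;!,?:;!')


def count_tokens(tokens, stop_words=STOP_WORDS, punctuation=PUNCTUATION):
    """Count tokens, skipping whitespace-only tokens, punctuation, and stop words."""
    kept = [
        t for t in tokens
        if t.strip() and t not in stop_words and t not in punctuation
    ]
    return Counter(kept)


# Hypothetical, already-segmented input standing in for jieba's output
tokens = ['宝玉', '笑', '道', ',', '我', '也', '去', '。', '\u3000', '宝玉', '\n']
print(count_tokens(tokens).most_common(1))  # [('宝玉', 2)]
```

Here `t.strip()` is empty for spaces, newlines, and the ideographic space `\u3000`, so those tokens are dropped without having to list each one.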
    

      

    Output the 20 most frequent words (or save the results to a file)

    f=open('hongcount.txt','a',encoding='utf-8')
    for word, count in dictList[:20]:
        f.write(word+' '+str(count)+'\n')
    f.close()
    

      

  • Original article: https://www.cnblogs.com/oechen/p/8666418.html