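  • Comprehensive exercise: word frequency count

    1. English word frequency count:

    Download the lyrics of an English song or an English article.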

    song = ''' Passion is sweet
    Love makes weak
    You said you cherised freedom so
    You refused to let it go
    Follow your faith 
    Love and hate
    never failed to seize the day
    Don't give yourself away
    Oh when the night falls
    And your all alone
    In your deepest sleep 
    What are you dreeeming of
    My skin's still burning from your touch
    Oh I just can't get enough 
    I said I wouldn't ask for much
    But your eyes are dangerous
    So the tought keeps spinning in my head
    Can we drop this masquerade
    I can't predict where it ends
    If you're the rock I'll crush against
    Trapped in a crowd
    Music's loud
    I said I loved my freedom too
    Now im not so sure i do
    All eyes on you
    Wings so true
    Better quit while your ahead
    Now im not so sure i am
    Oh when the night falls
    And your all alone
    In your deepest sleep
    What are you dreaming of
    My skin's still burning from your touch
    Oh I just can't get enough
    I said I wouldn't ask for much
    But your eyes are dangerous
    So the thought keeps spinning in my head
    Can we drop this masquerade 
    I can't predict where it ends
    If you're the rock I'll crush against
    My soul, my heart
    If your near or if your far
    My life, my love
    You can have it all
    Oh when the night falls
    And your all alone
    In your deepest sleep
    What are you dreaming of
    My skin's still burning from your touch
    Oh I just can't get enough
    I said I wouldn't ask for much
    But your eyes are dangerous 
    So the thought keeps spinning in my head
    Can we drop this masquerade
    I can't predict where it ends
    If you're the rock I'll crush against
    If you're the rock i'll crush against '''

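    Replace all separators such as , . ? ! ' : with spaces.

    sep = ''',.?!:;'"'''
    for i in sep:
        # str.replace returns a new string, so assign the result back to song
        song = song.replace(i, " ")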
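
The same replacement can also be done in a single pass with `str.translate`. The following is only a minimal sketch, not the method used in this exercise; the helper name `strip_separators`, the constant `SEPARATORS`, and the sample line are made up for illustration, and the separator set is an assumption based on the list above.

```python
# Hypothetical helper: replace each separator character with a space in one pass.
SEPARATORS = ",.?!:;'\""

def strip_separators(text, separators=SEPARATORS):
    # str.maketrans maps every separator character to a single space
    table = str.maketrans(separators, " " * len(separators))
    return text.translate(table)

sample = "My soul, my heart"
cleaned = strip_separators(sample)
print(cleaned.lower().split())
```

Note that treating the apostrophe as a separator splits contractions such as "don't" into two tokens, which is worth keeping in mind when reading the counts.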

    Convert all uppercase letters to lowercase and build the word list.

    songList = song.lower().split()

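    Build the word frequency count.

    countdict = {}
    songset = set(songList)  # unique words

    # count how often each unique word occurs in the full word list
    for i in songset:
        countdict[i] = songList.count(i)
    for i in countdict:
        print(i, countdict[i])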
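
Calling `list.count` once per unique word scans the whole list every time. As an alternative, the standard library's `collections.Counter` builds the same table in one pass. This is a small sketch on made-up sample data (the `words` list stands in for `songList`), not part of the original exercise.

```python
from collections import Counter

# Sample word list standing in for songList (illustrative data only)
words = "oh i just can't get enough oh i said".split()

# Counter tallies every word in a single pass over the list
counts = Counter(words)

for word, n in counts.items():
    print(word, n)
```

A `Counter` behaves like a dictionary, and looking up a word that never occurred returns 0 instead of raising `KeyError`.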

    Sort.

    # sort the (word, count) pairs by count, highest first
    dictList = list(countdict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)

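    Exclude function words: pronouns, articles, and conjunctions.

    # run this before the counting loop so the excluded words never reach countdict
    delList = {"the", "a", "an"}
    songset = set(songList) - delList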
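
For the exclusion to affect the result, the excluded words have to be dropped before (or while) the counts are built. One way to do that is sketched below; the function name `count_without_stop_words` and the contents of `STOP_WORDS` are assumptions chosen for illustration.

```python
# Hypothetical stop-word set; extend it with whatever words should be ignored
STOP_WORDS = {"the", "a", "an", "and", "i", "you"}

def count_without_stop_words(words, stop_words=STOP_WORDS):
    counts = {}
    for word in words:
        if word in stop_words:
            continue  # skip excluded words entirely
        counts[word] = counts.get(word, 0) + 1
    return counts

words = "if you're the rock i'll crush against the rock".split()
result = count_without_stop_words(words)
print(result)
```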

    Print the 20 most frequent words.

    # slicing avoids an IndexError when there are fewer than 20 distinct words
    for item in dictList[:20]:
        print(item)
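
Sorting and taking the top entries can also be combined with `Counter.most_common`. The sketch below uses a hypothetical helper `top_n` and a made-up word list; it returns fewer than `n` pairs when the text has fewer distinct words.

```python
from collections import Counter

def top_n(words, n=20):
    # most_common returns at most n (word, count) pairs, highest count first
    return Counter(words).most_common(n)

words = "night falls night sleep night falls".split()
for word, count in top_n(words, n=2):
    print(word, count)
```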

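    Save the text to be analyzed as a UTF-8 encoded file, and read the content for the word frequency analysis from that file.

    Read the lyrics:

    f = open("F:/study/大三/大数据/song.txt", "r", encoding="utf-8")
    song = f.read()
    f.close()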

    Save the analysis results:

    f = open("F:/study/大三/大数据/resulet.txt", "a", encoding="utf-8")
    for word, count in dictList[:20]:
        f.write('\n' + word + " " + str(count))
    f.close()
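
File handling like this is often written with a `with` block, which closes the file even if an error occurs partway through. The sketch below is an assumption-laden illustration: `save_top_words`, `load_text`, and the file name `result.txt` are hypothetical, and it writes to a temporary directory rather than to the paths used above.

```python
import os
import tempfile

def save_top_words(pairs, path):
    # "w" overwrites the file; mode "a" would append to earlier results instead
    with open(path, "w", encoding="utf-8") as f:
        for word, count in pairs:
            f.write(f"{word} {count}\n")

def load_text(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

# Round trip through a temporary directory so the sketch does not touch real files
with tempfile.TemporaryDirectory() as tmp:
    result_path = os.path.join(tmp, "result.txt")
    save_top_words([("night", 3), ("falls", 2)], result_path)
    saved = load_text(result_path)

print(saved)
```

Passing `encoding="utf-8"` explicitly matters on Windows, where the default text encoding is usually not UTF-8.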

    Experiment results:

           

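    2. Chinese word frequency count:

    Download a long Chinese article.

    Read the text to be analyzed from a file.

    # read the file content; jieba needs a string, not a file object
    news = open('gzccnews.txt', 'r', encoding='utf-8').read()

    Install jieba and use it for Chinese word segmentation.

    pip install jieba

    import jieba

    # lcut already returns a list of words
    jieba.lcut(news)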

    Build the word frequency count.

    Sort.

    Exclude function words: pronouns, articles, and conjunctions.

    Print the 20 most frequent words (or save the result to a file).

    import jieba
    f = open("F:/study/大三/大数据/中文词频.txt", "r", encoding="utf-8")
    str1 = f.read()
    f.close()
    stringList = list(jieba.cut(str1))

    # tokens to exclude from the count
    delset = {"","","","","",""," ","","",""}
    stringset = set(stringList) - delset

    countdict = {}
    for i in stringset:
        countdict[i] = stringList.count(i)

    dictList = list(countdict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)

    f = open("F:/study/大三/大数据/resulet.txt", "a", encoding="utf-8")
    for word, count in dictList[:20]:
        f.write('\n' + word + " " + str(count))
    f.close()
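
The English and Chinese versions differ only in how the text is split into tokens, so the counting steps can be shared. Below is a rough sketch under that assumption: `top_words` and its `tokenize` parameter are hypothetical names, and `str.split` stands in for a real segmenter so the example runs without jieba installed.

```python
from collections import Counter

def top_words(text, tokenize, stop_words=frozenset(), n=20):
    # tokenize is any callable that turns a string into a list of tokens
    tokens = [t for t in tokenize(text) if t.strip() and t not in stop_words]
    return Counter(tokens).most_common(n)

# str.split stands in for a real segmenter here; with jieba installed,
# jieba.lcut could be passed as the tokenize argument instead.
sample = "大数据 分析 大数据 词频 的 的 的"
result = top_words(sample, str.split, stop_words={"的"}, n=2)
print(result)
```

Filtering on `t.strip()` drops empty and whitespace-only tokens, which segmenters commonly emit for spaces and line breaks.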

  • Original article: https://www.cnblogs.com/Ming-jay/p/8658462.html