  • Course assignment (comprehensive exercise): English word frequency counting

    For this word-frequency exercise I used the lyrics of Amazing Grace, which read roughly as follows:

    Amazing Grace, how sweet the sound,
    That saved a wretch like me....
    I once was lost but now am found,
    Was blind, but now, I see.
    
    T'was Grace that taught...
    my heart to fear.
    And Grace, my fears relieved.
    How precious did that Grace appear...
    the hour I first believed.
    
    Through many dangers, toils and snares...
    we have already come.
    T'was Grace that brought us safe thus far...
    and Grace will lead us home.
    
    The Lord has promised good to me...
    His word my hope secures.
    He will my shield and portion be...
    as long as life endures.
    
    When we've been here ten thousand years...
    bright shining as the sun.
    We've no less days to sing God's praise...
    then when we've first begun.
    
    Amazing Grace, how sweet the sound,
    That saved a wretch like me....
    I once was lost but now am found,
    Was blind, but now, I see.
    

    I saved the lyrics above in a file named lyric.txt. With that in place, the code can be written as follows:

    with open('lyric.txt', 'r') as file:  # Open the file read-only; it is closed automatically on exit
        lyrics = file.read()  # Read the whole file
    lyrics = lyrics.lower()  # Convert everything to lowercase
    
    # Remove conjunctions, articles, and similar filler words.
    # Matching on word boundaries (\b) leaves longer words such as 'then', 'has', and 'toils' untouched.
    import re
    lyrics = re.sub(r"\b(the|as|but|that|and|to)\b", ' ', lyrics)
    lyrics = lyrics.replace("we've", 'we').replace("we're", 'we')  # Normalize contractions to 'we'
    lyrics = lyrics.replace(',', ' ').replace('.', ' ')  # Replace punctuation with spaces
    # print(lyrics)
    # print('-' * 100)
    
    lyricsWordList = lyrics.split()  # Split the lyrics into a list of individual words
    lyricsWordSet = set(lyricsWordList)  # Convert the list to a set to drop duplicates
    # print(lyricsWordList)
    # print('-' * 100)
    # print(lyricsWordSet)
    # print('-' * 100)
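    lyricsDict = {}  # Initialize the word-count dictionary
    # Iterate over the set (its elements are unique, so each word is counted exactly once)
    for word in lyricsWordSet:
        lyricsDict[word] = lyricsWordList.count(word)  # Key: the word; value: its number of occurrences in the list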
    # print(lyricsDict)
    # print('-' * 100)
    lyricsWordListSort = sorted(lyricsDict.items(), key=lambda d: d[1], reverse=True)  # Turn the dict items into a list sorted by count, highest first
    # print(lyricsWordListSort)
    # print('-' * 100)
    
    # Print the ten most frequent words (slicing is safe even if there are fewer than ten)
    for item in lyricsWordListSort[:10]:
        print(item)
    
    
    # Output (words tied on the same count may appear in a different order from run to run):
    # ('grace', 7)
    # ('i', 5)
    # ('was', 4)
    # ('we', 4)
    # ('my', 4)
    # ('now', 4)
    # ('me', 3)
    # ('how', 3)
    # ('amazing', 2)
    # ('sound', 2)
    
    
    
    

    The code above is heavily commented. If you spot any mistakes, please get in touch; I can be reached by email.

    Reference links
    Sorting: https://www.cnblogs.com/timtike/p/6562402.html
    File operations: http://www.jb51.net/article/87398.htm
    Python 3 basics: http://www.runoob.com/python3/python3-tutorial.html

  • Original article: https://www.cnblogs.com/lger/p/8619760.html