  • Coursework (comprehensive exercise): English word frequency count

    For the word-frequency count I used the lyrics of Amazing Grace, which go roughly as follows:

    Amazing Grace, how sweet the sound,
    That saved a wretch like me....
    I once was lost but now am found,
    Was blind, but now, I see.
    
    T'was Grace that taught...
    my heart to fear.
    And Grace, my fears relieved.
    How precious did that Grace appear...
    the hour I first believed.
    
    Through many dangers, toils and snares...
    we have already come.
    T'was Grace that brought us safe thus far...
    and Grace will lead us home.
    
    The Lord has promised good to me...
    His word my hope secures.
    He will my shield and portion be...
    as long as life endures.
    
    When we've been here ten thousand years...
    bright shining as the sun.
    We've no less days to sing God's praise...
    then when we've first begun.
    
    Amazing Grace, how sweet the sound,
    That saved a wretch like me....
    I once was lost but now am found,
    Was blind, but now, I see.
    

    I saved the lyrics above in a file named lyric.txt. With that in place, the code can be written as follows:

    with open('lyric.txt', 'r') as file:  # open the file read-only; it is closed automatically
        lyrics = file.read()  # read the whole file
    lyrics = lyrics.lower()  # lowercase everything
    
    # Remove conjunctions, articles, and similar words.
    # str.replace works on substrings, so 'was' is temporarily renamed to 'wes'
    # to protect it from the 'as' replacement and is restored afterwards.
    lyrics = (lyrics.replace('the', '').replace('was', 'wes').replace('as', '').replace('wes', 'was').replace('but', '')
              .replace('that', '').replace('and', ' ').replace('to', ' ').replace("we've", 'we').replace("we're", 'we'))
    lyrics = lyrics.replace(',', ' ').replace('.', ' ')
    # print(lyrics)
    # print('-' * 100)
    
    lyricsWordList = lyrics.split()  # split the lyrics into a list of individual words
    lyricsWordSet = set(lyricsWordList)  # convert the list to a set to drop duplicates
    # print(lyricsWordList)
    # print('-' * 100)
    # print(lyricsWordSet)
    # print('-' * 100)
    lyricsDict = {}  # initialize the word-count dictionary
    # Iterate over the set (its elements are unique, so each word is counted once)
    for word in lyricsWordSet:
        lyricsDict[word] = lyricsWordList.count(word)  # key: the word; value: its number of occurrences in the list
    # print(lyricsDict)
    # print('-' * 100)
    lyricsWordListSort = sorted(lyricsDict.items(), key=lambda d: d[1], reverse=True)  # list of (word, count) pairs, highest count first
    # print(lyricsWordListSort)
    # print('-' * 100)
    
    # Print the ten most frequent words (fewer if the list is shorter)
    for item in lyricsWordListSort[:10]:
        print(item)
    
    
    # Output (words with equal counts may appear in a different order between runs):
    # ('grace', 7)
    # ('i', 5)
    # ('was', 4)
    # ('we', 4)
    # ('my', 4)
    # ('now', 4)
    # ('me', 3)
    # ('how', 3)
    # ('amazing', 2)
    # ('sound', 2)
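
    One caveat about the approach above: str.replace matches substrings, so removing 'the' also turns "then" into "n", and removing 'to' breaks "toils". A possible alternative, sketched below, is to split the text into words first and only then drop whole stop words, with collections.Counter doing the counting and sorting. The names STOP_WORDS, CONTRACTIONS, and top_words are my own for illustration and are not part of the original code; the stop-word list just mirrors the words removed above.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; it mirrors the words the
# code above strips out and is not a standard list.
STOP_WORDS = {"the", "as", "but", "that", "and", "to"}

# Hypothetical mapping that folds contractions into their base word,
# matching the "we've" -> "we" replacement used above.
CONTRACTIONS = {"we've": "we", "we're": "we"}


def top_words(text, n=10):
    """Return the n most common words in text, ignoring stop words."""
    # Tokenize into runs of letters and apostrophes, so "t'was" and
    # "we've" stay whole while commas, periods, etc. are dropped.
    tokens = re.findall(r"[a-z']+", text.lower())
    words = [CONTRACTIONS.get(t, t) for t in tokens]
    # Filter whole words only, so "then" is never mangled by removing "the".
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


sample = "Grace, grace, grace! Then the sun and then the moon."
print(top_words(sample, 2))  # [('grace', 3), ('then', 2)]
```

    To run it on the lyrics, read lyric.txt into a string as before and pass that string to top_words. Because this version keeps words such as "then" and "has" intact, its counts can differ slightly from the output listed above.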
    
    
    
    

    The code above is heavily commented. If you spot any mistakes, feel free to get in touch by email.

    References
    Sorting: https://www.cnblogs.com/timtike/p/6562402.html
    File operations: http://www.jb51.net/article/87398.htm
    Python 3 basics: http://www.runoob.com/python3/python3-tutorial.html

  • Original post: https://www.cnblogs.com/lger/p/8619760.html