For this word-frequency count I used the lyrics of Amazing Grace, which are roughly as follows:
Amazing Grace, how sweet the sound,
That saved a wretch like me....
I once was lost but now am found,
Was blind, but now, I see.
T'was Grace that taught...
my heart to fear.
And Grace, my fears relieved.
How precious did that Grace appear...
the hour I first believed.
Through many dangers, toils and snares...
we have already come.
T'was Grace that brought us safe thus far...
and Grace will lead us home.
The Lord has promised good to me...
His word my hope secures.
He will my shield and portion be...
as long as life endures.
When we've been here ten thousand years...
bright shining as the sun.
We've no less days to sing God's praise...
then when we've first begun.
Amazing Grace, how sweet the sound,
That saved a wretch like me....
I once was lost but now am found,
Was blind, but now, I see.
I saved the lyrics above in a file named lyric.txt. With that in place, the code can be written as follows:
# Open the file read-only; the with block closes it automatically
with open('lyric.txt', 'r') as file:
    lyrics = file.read()  # read the whole file
lyrics = lyrics.lower()  # convert everything to lowercase
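# Remove conjunctions, articles and similar words ('was' is temporarily renamed to 'wes' so that removing 'as' leaves it intact).
# Note: str.replace works on substrings, so words such as "then" and "thousand" are altered as well.
lyrics = (lyrics.replace('the', '').replace('was', 'wes').replace('as', '').replace('wes', 'was').replace('but', '')
          .replace('that', '').replace('and', ' ').replace('to', ' ').replace("we've", 'we').replace("we're", 'we'))
lyrics = lyrics.replace(',', ' ').replace('.', ' ')  # replace punctuation with spaces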
# print(lyrics)
# print('-' * 100)
lyricsWordList = lyrics.split()  # split the lyrics into a list of individual words
lyricsWordSet = set(lyricsWordList)  # convert the list to a set, which drops duplicate words
# print(lyricsWordList)
# print('-' * 100)
# print(lyricsWordSet)
# print('-' * 100)
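lyricsDict = {}  # initialize the word-count dictionary
# Iterate over the set (its elements are unique, so each word is counted exactly once)
for word in lyricsWordSet:
    lyricsDict[word] = lyricsWordList.count(word)  # key: the word from the set; value: its number of occurrences in the list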
# print(lyricsDict)
# print('-' * 100)
lyricsWordListSort = sorted(lyricsDict.items(), key=lambda d: d[1], reverse=True)  # turn the dictionary items into a list sorted by count, highest first
# print(lyricsWordListSort)
# print('-' * 100)
# Print the ten most frequent words
for i in range(10):
    if i >= len(lyricsWordListSort):
        break
    print(lyricsWordListSort[i])
# Output:
# ('grace', 7)
# ('i', 5)
# ('was', 4)
# ('we', 4)
# ('my', 4)
# ('now', 4)
# ('me', 3)
# ('how', 3)
# ('amazing', 2)
# ('sound', 2)
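The chained replace calls above operate on substrings, so removing 'the' also turns "then" into "n", and removing 'and' cuts "thousand" short. One possible way around this, shown here only as a minimal sketch, is to split the text into words first and then drop whole words. The names STOP_WORDS and tokenize are made up for this illustration, and the stop-word list simply mirrors the words removed in the code above.

```python
import re

# Hypothetical stop-word list for illustration; it mirrors the words removed above.
STOP_WORDS = {"the", "as", "but", "that", "and", "to"}


def tokenize(text):
    """Lowercase the text and return its words with stop words removed."""
    # [a-z']+ keeps apostrophes, so "we've" stays a single token.
    words = re.findall(r"[a-z']+", text.lower())
    # Filtering whole words leaves "then" and "thousand" untouched.
    return [w for w in words if w not in STOP_WORDS]


sample = "Then when we've first begun, and the sun has ten thousand years."
print(tokenize(sample))
```

Because the filter compares whole tokens, no temporary renaming such as 'was' to 'wes' is needed. Contractions like "we've" are kept as they are in this sketch; mapping them to "we" would be an extra step.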
The code above is heavily commented. If you spot any mistakes, please get in touch; you can reach me by email.
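For comparison, the counting and sorting steps (the set, the dictionary, and the call to sorted) could also be handled by collections.Counter from the standard library. This is only a sketch of an alternative, not the approach used above; the helper name top_words and the sample word list are made up for illustration.

```python
from collections import Counter


def top_words(words, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    # Counter tallies every word in one pass; most_common sorts by count, highest first.
    return Counter(words).most_common(n)


# Hypothetical sample input: a short, already cleaned list of words.
words = "amazing grace how sweet the sound grace grace".split()
print(top_words(words, 3))
```

If the list holds fewer than n distinct words, most_common simply returns all of them, so no explicit length check is needed.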
References
Sorting: https://www.cnblogs.com/timtike/p/6562402.html
File operations: http://www.jb51.net/article/87398.htm
Python 3 basics: http://www.runoob.com/python3/python3-tutorial.html