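How to count the frequency of elements in a sequence
Example problems
How do you find the 3 most frequent elements in a random sequence such as [1, 5, 6, 5, 3, 2, 1, 0, 6, 1, 6]?
How do you find the 5 most frequent words in an English article?
Convert the sequence into a dictionary (element: frequency), then sort by the dictionary's values
List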
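    from random import randint

    # Build a random list of 30 integers in the range [0, 10]
    list1 = [randint(0, 10) for _ in range(30)]
    print(list1)

    # Start every distinct element at 0, then count its occurrences
    dict1 = dict.fromkeys(list1, 0)
    for item in list1:
        dict1[item] += 1

    # List comprehension: build (frequency, element) pairs, sort descending, keep the top 3
    dict_res1 = sorted([(v, k) for k, v in dict1.items()], reverse=True)[:3]
    print(dict_res1)

    # Generator expression: same result without the intermediate list
    dict_res2 = sorted(((v, k) for k, v in dict1.items()), reverse=True)[:3]
    print(dict_res2)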
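Analysis: the generator expression saves memory compared with the list comprehension, because the (frequency, element) pairs are passed to sorted() one at a time instead of first being collected into a temporary list. Note that sorted() still builds its own list of all the pairs, so the saving is the temporary list only.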
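To make the approach concrete, here is a minimal sketch that applies it to the fixed sequence from the first example question rather than to random data, so the result can be checked by hand. The variable names seq, freq and top3 are chosen for this illustration only.

```python
# Fixed input taken from the first example question
seq = [1, 5, 6, 5, 3, 2, 1, 0, 6, 1, 6]

# Count occurrences: element -> frequency
freq = dict.fromkeys(seq, 0)
for item in seq:
    freq[item] += 1

# Sort (frequency, element) pairs in descending order and keep the top 3.
# Ties on frequency are broken by the larger element, because tuples
# compare field by field.
top3 = sorted(((v, k) for k, v in freq.items()), reverse=True)[:3]
print(top3)  # [(3, 6), (3, 1), (2, 5)]
```

Both 1 and 6 occur three times; 6 comes first only because of the tuple comparison, not because it is "more frequent".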
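When the list is very large and we only need the 3 most frequent elements, sorting the whole list is clearly wasteful. In this situation the usual choice is a heap.
Using a heap (heapq)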
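    from random import randint
    import heapq

    # Build a random list of 30 integers in the range [0, 10]
    list1 = [randint(0, 10) for _ in range(30)]
    print(list1)

    # Count occurrences of each element
    dict1 = dict.fromkeys(list1, 0)
    for item in list1:
        dict1[item] += 1

    # Pick the 3 largest (frequency, element) pairs without sorting everything
    res = heapq.nlargest(3, ((v, k) for k, v in dict1.items()))
    print(res)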
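As a rough check that the heap-based selection gives the same answer as a full sort, the sketch below runs both on a larger random input. The helper names top_k_sorted and top_k_heap are hypothetical, introduced here for illustration. The full sort costs O(n log n) over the n distinct elements, while selecting k items with a heap is roughly O(n log k), which matters when n is large and k is small.

```python
import heapq
from random import randint


def top_k_sorted(freq, k):
    """Sort every (frequency, element) pair, then slice off the first k."""
    return sorted(((v, key) for key, v in freq.items()), reverse=True)[:k]


def top_k_heap(freq, k):
    """Select the k largest (frequency, element) pairs with heapq."""
    return heapq.nlargest(k, ((v, key) for key, v in freq.items()))


# Larger random input: 100,000 integers in the range [0, 1000]
data = [randint(0, 1000) for _ in range(100_000)]
freq = {}
for item in data:
    freq[item] = freq.get(item, 0) + 1

# Both strategies return the same pairs in the same order
assert top_k_sorted(freq, 3) == top_k_heap(freq, 3)
print(top_k_heap(freq, 3))
```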
Using the Counter object from collections
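    from random import randint
    from collections import Counter

    # Build a random list of 30 integers in the range [0, 10]
    list1 = [randint(0, 10) for _ in range(30)]
    print(list1)

    # Counter counts the elements itself, so no hand-built dictionary is needed
    counter1 = Counter(list1)

    # most_common(3) returns the 3 most frequent (element, frequency) pairs
    res = counter1.most_common(3)
    print(res)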
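The following sketch runs Counter on the fixed sequence from the first example question and shows two other behaviours that are handy for frequency counting: looking up a missing element, and adding more observations later. Note that most_common returns (element, frequency) pairs, the reverse of the (frequency, element) tuples used in the sorting and heap versions.

```python
from collections import Counter

# Fixed input taken from the first example question
seq = [1, 5, 6, 5, 3, 2, 1, 0, 6, 1, 6]

counter = Counter(seq)         # counts every element in one pass
top3 = counter.most_common(3)  # list of (element, frequency) pairs
print(top3)

print(counter[1])   # 3
print(counter[42])  # 0: a missing element counts as zero, no KeyError

counter.update([5, 5])  # add more observations to the existing counts
print(counter[5])       # 4
```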
Word frequency example
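    import re
    from collections import Counter

    # Read the whole article; the file is closed automatically
    with open('note.txt') as f:
        txt = f.read()

    # Split on runs of non-word characters (note the raw string and the backslash);
    # drop the empty strings that split can leave at either end
    word_list = [w for w in re.split(r'\W+', txt) if w]

    counter1 = Counter(word_list)
    # The 5 most frequent words, as asked in the example problem
    res = counter1.most_common(5)
    print(res)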
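For trying the idea without a file on disk, here is a small self-contained sketch that works on an inline string. The function name top_words and the sample sentence are hypothetical, and lowercasing the text before counting is an assumption of this sketch (so that "The" and "the" are counted together); the original example does not do this.

```python
import re
from collections import Counter


def top_words(text, n):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercasing is an assumption of this sketch, so "The" and "the" match
    words = re.findall(r'\w+', text.lower())
    return Counter(words).most_common(n)


sample = "The cat saw the dog. The dog saw the cat run."
print(top_words(sample, 1))  # [('the', 4)]
```

Using re.findall with \w+ collects the words directly, which avoids the empty strings that re.split can produce at the start or end of the text.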
Reference: python3实用编程技巧进阶 (a course on advanced practical Python 3 programming techniques)