Two Ways to Implement Word Frequency Counting


Method 1: vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE - 1))

Example:

from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
c = Counter(colors)
print(dict(c))  # {'red': 2, 'blue': 3, 'green': 1}

most_common(k): returns the k most frequent elements together with their counts, ordered from most to least frequent.
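
As a minimal sketch of how the one-liner in Method 1 behaves, the snippet below caps a vocabulary at a fixed size. The token list and the value of MAX_VOCAB_SIZE are made up for illustration. Subtracting 1 presumably leaves one slot free for an unknown-word token, but the source does not say so.

```python
from collections import Counter

# Hypothetical inputs, chosen only for illustration.
text = ['red', 'blue', 'red', 'green', 'blue', 'blue', 'yellow']
MAX_VOCAB_SIZE = 3

# most_common(k) returns a list of (element, count) pairs, highest count first.
top = Counter(text).most_common(MAX_VOCAB_SIZE - 1)
print(top)    # [('blue', 3), ('red', 2)]

# dict() turns the pairs into a word -> count mapping.
vocab = dict(top)
print(vocab)  # {'blue': 3, 'red': 2}
```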

Method 2:

def generate_vocab_file(input_seg_file, output_vocab_file):
    # Assumes each input line is "<label> <word> <word> ...", already segmented.
    with open(input_seg_file, 'r', encoding='UTF-8') as f:
        lines = f.readlines()
    word_dict = {}
    for line in lines:
        # Split off the label at the first space only; the rest is the text.
        label, content = line.strip().split(' ', 1)
        for word in content.split():
            word_dict.setdefault(word, 0)
            word_dict[word] += 1
    # [(word, frequency), ...], most frequent first
    sorted_word_dict = sorted(
        word_dict.items(), key=lambda d: d[1], reverse=True)
    with open(output_vocab_file, 'w', encoding='UTF-8') as f:
        # Reserve the first entry for unknown words; one entry per line.
        f.write('<UNK> 10000000\n')
        for item in sorted_word_dict:
            f.write('%s %d\n' % (item[0], item[1]))
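
The counting and sorting steps of Method 2 can also be written with Counter. This is a sketch rather than a drop-in replacement: it works on in-memory lines instead of files, the helper name build_vocab_lines and the sample data are hypothetical, and it assumes each input line has the form "label, one space, space-separated words".

```python
from collections import Counter

def build_vocab_lines(lines):
    """Return 'word count' strings, most frequent first (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed format: the first token is the label, the rest is the text.
        label, content = line.split(' ', 1)
        counts.update(content.split())
    # most_common() with no argument returns every (word, count) pair,
    # sorted by count from highest to lowest.
    return ['%s %d' % (word, n) for word, n in counts.most_common()]

sample = ['sports ball game ball', 'tech chip game']
print(build_vocab_lines(sample))  # ['ball 2', 'game 2', 'chip 1']
```

Counter.update adds to the existing counts, so it replaces the setdefault and += pair in the loop.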

A similar implementation:

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
result = {}
for color in colors:
    if color not in result:
        result[color] = 1
    else:
        result[color] += 1
print(result)  # {'red': 2, 'blue': 3, 'green': 1}
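
The if/else branch above can be folded into a single line with dict.get and a default value. The sketch below uses the same sample list and also checks that the result matches what Counter produces.

```python
from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

result = {}
for color in colors:
    # get() returns 0 for a color not seen yet, so no explicit branch is needed.
    result[color] = result.get(color, 0) + 1

print(result)                     # {'red': 2, 'blue': 3, 'green': 1}
print(result == Counter(colors))  # True
```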


Original article: https://www.cnblogs.com/kpwong/p/13560766.html
