Method 1: vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))
Example:
from collections import Counter
colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
c = Counter(colors)
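print(dict(c))  # {'red': 2, 'blue': 3, 'green': 1}
most_common(k): returns the k most frequent elements together with their counts, ordered from most to least frequent.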
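To make the one-liner from the first approach concrete, here is a small sketch. MAX_VOCAB_SIZE and the sample text are made-up values for illustration, and reserving the remaining slot (the -1) for an unknown-word token such as <UNK> is an assumption about the intent, not something stated above.

```python
from collections import Counter

MAX_VOCAB_SIZE = 3  # hypothetical value, kept small so the truncation is visible
text = ['red', 'blue', 'red', 'green', 'blue', 'blue']

# most_common(k) returns a list of (element, count) pairs, highest count first
top = Counter(text).most_common(MAX_VOCAB_SIZE - 1)
print(top)  # [('blue', 3), ('red', 2)]

vocab = dict(top)

# Assumption: the slot left free by the -1 is used for an unknown-word token
word_to_id = {word: idx for idx, word in enumerate(vocab)}
word_to_id['<UNK>'] = len(word_to_id)
print(word_to_id)  # {'blue': 0, 'red': 1, '<UNK>': 2}
```

Words that fall outside the top k ('green' here) would then be looked up as <UNK>.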
Method 2:
def generate_vocab_file(input_seg_file, output_vocab_file):
    with open(input_seg_file, 'r', encoding='UTF-8') as f:
        lines = f.readlines()
    word_dict = {}
    for line in lines:
        # Each line is a label followed by the segmented text; split only once,
        # on the first run of whitespace, so content keeps all of its words.
        label, content = line.strip('\n').split(maxsplit=1)
        for word in content.split():
            word_dict.setdefault(word, 0)
            word_dict[word] += 1
    # [(word, frequency), ..., ()], sorted by frequency, highest first
    sorted_word_dict = sorted(
        word_dict.items(), key=lambda d: d[1], reverse=True)
    with open(output_vocab_file, 'w', encoding='UTF-8') as f:
        # Placeholder for unknown words, written first with a very large count
        f.write('<UNK> 10000000\n')
        for item in sorted_word_dict:
            f.write('%s %d\n' % (item[0], item[1]))
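The counting loop of the second approach can also lean on Counter, which ties the two approaches together. The following is only a sketch, not the original code: the function name build_vocab_lines and the sample lines are hypothetical, it returns the output lines instead of writing a file, and it assumes each input line is a label followed by whitespace-separated words.

```python
from collections import Counter

def build_vocab_lines(lines, unk_count=10000000):
    """Return 'word count' lines, most frequent first, with <UNK> on top."""
    counter = Counter()
    for line in lines:
        parts = line.split(maxsplit=1)  # [label, content]
        if len(parts) < 2:
            continue  # skip lines that have a label but no content
        counter.update(parts[1].split())
    out = ['<UNK> %d' % unk_count]
    # most_common() with no argument returns every word, sorted by count
    out += ['%s %d' % (word, count) for word, count in counter.most_common()]
    return out

# Hypothetical sample input: label first, then the segmented text
sample = ['sports the match was great', 'tech the chip was fast']
vocab_lines = build_vocab_lines(sample)
print(vocab_lines[:3])  # ['<UNK> 10000000', 'the 2', 'was 2']
```

Writing the result out is then a matter of joining vocab_lines with newlines.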
A similar implementation:
colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
result = {}
for color in colors:
    if result.get(color) is None:
        result[color] = 1
    else:
        result[color] += 1
print(result)  # {'red': 2, 'blue': 3, 'green': 1}
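The if/else branch can be folded into one line by giving dict.get a default value, and the result matches what Counter produces. A short sketch on the same sample list:

```python
from collections import Counter

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

result = {}
for color in colors:
    # get() returns 0 for a color that has not been seen yet
    result[color] = result.get(color, 0) + 1

print(result)                            # {'red': 2, 'blue': 3, 'green': 1}
print(result == dict(Counter(colors)))   # True
```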