  • Word frequency counting: two implementations

    Method 1: vocab = dict(Counter(text).most_common(MAX_VOCAB_SIZE-1))

    Example:

    from collections import Counter 

    colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

    c = Counter(colors)

    print(dict(c))  # {'red': 2, 'blue': 3, 'green': 1}

    most_common(k): returns the k most frequent elements with their counts, ordered from most to least frequent.
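As a small sketch of how Method 1 can build a size-limited vocabulary: the names `MAX_VOCAB_SIZE` and `text` come from the one-liner above, but their values here and the `<UNK>` entry are assumptions chosen for illustration. Reserving one slot for an unknown-word token is a plausible reason for the `-1`, though the original does not say so.

```python
from collections import Counter

# Assumed toy inputs; the original does not give values for these names.
MAX_VOCAB_SIZE = 3
text = ['red', 'blue', 'red', 'green', 'blue', 'blue', 'yellow']

counts = Counter(text)

# most_common(k) returns a list of (element, count) pairs, highest count first.
top = counts.most_common(MAX_VOCAB_SIZE - 1)
print(top)  # [('blue', 3), ('red', 2)]

# Keep only the top entries as a plain dict.
vocab = dict(top)

# Assumption: the slot freed by the "-1" holds an unknown-word entry that
# collects the counts of every word left out of the vocabulary.
vocab['<UNK>'] = sum(counts.values()) - sum(vocab.values())
print(vocab)  # {'blue': 3, 'red': 2, '<UNK>': 2}
```

Words that fall outside the top `MAX_VOCAB_SIZE - 1` (here `green` and `yellow`) are folded into the single `<UNK>` entry.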

    Method 2:

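    def generate_vocab_file(input_seg_file, output_vocab_file):
        """Count the words in a segmented text file and write them by frequency."""
        word_dict = {}
        with open(input_seg_file, 'r', encoding='UTF-8') as f:
            for line in f:
                # Each line is "label content"; split off the label only,
                # because the content itself holds several words.
                parts = line.strip().split(None, 1)
                if len(parts) < 2:
                    continue  # skip blank lines and lines without content
                label, content = parts
                for word in content.split():
                    word_dict.setdefault(word, 0)
                    word_dict[word] += 1
        # [(word, frequency), ...], most frequent first
        sorted_word_dict = sorted(
            word_dict.items(), key=lambda d: d[1], reverse=True)
        with open(output_vocab_file, 'w', encoding='UTF-8') as f:
            # One "word count" entry per line (the line break is an assumption;
            # the original listing shows only a trailing space).
            f.write('<UNK> 10000000\n')
            for item in sorted_word_dict:
                f.write('%s %d\n' % (item[0], item[1]))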
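For comparison, the counting step of Method 2 can also be sketched with `Counter`, working on a list of lines in memory rather than on files. The helper name `count_words`, the sample lines, and the "label word word ..." layout are assumptions for illustration, not part of the original post.

```python
from collections import Counter

def count_words(lines):
    """Count words in 'label content' lines (hypothetical helper)."""
    counter = Counter()
    for line in lines:
        # Split off the label; the rest of the line is the content.
        parts = line.strip().split(None, 1)
        if len(parts) < 2:
            continue  # skip blank lines and lines with no content
        counter.update(parts[1].split())
    # most_common() with no argument returns all entries, highest count first.
    return counter.most_common()

# Made-up sample data in the assumed layout.
sample = [
    'sports ball game ball',
    'tech chip game',
    '',
]
vocab = count_words(sample)
print(vocab)  # [('ball', 2), ('game', 2), ('chip', 1)]
```

Words with equal counts keep the order in which they were first seen, which is why `ball` comes before `game`.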

    A similar implementation:

    colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']
    result = {}
    for color in colors:
        # First occurrence starts the count at 1; later ones increment it.
        if result.get(color) is None:
            result[color] = 1
        else:
            result[color] += 1
    print(result)  # {'red': 2, 'blue': 3, 'green': 1}
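The explicit check for a missing key can be avoided with `collections.defaultdict`. The sketch below is an alternative offered for comparison, not the author's method; it also confirms that the manual tally matches what `Counter` produces for the same list.

```python
from collections import Counter, defaultdict

colors = ['red', 'blue', 'red', 'green', 'blue', 'blue']

# defaultdict(int) supplies 0 for a missing key, so no explicit check is needed.
result = defaultdict(int)
for color in colors:
    result[color] += 1

print(dict(result))  # {'red': 2, 'blue': 3, 'green': 1}

# The manual tally and Counter agree on every count.
same = dict(result) == dict(Counter(colors))
print(same)  # True
```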

  • Original article: https://www.cnblogs.com/kpwong/p/13560766.html