  • Python Daily Exercise: Counting Word Frequencies in a Document

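    Problem: You have a diary entry. To avoid word-segmentation issues, assume the content is entirely in English. Count how often each word appears in the diary.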

    Requirements: 1. Output the frequency of each word as a dictionary. 2. Keep the algorithm as concise as possible.

    Method 1: Use re.findall to list all the words, then use collections.Counter to count them.

    import re
    from collections import Counter

    b = []
    with open(r'c:hello.log', 'r') as f:
        for i in f:
            # Find every word, where a word is a run of letters or digits.
            # findall returns the matches as a list, so no further conversion is needed.
            s = re.findall(r'[a-zA-Z0-9]+', i.lower())
            b.extend(s)
        print(b)
        print(Counter(b))

    The output is as follows:

    ['everyone', 'has', 'their', 'own', 'dreams', 'i', 'am', 'the', 'same', 'but', 'my', 'dream', 'is', 'not', 'a', 'lawyer', 'not', 'a', 'doctor', 'not', 'actors', 'not', 'even', 'an', 'industry', 'perhaps', 'my', 'dream', 'big', 'people', 'will', 'find', 'it', 'ridiculous', 'but', 'this', 'has', 'been', 'my', 'pursuit', 'my', 'dream', 'is', 'to', 'want', 'to', 'have', 'a', 'folk', 'life', 'i', 'want', 'it', 'to', 'become', 'a', 'beautiful', 'painting', 'it', 'is', 'not', 'only', 'sharp', 'colors', 'but', 'also', 'the', 'colors', 'are', 'bleak', 'i', 'do', 'not', 'rule', 'out', 'the', 'painting', 'is', 'part', 'of', 'the', 'black', 'but', 'i', 'will', 'treasure', 'these', 'bleak', 'colors', 'not', 'yet', 'how', 'about', 'a', 'colorful', 'painting', 'if', 'not', 'bleak', 'add', 'color', 'how', 'can', 'it', 'more', 'prominent', 'american', 'life', 'is', 'like', 'painting', 'ainting', 'the', 'bright', 'red', 'color', 'represents', 'life', 'beautiful', 'happy', 'moments', 'painting', 'a', 'bleak', 'color', 'represents', 'life', 'difficult', 'unpleasant', 'time', 'you', 'may', 'find', 'a', 'flat', 'with', 'a', 'beautiful', 'road', 'is', 'not', 'very', 'good', 'yet', 'but', 'i', 'do', 'not', 'think', 'it', 'will', 'if', 'a', 'person', 'lives', 'flat', 'then', 'what', 'is', 'the', 'point', 'life', 'is', 'only', 'a', 'short', 'few', 'decades', 'i', 'want', 'it', 'to', 'go', 'finally', 'each', 'memory', 'is', 'a', 'solid']
    Counter({'a': 11, 'not': 10, 'is': 9, 'i': 6, 'the': 6, 'it': 6, 'but': 5, 'life': 5, 'painting': 5, 'my': 4, 'to': 4, 'bleak': 4, 'dream': 3, 'will': 3, 'want': 3, 'beautiful': 3, 'colors': 3, 'color': 3, 'has': 2, 'find': 2, 'only': 2, 'do': 2, 'yet': 2, 'how': 2, 'if': 2, 'represents': 2, 'flat': 2, 'everyone': 1, 'their': 1, 'own': 1, 'dreams': 1, 'am': 1, 'same': 1, 'lawyer': 1, 'doctor': 1, 'actors': 1, 'even': 1, 'an': 1, 'industry': 1, 'perhaps': 1, 'big': 1, 'people': 1, 'ridiculous': 1, 'this': 1, 'been': 1, 'pursuit': 1, 'have': 1, 'folk': 1, 'become': 1, 'sharp': 1, 'also': 1, 'are': 1, 'rule': 1, 'out': 1, 'part': 1, 'of': 1, 'black': 1, 'treasure': 1, 'these': 1, 'about': 1, 'colorful': 1, 'add': 1, 'can': 1, 'more': 1, 'prominent': 1, 'american': 1, 'like': 1, 'ainting': 1, 'bright': 1, 'red': 1, 'happy': 1, 'moments': 1, 'difficult': 1, 'unpleasant': 1, 'time': 1, 'you': 1, 'may': 1, 'with': 1, 'road': 1, 'very': 1, 'good': 1, 'think': 1, 'person': 1, 'lives': 1, 'then': 1, 'what': 1, 'point': 1, 'short': 1, 'few': 1, 'decades': 1, 'go': 1, 'finally': 1, 'each': 1, 'memory': 1, 'solid': 1})
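
    The requirement asks for dictionary output. Counter is a subclass of dict, so the result above already behaves like one. The minimal sketch below shows how it could be turned into a plain dict, how the most frequent words could be listed, and how counts could be converted to relative frequencies. The sample string and the names text, freq, top, and rel are assumptions made for illustration and do not come from the original post.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the diary file.
text = "My dream is not a lawyer, not a doctor. My dream is big."

# Same tokenization as Method 1: lowercase, then runs of letters or digits.
words = re.findall(r'[a-zA-Z0-9]+', text.lower())
counts = Counter(words)

# Counter is a dict subclass; dict() yields a plain dictionary.
freq = dict(counts)
print(freq)

# most_common(n) returns the n most frequent words as (word, count) pairs.
top = counts.most_common(3)
print(top)

# Relative frequency: each count divided by the total number of words.
total = sum(counts.values())
rel = {word: count / total for word, count in counts.items()}
print(rel)
```

    Whether raw counts or relative frequencies are the better answer depends on how "frequency" in the problem statement is read; the original post reports raw counts.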

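     Method 2: Use re.sub to replace every run of non-word characters with a space, use strip to remove the leading and trailing spaces, split the line into a list, and then count the words.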

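    import re
    from collections import Counter

    b = []
    with open(r'c:hello.log', 'r') as f:
        for i in f:
            # Replace every run of characters that are not letters or digits with one space.
            s = re.sub(r'[^a-zA-Z0-9]+', ' ', i.lower())
            print(s)
            # Trim the ends, then split on whitespace to get the list of words.
            b.extend(s.strip().split())
        print(Counter(b))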

     The output is as follows:

    everyone has their own dreams 
    i am the same but my dream is not a lawyer 
    not a doctor not actors not even an industry 
     perhaps my dream big people will find it ridiculous 
    but this has been my pursuit my dream is to want to have a folk life 
    i want it to become a beautiful painting it is not only sharp colors 
     but also the colors are bleak i do not rule out the painting is part of the black 
    but i will treasure these bleak colors not yet how about a colorful painting 
    if not bleak add color how can it more prominent american life is like painting 
    ainting the bright red color represents life beautiful happy moments painting a bleak 
    color represents life difficult unpleasant time you may find a flat with a beautiful 
    road is not very good yet but i do not think it will if a person lives flat then what 
     is the point life is only a short few decades i want it to go finally each memory is 
     a solid 
    Counter({'a': 11, 'not': 10, 'is': 9, 'i': 6, 'the': 6, 'it': 6, 'but': 5, 'life': 5, 'painting': 5, 'my': 4, 'to': 4, 'bleak': 4, 'dream': 3, 'will': 3, 'want': 3, 'beautiful': 3, 'colors': 3, 'color': 3, 'has': 2, 'find': 2, 'only': 2, 'do': 2, 'yet': 2, 'how': 2, 'if': 2, 'represents': 2, 'flat': 2, 'everyone': 1, 'their': 1, 'own': 1, 'dreams': 1, 'am': 1, 'same': 1, 'lawyer': 1, 'doctor': 1, 'actors': 1, 'even': 1, 'an': 1, 'industry': 1, 'perhaps': 1, 'big': 1, 'people': 1, 'ridiculous': 1, 'this': 1, 'been': 1, 'pursuit': 1, 'have': 1, 'folk': 1, 'become': 1, 'sharp': 1, 'also': 1, 'are': 1, 'rule': 1, 'out': 1, 'part': 1, 'of': 1, 'black': 1, 'treasure': 1, 'these': 1, 'about': 1, 'colorful': 1, 'add': 1, 'can': 1, 'more': 1, 'prominent': 1, 'american': 1, 'like': 1, 'ainting': 1, 'bright': 1, 'red': 1, 'happy': 1, 'moments': 1, 'difficult': 1, 'unpleasant': 1, 'time': 1, 'you': 1, 'may': 1, 'with': 1, 'road': 1, 'very': 1, 'good': 1, 'think': 1, 'person': 1, 'lives': 1, 'then': 1, 'what': 1, 'point': 1, 'short': 1, 'few': 1, 'decades': 1, 'go': 1, 'finally': 1, 'each': 1, 'memory': 1, 'solid': 1})
  • Original article: https://www.cnblogs.com/xuehaiwuya0000/p/10132444.html