  • Count word frequencies in an article and get the 5 most frequent words with their counts (Python)

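    Exercise: count the word frequencies in an English article and get the 5 most frequent words with their counts (implemented in Python).

    Convert everything to lowercase with lower() before testing characters.

    How do we decide what a word is?
    1 Treat every non-letter character as a separator and split the string (to avoid handling each special character separately, replace them all with '-')
    2 Split with a regular expression
    3 Walk through the string and collect each word
    4 Match words with a regular expression

    How do we count?
    Put each word from the word list and its count into a dict, then sort.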
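
As a small illustration of the "dict, then sort" idea, here is a minimal sketch. The sample list `words` is hypothetical and not taken from the article, and the single-pass `dict.get` update is just one possible way to fill the dict.

```python
# Hypothetical sample list, not from the article
words = ['the', 'master', 'host', 'the', 'locust', 'the', 'host']

counts = {}
for w in words:
    # dict.get returns 0 the first time a word is seen
    counts[w] = counts.get(w, 0) + 1

# Sort the (word, count) pairs by count, largest first, and keep the top 2
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:2]
print(top)  # [('the', 3), ('host', 2)]
```

Because `sorted` is stable, words with equal counts keep the order in which they first appeared.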


    '''
    dinghanhua
    2018-11-11
    Exercise: given an English article, count the frequency of every word and return the 5 most frequent words with their counts
    '''
    
    import re
    
    art = ' If we want to" run Locust  / distributed on multiple machines we would also have to specify the master host when starting the slaves (this is not needed when running Locust distributed on a single machine, since the master host defaults to 127.0.0.1):'
    
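    '''
    How do we decide what a word is?
    1 Treat every non-letter character as a separator and split the string
    2 Walk through the string and collect each word
    3 Match words with a regular expression
    
    How do we count?
    Put each word from the word list and its count into a dict, then sort
    '''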
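    word_dict = {} # for counting, word: count
    word_list = [] # holds all the words
    # Replace every run of non-letter characters with one common character; split() on it then yields the words
    pattern = r'[^a-z]+'
    art_new = re.sub(pattern,'-',art.lower()) # lowercase the text and replace all non-letters with '-'
    word_list = art_new.split('-') # split on '-' to get the words
    wordlist = list(filter(lambda x : x != '',word_list)) # drop empty strings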
    
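    print('All words:',wordlist)
    # Split with a regular expression
    pattern = r'[^a-z]+' # non-letters
    word_list = [w for w in re.split(pattern,art.lower()) if w] # re.split leaves empty strings at the ends, so drop them
    print(word_list)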
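    # Walk through the string character by character and append each word to the list (not recommended)
    word =''
    word_list2 = []
    
    for letter in art.lower():
        if letter.isalpha(): # a letter: append it to the current word
            word += letter
        else:
            if word != '':
                word_list2.append(word) # not a letter: if the current word is non-empty, append it to the list
                word = '' # reset the current word
    if word != '':
        word_list2.append(word) # keep the last word when the text ends with a letter
    print(word_list2)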
    # Match words with a regular expression
    pattern = r'[a-z]+'
    word_list3 = re.findall(pattern,art.lower())
    print(word_list3)
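
To see how the approaches relate, the following sketch runs three of them on one short string. The `sample` text is hypothetical and not from the article. It suggests that the regex split (once empty strings are filtered out) and `re.findall` agree, while the `isalpha()` loop can differ on non-ASCII letters, because `str.isalpha()` accepts any Unicode letter and `[a-z]` only matches ASCII.

```python
import re

# Hypothetical sample text; '\u00e9' is the letter e with an acute accent
sample = 'Run Locust, then run it again: caf\u00e9!'.lower()

# Regex split on non-letters, dropping the empty strings it leaves behind
by_split = [w for w in re.split(r'[^a-z]+', sample) if w]

# Regex match on runs of ASCII letters
by_findall = re.findall(r'[a-z]+', sample)

# Character loop; the extra trailing space flushes the last word
by_loop, word = [], ''
for ch in sample + ' ':
    if ch.isalpha():
        word += ch
    elif word:
        by_loop.append(word)
        word = ''

print(by_split == by_findall)        # True
print(by_findall[-1], by_loop[-1])   # caf café
```

For plain English text like the article's example the three lists should be identical; the difference only shows up once accented or other non-ASCII letters appear.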

    The final counting code:

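    # Count
    for word in set(word_list):
        word_dict[word] = word_list.count(word) # key = word, value = its count in the list
    
    # Take the five most frequent
    print(sorted(word_dict.items(),key = lambda x:x[1],reverse=True)[0:5]) # sort the dict items by value, descending, and take the first 5
    # Alternative way to build the same dict: create the keys from the list first, then fill in the counts
    word_dict = {}.fromkeys(word_list)
    for word in word_dict.keys():
        word_dict[word] = word_list.count(word)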
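
The same result can probably be reached more compactly with the standard library's `collections.Counter`. This is a swapped-in technique, not the article's method. It counts in a single pass instead of calling `list.count` once per distinct word. The helper name `top_words` and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

def top_words(text, n=5):
    """Return the n most common lowercase ASCII words as (word, count) pairs."""
    # Hypothetical helper: extract words by regex match, then count them
    words = re.findall(r'[a-z]+', text.lower())
    return Counter(words).most_common(n)

# Hypothetical sample sentence, not the article's text
print(top_words('the master host and the slaves: the host', 2))
# [('the', 3), ('host', 2)]
```

Calling `top_words(art)` on the article's string should give its five most frequent words; words with equal counts come back in the order they were first seen.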

    the end!

  • Original article: https://www.cnblogs.com/dinghanhua/p/9942691.html