  • Count the occurrences of each word in any English plain-text file

    Version 1: inefficient (scans the text one character at a time)

    path = 'test.txt'
    with open(path,encoding='utf-8',newline='') as f:
        word = []
        words_dict= {}
        for letter in f.read():
            if letter.isalnum():
                word.append(letter)
            elif letter.isspace():  # whitespace: space, \t, \n
                if word:
                    word = ''.join(word).lower()  # convert to lowercase
                    if word not in words_dict:
                        words_dict[word] = 1
                    else:
                        words_dict[word] += 1
                    word = []
    
    # Handle the last word (the file may not end with whitespace)
    if word:
        word = ''.join(word).lower()  # convert to lowercase
        if word not in words_dict:
            words_dict[word] = 1
        else:
            words_dict[word] += 1
        word = []
    
    for k,v in words_dict.items():
        print(k,v)
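
    Version 1 repeats the same counting logic in two places: once inside the loop and once for the last word. A minimal sketch of the same character-by-character scan with that logic factored into one helper is shown below. The names `count_words_by_char` and `flush` are hypothetical and do not come from the original post.

```python
def count_words_by_char(text):
    """Scan text one character at a time and count lowercase words."""
    counts = {}
    current = []  # characters of the word currently being built

    def flush():
        # Record the word collected so far (if any) and start a new one.
        if current:
            word = ''.join(current).lower()
            counts[word] = counts.get(word, 0) + 1
            current.clear()

    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif ch.isspace():
            flush()
        # Any other character (punctuation) is skipped, as in Version 1.
    flush()  # the text may not end with whitespace
    return counts
```

    Like Version 1, this drops punctuation inside a word, so "don't" is counted as "dont".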

    Version 2:

    Drawback: the whole file is read into memory at once, so it performs poorly on large files.

    import re

    path = 'test.txt'
    with open(path,'r',encoding='utf-8') as f:
        data = f.read()
        word_reg = re.compile(r'\w+')  # \w+ matches a run of word characters
        word_list = word_reg.findall(data)
        word_list = [word.lower() for word in word_list]  # convert to lowercase
        word_set = set(word_list)  # avoid querying the same word repeatedly
        # words_dict = {}
        # for word in word_set:
        #     words_dict[word] = word_list.count(word)
    
        # Concise form
        words_dict = {word: word_list.count(word) for word in word_set}
        for k,v in words_dict.items():
            print(k,v)
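
    The dict comprehension above calls `word_list.count(word)` once per distinct word, and each call scans the whole list, so the work grows with (distinct words × total words). A single pass over the matches avoids that. The sketch below is one possible alternative, not the original author's code; `WORD_RE` and `count_words` are hypothetical names.

```python
import re

WORD_RE = re.compile(r'\w+')  # a run of word characters


def count_words(text):
    """Count lowercase words in a single pass over the text."""
    counts = {}
    for match in WORD_RE.finditer(text):
        word = match.group().lower()
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "The cat saw the other cat."
print(count_words(sample))
```

    For the same input it should produce the same counts as the `list.count` version, only with one scan instead of one scan per distinct word.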

    Version 3:

    import re

    path = 'test.txt'
    with open(path, 'r', encoding='utf-8') as f:
        word_list = []
        word_reg = re.compile(r'\w+')
        for line in f:
            #line_words = word_reg.findall(line)
            # Simpler than the regex above, but punctuation stays attached to words
            line_words = line.split()
            word_list.extend(line_words)
        word_set = set(word_list)  # avoid querying the same word repeatedly
        words_dict = {word: word_list.count(word) for word in word_set}
        for k, v in words_dict.items():
            print(k, v)
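
    Switching from the regular expression to `str.split()` changes what counts as a word. The short comparison below illustrates the difference on a made-up sample line, assuming the regex approach uses the pattern `\w+`.

```python
import re

line = "Hello, world! Hello again.\n"

split_words = line.split()              # split on whitespace only
regex_words = re.findall(r'\w+', line)  # runs of word characters

print(split_words)  # ['Hello,', 'world!', 'Hello', 'again.']
print(regex_words)  # ['Hello', 'world', 'Hello', 'again']
```

    With `split()`, "Hello," and "Hello" are counted as different words. Neither approach lowercases the text, so that step still has to be applied separately.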

    Version 4: count with Counter

    from collections import Counter

    path = 'test.txt'
    with open(path, 'r', encoding='utf-8') as f:
        word_list = []
        for line in f:
            line_words = line.split()
            word_list.extend(line_words)

        words_dict = dict(Counter(word_list))  # count with Counter
        for k, v in words_dict.items():
            print(k, v)
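
    Version 4 still collects every word into `word_list` before counting. As a hedged sketch of one way to combine the ideas from the earlier versions, the function below updates a `Counter` line by line, so only the counts are kept in memory. The function name `count_file_words`, the `\w+` pattern, and the lowercasing step are assumptions carried over from Version 2, not part of the original Version 4.

```python
import re
from collections import Counter

WORD_RE = re.compile(r'\w+')  # assumption: same word pattern as Version 2


def count_file_words(path, encoding='utf-8'):
    """Count lowercase words in a text file, reading one line at a time."""
    counts = Counter()
    with open(path, 'r', encoding=encoding) as f:
        for line in f:
            # Counter.update accepts any iterable of hashable items.
            counts.update(w.lower() for w in WORD_RE.findall(line))
    return counts
```

    Calling `count_file_words('test.txt').most_common(10)` would then return the ten most frequent words with their counts, sorted from most to least common.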
  • Original article: https://www.cnblogs.com/hupeng1234/p/6680491.html