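  • Comprehensive exercise: word frequency statistics

    Comprehensive exercise

    Preprocessing for word frequency statistics

    Download the lyrics of an English song or an English article.

    Replace all separators such as , . ? ! ’ : with spaces.

    Convert all uppercase letters to lowercase.

    Generate the word list.

    Generate the word frequency counts.

    Sort.

    Exclude grammatical words: pronouns, articles, and conjunctions.

    Output the 20 most frequent words (TOP 20).

    Save the text to be analyzed as a UTF-8 encoded file, and obtain the content for word frequency analysis by reading that file.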
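
The last step asks for the text to be stored as a UTF-8 file and read back. A minimal sketch of that round trip, assuming a hypothetical file name `news.txt` and a short sample string standing in for the downloaded lyrics (the helper names `save_text` and `load_text` are illustrative only):

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path using UTF-8 encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_text(path):
    """Read the whole file back as a string, decoding it as UTF-8."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Hypothetical sample text; a temporary directory keeps the demo self-contained.
sample = "Oh my baby, it’s you!"
path = os.path.join(tempfile.mkdtemp(), "news.txt")
save_text(path, sample)
news = load_text(path)
print(news)
```

Passing `encoding="utf-8"` explicitly on both the write and the read avoids depending on the platform's default encoding, which matters once the text contains characters such as the curly apostrophe.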

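    Part 1:
    news="wo are faimly.w aor kb.jg,wo are w wo Oh ma baby nal ddeo nan geu dae mam sok geu ro bul reo bon da"

    # Alternatively, read the text from a UTF-8 file:
    # with open("news.txt", "r", encoding="utf-8") as f:
    #     news = f.read()
    # print(news)

    # Count word occurrences with a dictionary (key-value pairs)
    sep=''',?!."'''
    exclude={"wo","w"}

    # Replace every separator with a space
    for c in sep:
        news=news.replace(c," ")
    wordList=news.lower().split()  # convert to lowercase, then split into words
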
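    # Method 1: iterate over the set of distinct words
    wordDict={}
    wordSet=set(wordList)-exclude
    for w in wordSet:
        wordDict[w]=wordList.count(w)

    # Method 2: iterate over the word list
    # wordDict={}
    # for w in wordList:
    #     wordDict[w]=wordDict.get(w,0)+1
    #
    # for w in exclude:
    #     wordDict.pop(w,None)  # pop() does not fail if the word is absent

    # Sort by count in descending order and print the top 20
    dictList=list(wordDict.items())
    dictList.sort(key=lambda x:x[1],reverse=True)
    for item in dictList[:20]:
        print(item)

    Output (words with equal counts may appear in a different order):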
    ('are', 2)
    ('geu', 2)
    ('baby', 1)
    ('aor', 1)
    ('kb', 1)
    ('bul', 1)
    ('oh', 1)
    ('faimly', 1)
    ('bon', 1)
    ('jg', 1)
    ('reo', 1)
    ('nal', 1)
    ('nan', 1)
    ('ma', 1)
    ('sok', 1)
    ('da', 1)
    ('ro', 1)
    ('ddeo', 1)
    ('dae', 1)
    ('mam', 1)
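
Many words in this listing share a count of 1, and the order among words with the same count is not fixed, so it can differ between runs. A small sketch of one way to make the ranking reproducible: sort by count in descending order, then alphabetically. The function name `rank_words` and the sample counts are illustrative, not part of the original code:

```python
def rank_words(word_counts, top_n=20):
    """Return up to top_n (word, count) pairs, highest count first.

    Ties are broken alphabetically so the result is the same on every run.
    """
    return sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Illustrative counts taken from the listing above.
counts = {"geu": 2, "baby": 1, "are": 2, "aor": 1}
for pair in rank_words(counts):
    print(pair)
```
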

    # Save the results to a file (dictList is the sorted list of (word, count) pairs)
    with open('newscount.txt','a',encoding='utf-8') as f:
        for word,count in dictList[:20]:
            f.write(word+' '+str(count)+'\n')
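
For comparison, the standard library's `collections.Counter` can do the counting and the top-20 selection in a few lines. This is an alternative sketch, not the approach used above; the function name `top_words` and its parameters are assumptions made for illustration, with defaults copied from the `sep` and `exclude` values in the code:

```python
from collections import Counter


def top_words(text, separators=''',?!."''', exclude=frozenset({"wo", "w"}), n=20):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Replace each separator character with a space, as in the code above.
    for ch in separators:
        text = text.replace(ch, " ")
    # Lowercase, split on whitespace, and drop the excluded words.
    words = [w for w in text.lower().split() if w not in exclude]
    return Counter(words).most_common(n)


news = "wo are faimly.w aor kb.jg,wo are w wo Oh ma baby"
for word, count in top_words(news):
    print(word, count)
```

`most_common` returns pairs sorted by count in descending order, and words with equal counts stay in the order they were first seen, so the excluded words have to be filtered out before counting rather than deleted afterwards.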

      

  • Original article: https://www.cnblogs.com/wenjian1027/p/8659915.html