  • Comprehensive exercise: word frequency statistics

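    1. English word frequency statistics

    Download the lyrics of an English song, or an English article.

    Replace all separators such as , . ? ! ' : with spaces.

    Convert all uppercase letters to lowercase.

    Generate the word list.

    Generate the word frequency counts.

    Sort.

    Exclude function words: pronouns, articles, conjunctions.

    Output the top 20 words by frequency.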

    news='''
    I remember quite clearly now when the story happened. The autumn leaves were floating in measure down to the ground, recovering the lake, where we used to swim like children, under the sun was there to shine. That time we used to be happy. Well, I thought we were. But the truth was that you had been longing to leave me, not daring to tell me. On that precious night, watching the lake, vaguely conscious, you said: Our story is ending. 
    
    The rain was killing the last days of summer, you had been killing my last breath of love since along time ago. I still don't think I'm gonna make it through another love story. You took it all away from me. And there I stand, I knew I was going to be the one left behind. But still I'm watching the lake, vaguely Conscious, and I know my life is ending. 
    '''
    # Replace punctuation with spaces, then lowercase and split into words
    sep = ",.?!:;()\"'"
    for c in sep:
        news = news.replace(c, ' ')
    wordList = news.lower().split()
    for w in wordList:
        print(w)

    # Exclude function words before counting, so they cannot reach the top 20
    exclude = {'the','of','and','s','to','which','will','as','on','is','by'}
    wordSet = set(wordList) - exclude

    # Count how often each remaining word occurs
    wordDist = {}
    for w in wordSet:
        wordDist[w] = wordList.count(w)

    for w in wordDist:
        print(w, wordDist[w])

    # Sort by frequency in descending order and print the top 20
    dictList = list(wordDist.items())
    dictList.sort(key=lambda x: x[1], reverse=True)
    for item in dictList[:20]:
        print(item)
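
The same steps can also be written with `collections.Counter` from the standard library, which counts every word in one pass instead of calling `list.count` once per word. This is a minimal sketch, not the exercise's required solution: the helper name `top_words`, the stop-word set, and the sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word set for illustration; extend it as needed.
STOP_WORDS = {'the', 'of', 'and', 's', 'to', 'a', 'i', 'is', 'was'}

def top_words(text, n=20, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, ignoring stop words."""
    # Lowercase, then treat every run of non-letters as a separator.
    # Note: this also splits contractions, e.g. "don't" becomes "don" and "t".
    words = re.split(r'[^a-z]+', text.lower())
    counts = Counter(w for w in words if w and w not in stop_words)
    return counts.most_common(n)

sample = "The lake, the rain. The lake was there: Lake!"
print(top_words(sample, n=1))  # [('lake', 3)]
```

Passing the full lyrics as `text` with `n=20` gives the TOP20 list the exercise asks for.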
    

    Save the text to be analyzed as a UTF-8 encoded file, and load the content for word frequency analysis by reading that file.

    f=open('news.txt','r',encoding='utf-8')
    news=f.read()
    f.close()
    print(news)
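
A `with` block is a common alternative to calling `close()` by hand, because the file is closed even when an error occurs. The sketch below writes a sample text as UTF-8 and reads it back; the temporary directory and the sample sentence are assumptions chosen so the example runs anywhere.

```python
import os
import tempfile

text = "I remember quite clearly now when the story happened."

# Write the text as UTF-8 into a temporary directory (the path is illustrative).
path = os.path.join(tempfile.mkdtemp(), 'news.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read it back; the with-block closes the file automatically.
with open(path, 'r', encoding='utf-8') as f:
    news = f.read()

print(news == text)  # True
```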
    

     

    2. Chinese word frequency statistics

    Download a long Chinese article.

    Read the text to be analyzed from a file.

    with open('gzccnews.txt', 'r', encoding='utf-8') as f:
        news = f.read()

    Install jieba and use it for Chinese word segmentation.

    pip install jieba

    import jieba

    wordList = jieba.lcut(news)

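    Generate the word frequency counts.

    Sort.

    Exclude function words: pronouns, articles, conjunctions.

    Output the top 20 words by frequency (or save the result to a file).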

     

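    import jieba

    # Read the text to analyze
    file = open('hong.txt', 'r', encoding='utf-8')
    word = file.read()
    file.close()

    # Replace punctuation with spaces before segmenting
    sep = ''',。?“”:、?;!!'''
    for c in sep:
        word = word.replace(c, ' ')

    # Accurate mode; cut_for_search also emits sub-words of long words,
    # which would inflate the counts
    wordList = jieba.lcut(word)

    # Exclude whitespace tokens and function words before counting
    exclude = {' ', '\n', '\u3000', '了', '的', '他', '我', '也', '又', '是', '你', '着', '这', '就', '都', '呢', '只'}
    wordSet = set(wordList) - exclude

    wordDist = {}
    for w in wordSet:
        wordDist[w] = wordList.count(w)

    for w in wordDist:
        print(w, wordDist[w])

    # Sort by frequency in descending order
    dictList = list(wordDist.items())
    dictList.sort(key=lambda x: x[1], reverse=True)

    # Write the top 20 to a file, one "word count" pair per line
    f = open('hongcount.txt', 'w', encoding='utf-8')
    for w, n in dictList[:20]:
        f.write(w + ' ' + str(n) + '\n')
    f.close()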
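
Calling `list.count` for every distinct word is slow on a long novel, so a single-pass count with `collections.Counter` may be preferable. The sketch below is one possible arrangement under stated assumptions: the helper names `segment`, `count_tokens` and `save_counts`, the stop-word set, and the sample tokens are all hypothetical. `jieba` is imported lazily inside `segment`, so the counting part runs even where jieba is not installed.

```python
from collections import Counter

# Illustrative stop-word set (assumption); whitespace tokens are filtered separately.
STOP_WORDS = {'了', '的', '他', '我', '也', '又', '是', '你', '着', '这', '就', '都', '呢', '只'}
PUNCTUATION = set(',。?“”:、;!')

def segment(text):
    """Segment Chinese text with jieba (requires `pip install jieba`)."""
    import jieba
    return jieba.lcut(text)

def count_tokens(tokens, n=20):
    """Count tokens, skipping whitespace, punctuation and stop words."""
    counts = Counter(
        t for t in tokens
        if t.strip() and t not in PUNCTUATION and t not in STOP_WORDS
    )
    return counts.most_common(n)

def save_counts(pairs, path):
    """Write one 'word count' pair per line as UTF-8."""
    with open(path, 'w', encoding='utf-8') as f:
        for word, count in pairs:
            f.write(f'{word} {count}\n')

# Pre-segmented tokens stand in for segment(text) so the sketch runs without jieba.
tokens = ['宝玉', '笑', '了', ',', '黛玉', '也', '笑', '\u3000', '宝玉', '。']
print(count_tokens(tokens, n=3))
```

With jieba installed, `save_counts(count_tokens(segment(word)), 'hongcount.txt')` would cover the whole pipeline, where `word` is the text read from `hong.txt`.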
    

     

  • Original article: https://www.cnblogs.com/BOXczx/p/8666191.html