  • Comprehensive exercise: word frequency counting

    1. English word frequency count

    Download the lyrics of an English song, or an English article.

    article = '''An empty street
    An empty house
    A hole inside my heart
    I'm all alone
    The rooms are getting smaller
    I wonder how
    I wonder why
    I wonder where they are
    The days we had
    The songs we sang together
    Oh yeah
    And oh my love
    I'm holding on forever
    Reaching for a love that seems so far
    So i say a little prayer
    And hope my dreams will take me there
    Where the skies are blue to see you once again, my love
    Over seas and coast to coast
    To find a place i love the most
    Where the fields are green to see you once again, my love
    I try to read
    I go to work
    I'm laughing with my friends
    But i can't stop to keep myself from thinking
    Oh no I wonder how
    I wonder why
    I wonder where they are
    The days we had
    The songs we sang together
    Oh yeah And oh my love
    I'm holding on forever
    Reaching for a love that seems so far Mark:
    To hold you in my arms
    To promise you my love
    To tell you from the heart
    You're all i'm thinking of
    I'm reaching for a love that seems so far 
    So i say a little prayer
    And hope my dreams will take me there
    Where the skies are blue to see you once again, my love
    Over seas and coast to coast
    To find a place i love the most
    Where the fields are green to see you once again,my love
    say a little prayer
    dreams will take me there
    Where the skies are blue to see you once again '''
    

      

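    Replace all separators such as , . ? ! ' : with spaces.

    # Note: the apostrophe is not in sep, so contractions such as "i'm" stay as one token
    sep = ''':.,?!'''
    for i in sep:
        article = article.replace(i, ' ')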
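The replace loop above scans the text once per separator. A minimal sketch of the same step done in a single pass with `str.maketrans` and `str.translate`; the sample string and the names `table` and `cleaned` are illustrative, not from the original:

```python
# Map every separator character to a space in a single pass.
sep = ':.,?!'
table = str.maketrans(sep, ' ' * len(sep))

sample = "Oh yeah! And oh my love, I'm holding on forever."
cleaned = sample.translate(table)
print(cleaned)
```

Each separator becomes exactly one space, so the length of the text does not change and `split()` later collapses any runs of spaces.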
    

      

    Convert all uppercase letters to lowercase.

    article = article.lower()
    

      

    Build the word list.

    article_list = article.split()
    print(article_list)
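Calling `split()` with no argument matters here: it treats any run of whitespace (spaces left behind by the separator replacement, line breaks in the lyrics) as one delimiter. A small sketch of the difference, using an illustrative sample string:

```python
line = "once again  my love\nsay a little prayer"

words = line.split()      # any run of whitespace counts as one delimiter
naive = line.split(' ')   # splits on single spaces only

print(words)
print(naive)              # contains an empty string and a token with an embedded newline
```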
    

      

    Build the word frequency counts.

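    # Method 1: iterate over the set of distinct words

    # article_dict = {}
    # article_set = set(article_list) - exclude  # the set removes duplicates; subtracting drops the excluded words
    # for w in article_set:
    #     article_dict[w] = article_list.count(w)
    # # Walk the dictionary
    # for w in article_dict:
    #     print(w, article_dict[w])


    # Method 2: iterate over the list
    exclude = {'the', 'to', 'is', 'and'}  # must be defined before it is used below
    article_dict = {}
    for w in article_list:
        article_dict[w] = article_dict.get(w, 0) + 1
    # Remove unwanted words; pop() with a default avoids a KeyError for words that never occur
    for w in exclude:
        article_dict.pop(w, None)

    for w in article_dict:
        print(w, article_dict[w])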
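The `get(w, 0) + 1` loop can also be expressed with the standard library's `collections.Counter`. This is an alternative sketch, not the approach used in the exercise; the sample sentence is illustrative:

```python
from collections import Counter

words = "i wonder how i wonder why i wonder where they are".split()
counts = Counter(words)

print(counts['wonder'])
print(counts['missing'])  # absent words count as 0 instead of raising KeyError
```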
    

      

     

    Sort.

    dictList = list(article_dict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)
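Sorting by count alone leaves words with equal counts in whatever order the dictionary produced them. If a reproducible order is wanted, one option is a composite key: negative count first, then the word itself. A sketch with made-up counts:

```python
counts = {'love': 3, 'again': 3, 'my': 5, 'blue': 1}

# Descending by count, then alphabetical among ties.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```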
    

      

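    Exclude grammatical words: pronouns, articles, conjunctions.

    exclude = {'the', 'to', 'is', 'and'}
    # pop() with a default avoids a KeyError when a word (such as 'is') never occurs
    for w in exclude:
        article_dict.pop(w, None)
    # Re-sort so the ranked list no longer contains the excluded words
    dictList = sorted(article_dict.items(), key=lambda x: x[1], reverse=True)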
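Deleting keys changes the dictionary in place. Where the unfiltered counts should stay available, a dict comprehension builds a filtered copy instead. A minimal sketch with made-up counts; the name `filtered` is illustrative:

```python
counts = {'the': 4, 'love': 3, 'to': 2, 'skies': 1}
exclude = {'the', 'to', 'is', 'and'}

# Keep only the words that are not in the exclusion set.
filtered = {w: n for w, n in counts.items() if w not in exclude}
print(filtered)
```

Words in `exclude` that never occur in the text (here 'is' and 'and') need no special handling with this approach.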
    

      

    Print the 20 most frequent words.

    # Slicing avoids an IndexError when there are fewer than 20 distinct words
    for item in dictList[:20]:
        print(item)
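Printing the raw tuples works, but a ranked, aligned listing is easier to read. A sketch using `enumerate` and f-string alignment; the sample data and column widths are arbitrary choices:

```python
dictList = [('i', 9), ('love', 7), ('my', 5)]
top_n = 20

# The slice simply returns fewer items when the list is shorter than top_n.
lines = [
    f"{rank:>2}. {word:<10} {count}"
    for rank, (word, count) in enumerate(dictList[:top_n], start=1)
]
print("\n".join(lines))
```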
    

      

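    Save the text to be analyzed as a UTF-8 encoded file, and read the content for the word frequency analysis from that file.

    # "with" closes the file even if reading fails
    with open("test.txt", "r", encoding='utf-8') as file:
        article = file.read()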
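To try the file-based step end to end, the text has to be written out first. A minimal round-trip sketch; it writes to a temporary directory so nothing in the working directory is touched, and the two-line sample text is illustrative:

```python
import os
import tempfile

text = "An empty street\nAn empty house\n"
path = os.path.join(tempfile.mkdtemp(), "test.txt")

# Write the text as UTF-8, then read it back the same way.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    article = f.read()

print(article == text)
```

Passing `encoding="utf-8"` on both sides avoids depending on the platform's default encoding.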
    

      

    2. Chinese word frequency count

    Download a long Chinese article.

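    Read the text to analyze from a file.

    # jieba needs a string, not a file object, so read the content first
    with open('gzccnews.txt', 'r', encoding='utf-8') as f:
        news = f.read()

    Install jieba and use it for Chinese word segmentation.

    pip install jieba

    import jieba

    jieba.lcut(news)  # lcut already returns a list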
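`jieba.lcut` returns a list of tokens, while `jieba.cut` returns a generator over the same tokens. A small sketch on an illustrative sentence, assuming jieba is installed; the per-character fallback is only a stand-in so the snippet still runs without jieba, and it is not real word segmentation:

```python
try:
    import jieba

    def segment(text):
        return jieba.lcut(text)  # list of tokens

except ImportError:

    def segment(text):
        return list(text)  # crude stand-in: one token per character

tokens = segment("我喜欢学习中文分词")
print(tokens)
```

The exact token boundaries depend on jieba's dictionary, so no particular output is shown here.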

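    Build the word frequency counts.

    Sort.

    Exclude grammatical words: pronouns, articles, conjunctions.

    Print the 20 most frequent words (or save the results to a file).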

    import jieba

    # Open the file
    with open("gzccnews.txt", 'r', encoding="utf-8") as file:
        notes = file.read()

    # Replace punctuation with spaces
    sep = ''':。,?!;∶ ...“”'''
    for i in sep:
        notes = notes.replace(i, ' ')

    notes_list = list(jieba.cut(notes))


    # Words to exclude
    exclude = [' ', '\n', '我', '你', '边', '上', '说', '了', '的', '那', '些', '什', '么', '话', '呢']


    # Method 2: iterate over the list
    notes_dict = {}
    for w in notes_list:
        notes_dict[w] = notes_dict.get(w, 0) + 1

    # Remove unwanted words; pop() with a default avoids a KeyError for words that never occur
    for w in exclude:
        notes_dict.pop(w, None)

    for w in notes_dict:
        print(w, notes_dict[w])


    # Sort in descending order
    dictList = list(notes_dict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)
    print(dictList)

    # Print the 20 most frequent words
    for item in dictList[:20]:
        print(item)

    # Save the results to a file ("a" appends, so repeated runs add to the file)
    with open("top20.txt", "a", encoding="utf-8") as outfile:
        for word, count in dictList[:20]:
            outfile.write(word + " " + str(count) + "\n")
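The hand-written exclusion list above consists mostly of single characters and whitespace. One possible shortcut, which is a different heuristic from the one in the exercise, is to drop every token shorter than two characters after stripping whitespace. The token list below is hypothetical, shaped like what `jieba.lcut` might return:

```python
from collections import Counter

# Hypothetical tokens; real input would come from jieba.lcut(notes).
tokens = ['学校', '的', '学生', ' ', '参加', '了', '学校', '活动', '\n', '学生']

# Keep only tokens with at least two non-whitespace characters.
counts = Counter(t for t in tokens if len(t.strip()) > 1)
top = counts.most_common(2)
print(top)
```

This also removes meaningful single-character words, so it trades precision for convenience.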
    

      

    Post the code and screenshots of the results on your blog.

  • Original article: https://www.cnblogs.com/qq412158152/p/8660824.html