  • Comprehensive exercise: word frequency counting

    I. English word frequency counting

    1. Download the lyrics of an English song or an English article

    I said, "Dad, you might leave now." But he looked out of the window and said, "I'm going to buy you some tangerines. You just stay here. Don't move around." I caught sight of several vendors waiting for customers outside the railings beyond a platform. But to reach that platform would require crossing the railway track and doing some climbing up and down. That would be a strenuous job for father, who was fat. I wanted to do all that myself, but he stopped me, so I could do nothing but let him go. I watched him hobble towards the railway track in his black skullcap, black cloth mandarin jacket and dark blue cotton-padded cloth long gown. He had little trouble climbing down the railway track, but it was a lot more difficult for him to climb up that platform after crossing the railway track. His hands held onto the upper part of the platform, his legs huddled up and his corpulent body tipped slightly towards the left, obviously making an enormous exertion. While I was watching him from behind, tears gushed from my eyes. I quickly wiped them away lest he or others should catch me crying. The next moment when I looked out of the window again, father was already on the way back, holding bright red tangerines in both hands. In crossing the railway track, he first put the tangerines on the ground, climbed down slowly and then picked them up again. When he came near the train, I hurried out to help him by the hand. After boarding the train with me, he laid all the tangerines on my overcoat, and patting the dirt off his clothes, he looked somewhat relieved and said after a while, "I must be going now. Don't forget to write me from Beijing!" I gazed after his back retreating out of the carriage. After a few steps, he looked back at me and said, "Go back to your seat. Don't leave your things alone." I, however, did not go back to my seat until his figure was lost among crowds of people hurrying to and fro and no longer visible. My eyes were again wet with tears.

    2. Replace all separators such as , . ? ! ’ : with spaces

    # Replace each separator character with a space
    sep = ''':.,?!'''
    for i in sep:
        article = article.replace(i, ' ')
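
The same substitution can also be done in a single pass with `str.translate`. The sketch below is an alternative to the loop above, not the author's code; the helper name `replace_separators` and the sample sentence are made up for illustration.

```python
def replace_separators(text, separators=":.,?!"):
    """Return text with every character in separators replaced by a space."""
    # str.maketrans(x, y) maps each character of x to the character at the
    # same position in y, so y must have the same length as x.
    table = str.maketrans(separators, " " * len(separators))
    return text.translate(table)


sample = "Go back to your seat. Don't leave your things alone!"
cleaned = replace_separators(sample)
print(cleaned.lower().split())
```

Building the table once is convenient when the same separator set is applied to many strings; for a single short text the loop above works just as well.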

    3. Convert all uppercase letters to lowercase

    article = article.lower()
    

    4. Build the word list

    # split() with no argument splits on any run of whitespace
    article_list = article.split()
    print(article_list)
    

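    5. Count word frequencies

    article_dict = {}
    for w in article_list:
        article_dict[w] = article_dict.get(w, 0) + 1
    # Drop the stop words before sorting; `exclude` is the set defined in step 7.
    # pop() with a default avoids a KeyError when a word is not in the text.
    for w in exclude:
        article_dict.pop(w, None)

    for w in article_dict:
        print(w, article_dict[w])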
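
The standard library's `collections.Counter` does the same counting as the `get(w, 0) + 1` loop. This is a minimal sketch of that alternative, assuming a short hand-written word list rather than the article text.

```python
from collections import Counter

words = "he looked out of the window and he said go back".split()

# Counter is a dict subclass: it counts hashable items and returns 0 for
# missing keys instead of raising KeyError.
counts = Counter(words)

print(counts["he"])       # "he" appears twice in the sample
print(counts["missing"])  # absent words count as 0
```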
    

    6. Sort

    # Sort the (word, count) pairs by count, highest first
    dictList = list(article_dict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)
    

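    7. Exclude grammatical (function) words: pronouns, articles, conjunctions

    exclude = {'the', 'to', 'is', 'and'}
    # Run this before the sort in step 6, otherwise dictList still contains these words.
    # pop() with a default avoids a KeyError when a word is not in the text.
    for w in exclude:
        article_dict.pop(w, None)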
    

    8. Print the 20 most frequent words

    # Slicing avoids an IndexError when there are fewer than 20 distinct words
    for item in dictList[:20]:
        print(item)
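
The order of steps 5 to 8 matters: stop words have to be removed before the list is sorted and sliced, or they will still show up in the top 20. One way to make that order explicit is to put the steps in a single function. This is a sketch under that assumption; the name `top_words` and the sample sentence are hypothetical.

```python
def top_words(words, stop_words, n=20):
    """Return up to n (word, count) pairs, most frequent first."""
    counts = {}
    for w in words:
        if w in stop_words:   # skip stop words before counting
            continue
        counts[w] = counts.get(w, 0) + 1
    # Sort by descending count; ties are broken alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]         # slicing is safe even with fewer than n words


words = "the train and the track and the tangerines by the track".split()
for word, count in top_words(words, {"the", "to", "is", "and"}, n=3):
    print(word, count)
```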
    

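    9. Save the text to analyze as a UTF-8 encoded file and read it from that file for the word frequency analysis.

    # The with statement closes the file automatically
    with open("test.txt", "r", encoding='utf-8') as file:
        article = file.read()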
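
Reading the file and preparing the word list (steps 2 to 4 and 9) can be combined into one helper. The sketch below assumes a hypothetical function name `read_words` and writes its own small sample file to a temporary directory so that it runs without `test.txt` being present.

```python
import os
import tempfile


def read_words(path, separators=":.,?!"):
    """Read a UTF-8 text file and return its lowercase words."""
    # The with statement closes the file even if read() raises.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    for ch in separators:
        text = text.replace(ch, " ")
    return text.lower().split()


# Demo: write a short sample to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("My eyes were again wet with tears.")
    words = read_words(sample_path)

print(words)
```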
    

      

    II. Chinese word frequency counting: download a long Chinese article.

    import jieba

    # Open the file
    with open("zgsjtl.txt", 'r', encoding="utf-8") as file:
        notes = file.read()

    # Replace punctuation marks with spaces
    sep = ''':。,?!;∶ ...“”'''
    for i in sep:
        notes = notes.replace(i, ' ')

    # Segment the text into words
    notes_list = list(jieba.cut(notes))
     
     
    
    # Whitespace tokens and common function words to exclude
    exclude = [' ', '\n', '你', '嗯', '他', '和', '但', '啊', '的', '来', '是', '去', '在', '上', '走']
     
    notes_dict = {}
    for w in notes_list:
        notes_dict[w] = notes_dict.get(w, 0) + 1

    # pop() with a default avoids a KeyError when a word is not in the text
    for w in exclude:
        notes_dict.pop(w, None)

    for w in notes_dict:
        print(w, notes_dict[w])
     
     
    
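    # Sort the (word, count) pairs by count, highest first
    dictList = list(notes_dict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)
    print(dictList)

    # Slicing avoids an IndexError when there are fewer than 20 distinct words
    for item in dictList[:20]:
        print(item)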
    
    # Append the top 20 to a file, one "word count" pair per line
    with open("top20.txt", "a", encoding="utf-8") as outfile:
        for word, count in dictList[:20]:
            outfile.write(word + " " + str(count) + "\n")
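
The counting and file-writing steps for the Chinese text can be sketched as one function. To keep the example runnable without jieba installed, the token list is written by hand; in the article it would come from `jieba.cut`. The helper name `write_top_words`, the sample tokens, and the choice of mode "w" instead of "a" are assumptions made for illustration.

```python
import os
import tempfile


def write_top_words(tokens, stop_words, path, n=20):
    """Count tokens, drop stop words, write the top n as 'word count' lines."""
    counts = {}
    for t in tokens:
        if t in stop_words or not t.strip():  # skip stop words and whitespace
            continue
        counts[t] = counts.get(t, 0) + 1
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
    # "w" overwrites earlier results; an explicit encoding keeps Chinese text
    # readable regardless of the platform default.
    with open(path, "w", encoding="utf-8") as out:
        for word, count in ranked:
            out.write(f"{word} {count}\n")
    return ranked


# Hand-written tokens standing in for the output of a word segmenter.
tokens = ["父亲", "的", "橘子", "父亲", "月台", "的", "父亲", "橘子", "\n"]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "top20.txt")
    ranked = write_top_words(tokens, {"的"}, out_path)
    with open(out_path, encoding="utf-8") as f:
        written = f.read()

print(ranked)
```

Filtering whitespace with `t.strip()` means the stop-word list no longer needs entries such as ' ' and '\n'.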
    

      

  • Original article: https://www.cnblogs.com/jiesheng/p/8660763.html