  • Comprehensive exercise: word frequency statistics

    Download the lyrics of an English song or an English article

    Generate the word frequency statistics

    news='''At the same time, the market of TV dramas has also maintained rapid development. In 2017, the production volume of TV dramas in China reaches 310 and 13,000 sets, and continues to be the no.1 in the world. The "national treasure", "national treasure", "if the national treasure can talk" and other TV variety shows, documentaries, vividly spread the excellent Chinese traditional culture.
    With modern technology, traditional culture is rejuvenated. Hangzhou songcheng group with new technology to interpret ancient Chinese traditional story, the Qingdao publishing group is using virtual reality, 3 d printing technology, the audience can feel the charm of traditional culture anytime and anywhere.
    In recent years, China's cultural industry has been growing rapidly, and the pace of "going out" has been accelerating. As of last year, China's publishing enterprises set up more than 400 branches overseas and established cooperative partnership with over 500 publishing institutions in over 70 countries. People's day boat publishing co., LTD. Was set up in less than two years, has published "Chinese traditional festival" (in Arabic), "in a pocket of father" (French version) and so on more than 40 foreign language books.
     '''
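    # Punctuation to replace with spaces before splitting; parentheses are
    # included so that tokens such as "(in" and "arabic)" are not counted as words
    sep = ''',.;:'"()'''
    for c in sep:
        news = news.replace(c, ' ')

    # Lowercase so that "The" and "the" are counted as the same word
    wordlist = news.lower().split()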
    
    wordDict = {}
    for w in wordlist:
        wordDict[w] = wordDict.get(w, 0) + 1
    # Alternative (slower): count each distinct word with list.count()
    # wordSet = set(wordlist)
    # for w in wordSet:
    #     wordDict[w] = wordlist.count(w)
    for w in wordDict:
        print(w, wordDict[w])
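
The counting step above can also be sketched with the standard library. This is a minimal sketch, not the author's method: the helper name `word_counts` and the sample sentence are assumptions made for illustration, and a regular expression is swapped in for the character-by-character replacement.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count runs of letters (apostrophes allowed inside a word)."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)

# Hypothetical sample, shortened from the article text used above
sample = 'The "national treasure", "national treasure" and other TV shows.'
counts = word_counts(sample)
print(counts['national'], counts['treasure'], counts['the'])
```

Matching letter runs avoids having to list every punctuation mark by hand, at the cost of dropping tokens that contain digits.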
    

      

    Sorting

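    # Count each distinct word once
    wordSet = set(wordlist)
    for w in wordSet:
        wordDict[w] = wordlist.count(w)
    # Convert to a list of (word, count) pairs and sort by count, highest first
    dictList = list(wordDict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)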
     
    print(dictList)
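
Sorting by count alone leaves words with equal counts in arbitrary order. A minimal sketch of a deterministic variant follows, assuming ties should be broken alphabetically; the function name `sort_by_frequency` and the sample counts are hypothetical.

```python
def sort_by_frequency(word_dict):
    """Return (word, count) pairs, highest count first, ties in alphabetical order."""
    return sorted(word_dict.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical counts for illustration
sample_counts = {'culture': 3, 'china': 2, 'traditional': 3, 'tv': 2}
ranked = sort_by_frequency(sample_counts)
print(ranked)
```

Negating the count in the key lets a single ascending sort handle both criteria.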
    

      

    Exclude function words: pronouns, articles, conjunctions

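    exclude = {'the', 'a', 'an', 'and', 'of', 'with', 'to', 'by', 'am', 'are', 'is', 'which', 'on'}
    # Rebuild the dictionary so that excluded words counted in earlier steps do not linger
    wordDict = {}
    wordSet = set(wordlist) - exclude
    for w in wordSet:
        wordDict[w] = wordlist.count(w)
    dictList = list(wordDict.items())
    dictList.sort(key=lambda x: x[1], reverse=True)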
     
    print(dictList)
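
The exclusion step can also be applied to an already built dictionary. The sketch below assumes the counts exist and only filters them; `remove_function_words` and the sample counts are hypothetical names and data.

```python
def remove_function_words(word_dict, exclude):
    """Return a new dict without the words listed in exclude."""
    return {w: n for w, n in word_dict.items() if w not in exclude}

exclude = {'the', 'a', 'an', 'and', 'of', 'with', 'to', 'by', 'am', 'are', 'is', 'which', 'on'}
# Hypothetical counts for illustration
sample_counts = {'the': 9, 'culture': 3, 'of': 6, 'traditional': 3}
filtered = remove_function_words(sample_counts, exclude)
print(filtered)
```

Building a new dictionary, rather than updating the old one in place, ensures that previously counted function words cannot remain in the result.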
    

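    Output the 20 most frequent words, save the text to be analyzed as a UTF-8 encoded file, and obtain the content for the word frequency analysis by reading that file.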

    # Slicing avoids an IndexError when there are fewer than 20 distinct words
    for item in dictList[:20]:
        print(item)
    
    
    print('author:xujinpei')
    # Read the text to analyze from a UTF-8 encoded file
    with open('news.txt', 'r', encoding='utf-8') as f:
        news = f.read()
    print(news)
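
The file round trip and the top-N output can be combined as sketched below. This is an illustration under assumptions: the helper names, the temporary file name `news_demo.txt`, and the short sample text are all hypothetical, and `collections.Counter` stands in for the manual dictionary.

```python
import os
import tempfile
from collections import Counter

def save_text(path, text):
    """Write the text to path as UTF-8."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def top_words_from_file(path, n=20):
    """Read a UTF-8 file and return its n most frequent whitespace-separated words."""
    with open(path, 'r', encoding='utf-8') as f:
        words = f.read().lower().split()
    return Counter(words).most_common(n)

# Hypothetical file in a temporary directory, so nothing in the working directory is touched
path = os.path.join(tempfile.mkdtemp(), 'news_demo.txt')
save_text(path, 'culture culture china culture china tv')
top = top_words_from_file(path, n=2)
print(top)
```

Passing `encoding='utf-8'` explicitly on both the write and the read keeps the result independent of the platform's default encoding.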
    

    Chinese word frequency

    import jieba

    t = '在国内电影票房连创新高的同时,电视剧市场同样保持快速发展,2017年,我国电视剧生产量达310部、1.3万集,继续稳居世界第一。《中国诗词大会》《国家宝藏》《如果国宝会说话》等电视综艺节目、纪录片,生动传播了中华优秀传统文化。'
    # jieba.cut returns a generator of tokens; list() materializes it for printing
    text = list(jieba.cut(t))
    print(text)
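
The segmentation above only produces tokens; counting them is a separate step. The sketch below assumes the tokens are already available as a list, so it runs without jieba installed. The function name `chinese_word_freq`, the `min_len` parameter, and the hand-written token list are hypothetical; in practice the list would come from `list(jieba.cut(t))`.

```python
from collections import Counter

def chinese_word_freq(tokens, min_len=2):
    """Count tokens, skipping punctuation, whitespace, and tokens shorter than min_len."""
    counts = Counter()
    for tok in tokens:
        tok = tok.strip()
        # str.isalnum() is True for Chinese characters and digits, False for punctuation
        if len(tok) >= min_len and all(ch.isalnum() for ch in tok):
            counts[tok] += 1
    return counts

# Hypothetical token list, shaped like the output of list(jieba.cut(t))
tokens = ['电视剧', '市场', '同样', '保持', '快速', '发展', ',', '电视剧', '生产量', '。']
freq = chinese_word_freq(tokens)
print(freq.most_common(3))
```

Dropping single-character tokens is a common heuristic for removing particles such as "的", though it also discards legitimate one-character words.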
    

      

  • Original article: https://www.cnblogs.com/zhongchengzhe/p/8658569.html