  • Python: Text Word-Frequency Counting

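    This semester I've been learning Python from Song Tian's (嵩天) MOOC course. Quite a few of the examples worked when I typed along with the videos but broke when I ran them myself in VS Code (files that couldn't be read, and so on), so I'm writing them down to make them quicker to find by search later.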

    #English text word-frequency count

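     #CalaliceV1.py
     def getText():
         # Read the whole file; the with-block closes it afterwards
         with open("11.txt","r",encoding='utf-8') as f:
             txt = f.read()
         txt = txt.lower() # convert all uppercase letters to lowercase
         # The comma was missing from the original list; the backslash is now escaped
         for ch in '|"$%&*()^#@;:_-.,><!~`[\\]+=?/“”{|}':
             txt=txt.replace(ch," ") # replace each special character with a space
         return txt
     # Result: lowercase text without punctuation, words separated by whitespace
     aliceTxt=getText()
     words=aliceTxt.split() # split() splits on whitespace and returns a list
     counts={}
     for word in words:
         counts[word]=counts.get(word,0)+1 # default to 0 the first time a word is seen
     items=list(counts.items())
     items.sort(key=lambda x:x[1],reverse=True) # most frequent first
     for word,count in items[:10]: # top 10; slicing is safe with fewer than 10 words
         print("{0:<10}{1:>5}".format(word,count))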
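The counting loop and sort above can also be written with `collections.Counter` from the standard library. This is a minimal sketch of that alternative, not the course's code: it splits words with a regular expression instead of replacing punctuation, and `top_words` and the sample string are hypothetical names used only for illustration.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    # Treat each run of letters or apostrophes as a word; anything else separates words
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

# Hypothetical sample text standing in for the contents of 11.txt
sample = "The cat saw the dog. The dog ran!"
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

`most_common(n)` simply returns fewer pairs when the text has fewer than `n` distinct words, so no index check is needed.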

    #Chinese text word-frequency count

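     import jieba
     # Read as text; change the encoding if the file is not UTF-8 (for example "gb18030")
     with open("sangou.txt","r",encoding="utf-8") as f:
         txt=f.read()
     # Frequent words that are not character names, removed from the ranking
     excludes={"将军","却说","荆州","二人","不可","如此","不能","商议","如何","军马","引兵","次日","大喜","天下","于是","东吴","今日","不敢","陛下","人马","左右","军士","主公","魏兵","都督","一人","不知","汉中","众将","只见","后主","蜀兵","大叫","上马","此人","先主","城中","太守","天子","背后","后人"}
     words=jieba.lcut(txt) # segment the text into a list of words
     counts={}
     for word in words:
         if len(word)==1: # skip single characters
             continue
         # Merge different names for the same person into one key
         elif word=='诸葛亮' or word=='孔明曰':
             rword='孔明'
         elif word=='关公' or word=='云长':
             rword='关羽'
         elif word=='玄德' or word=='玄德曰':
             rword='刘备'
         elif word=='孟德' or word=='丞相':
             rword='曹操'
         else:
             rword=word
         counts[rword]=counts.get(rword,0)+1
     for word in excludes:
         counts.pop(word,None) # unlike del, no KeyError if the word never appeared
     items=list(counts.items())
     items.sort(key=lambda x:x[1],reverse=True) # most frequent first
     for word,count in items[:15]: # top 15; slicing is safe with fewer entries
         print("{0:<10}{1:>5}".format(word,count))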
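The chain of `elif` branches that merges names can also be expressed as a lookup table. The sketch below is one possible way to do that, under the assumption that the text has already been segmented: `count_with_aliases` is a hypothetical helper, and the short token list stands in for the output of `jieba.lcut`.

```python
def count_with_aliases(tokens, aliases, excludes):
    """Count tokens of two or more characters, merging aliases and dropping excluded words."""
    counts = {}
    for token in tokens:
        if len(token) == 1:
            continue  # skip single characters, as in the script above
        name = aliases.get(token, token)  # fall back to the token itself
        counts[name] = counts.get(name, 0) + 1
    for word in excludes:
        counts.pop(word, None)  # no KeyError if the word never appeared
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)

# Hypothetical token list standing in for jieba.lcut(txt)
tokens = ["孔明", "诸葛亮", "曰", "玄德", "刘备", "孔明曰", "将军"]
aliases = {"诸葛亮": "孔明", "孔明曰": "孔明", "玄德": "刘备"}
print(count_with_aliases(tokens, aliases, {"将军", "却说"}))
```

Adding another alias then means adding one dictionary entry rather than another branch.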
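    Note: in my VS Code setup, the file to read had to go in the parent directory rather than next to the code. A relative path such as "11.txt" is resolved against the current working directory (in VS Code this is usually the opened workspace folder), not against the folder the script is in.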
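To avoid depending on where the program is launched from, the path can be built from the script's own location instead. A minimal sketch using `pathlib`; `path_next_to_script` is a hypothetical helper name, and the commented usage line assumes the data file sits beside the script.

```python
from pathlib import Path

def path_next_to_script(filename, script_file):
    """Build an absolute path to filename in the same folder as script_file."""
    return Path(script_file).resolve().parent / filename

# In a real script, pass __file__ as script_file, for example:
#   txt = path_next_to_script("11.txt", __file__).read_text(encoding="utf-8")
print(Path.cwd())  # the folder that plain relative paths are resolved against
```

Printing `Path.cwd()` once is also a quick way to check which folder a failing `open("11.txt")` is actually looking in.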
  • Original post: https://www.cnblogs.com/Nickyl07/p/12727388.html