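This semester I'm learning Python from Song Tian's (嵩天) course on MOOC. Quite often I typed the examples along with the videos and still ran into problems when I ran them myself in VS Code, so I'm writing those cases down so other people can find them quickly on Baidu. (Mostly things like the file never being found...)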
#Word frequency count for English text
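#CalaliceV1.py
def getText():
    # "with" closes the file even if reading fails
    with open("11.txt", "r", encoding="utf-8") as f:
        txt = f.read()
    txt = txt.lower()  # convert all uppercase letters to lowercase
    # The backslash is doubled so Python does not treat "\]" as an escape sequence.
    # A comma is included so that "alice," and "alice" count as the same word.
    for ch in '|"$%&*()^#@;:_-.,><!~`[\\]+=?/“”{|}':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt  # lowercase text without punctuation, words separated by whitespace

aliceTxt = getText()
words = aliceTxt.split()  # split() with no argument splits on whitespace and returns a list
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:  # slicing avoids an IndexError if there are fewer than 10 words
    print("{0:<10}{1:>5}".format(word, count))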
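The counting and sorting steps can also be sketched with `collections.Counter` from the standard library. This is only an alternative sketch, not the course's version: `top_words` is a made-up helper name, it takes the text as a string instead of reading `11.txt`, and it swaps the character-replacement loop for a regular expression that keeps only runs of the letters a to z (so digits and accented letters are dropped, unlike the script above).

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase first, then keep only runs of a-z; everything else acts as a separator.
    words = re.findall(r"[a-z]+", text.lower())
    # most_common(n) returns the pairs sorted by count, highest first.
    return Counter(words).most_common(n)


if __name__ == "__main__":
    for word, count in top_words("The cat and the hat. The cat!", 2):
        print("{0:<10}{1:>5}".format(word, count))
```

`Counter` replaces the `counts.get(word, 0) + 1` pattern and the manual sort, and `most_common(n)` does not fail when the text has fewer than `n` distinct words.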
#Word frequency count for Chinese text
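import jieba

# The file is read as bytes; jieba decodes byte input itself, so no encoding is given here
with open("sangou.txt", "rb") as f:
    txt = f.read()
# Frequent words that are not character names
excludes = {"将军","却说","荆州","二人","不可","如此","不能","商议","如何","军马",
            "引兵","次日","大喜","天下","于是","东吴","今日","不敢","陛下","人马",
            "左右","军士","主公","魏兵","都督","一人","不知","汉中","众将","只见",
            "后主","蜀兵","大叫","上马","此人","先主","城中","太守","天子","背后","后人"}
words = jieba.lcut(txt)  # lcut returns the segmented words as a list
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    # Merge different names for the same person into one key
    elif word == '诸葛亮' or word == '孔明曰':
        rword = '孔明'
    elif word == '关公' or word == '云长':
        rword = '关羽'
    elif word == '玄德' or word == '玄德曰':
        rword = '刘备'
    elif word == '孟德' or word == '丞相':
        rword = '曹操'
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # "del counts[word]" raises KeyError if the word never appeared
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:15]:  # slicing avoids an IndexError if there are fewer than 15 entries
    print("{0:<10}{1:>5}".format(word, count))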
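The chain of `elif` branches that merges several names for one person can also be written as a lookup table. Below is a rough sketch under a few assumptions: `ALIASES` and `count_names` are names invented here, the function takes an already segmented list of words (for example the result of `jieba.lcut`), and excluded words are skipped while counting instead of being deleted afterwards, which should give the same final counts.

```python
# Hypothetical alias table: each variant maps to the name used as the counting key.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_names(words, aliases=ALIASES, excludes=frozenset()):
    """Count words of length > 1, merging aliases and skipping excluded words."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue  # single characters are ignored, as in the script above
        name = aliases.get(word, word)  # fall back to the word itself
        if name in excludes:
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    sample = ["诸葛亮", "孔明曰", "孔明", "将军", "曰", "玄德"]
    print(count_names(sample, excludes={"将军"}))
```

Adding another alias then only needs one more entry in the dictionary rather than a new `elif` branch.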
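Note: in my setup the file to read had to go in the directory one level above the code, not in the same folder as the code. A relative name like "11.txt" is looked up in the current working directory, and VS Code typically runs the script from the folder you opened as the workspace, not from the folder the .py file sits in.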
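One way to avoid depending on where the script is launched from is to build the path from the script's own location. The following is a minimal sketch, assuming the text file is stored next to the .py file; `path_next_to` and `read_text_next_to` are hypothetical helper names, not something from the course.

```python
from pathlib import Path


def path_next_to(script_file, filename):
    """Return the absolute path of filename in the same folder as script_file."""
    # resolve() makes the path absolute; parent is the folder that holds the script.
    return Path(script_file).resolve().parent / filename


def read_text_next_to(script_file, filename, encoding="utf-8"):
    """Read a text file stored beside script_file, whatever the working directory is."""
    with open(path_next_to(script_file, filename), "r", encoding=encoding) as f:
        return f.read()


# Inside a script, pass the built-in __file__ variable, for example:
#     txt = read_text_next_to(__file__, "11.txt")
```

If the file still cannot be found, printing `os.getcwd()` (after `import os`) shows which directory relative paths are actually resolved against.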