  • Simple word-frequency counting in Python with jieba

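    def getText():
        # Read the whole file and lowercase it so "The" and "the" count as one word
        with open("hamlet.txt","r") as f:
            txt=f.read().lower()
        # Replace punctuation with spaces. "!", "=" and "'" are not in this list,
        # so a token such as "hamlet!" is counted separately from "hamlet"
        for ch in r'|"#$%&()*+,-./:;<>?@[\]^_{}~':
            txt=txt.replace(ch," ")
        return txt
    hamletTxt=getText()
    words=hamletTxt.split()
    counts={}
    for word in words:
        counts[word]=counts.get(word,0)+1

    items=list(counts.items())

    # Sort by the second element (the count), largest first
    items.sort(key=lambda x:x[1],reverse=True)

    # Print the ten most frequent words
    for i in range(10):
        word,count=items[i]
        print(word,end=":")
        print(count)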


    Output:

    the:1138
    and:965
    to:754
    of:668
    you:549
    a:542
    i:540
    my:514
    hamlet:456
    in:436
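
The same count can be written more compactly with the standard library's `collections.Counter` and `str.translate`. This is a minimal sketch rather than the original author's code: the names `top_words`, `PUNCTUATION` and `TO_SPACES`, and the sample sentence, are made up for illustration, and a short string stands in for `hamlet.txt`.

```python
from collections import Counter

# Same punctuation characters as the script above, mapped to spaces in one pass
PUNCTUATION = r'|"#$%&()*+,-./:;<>?@[\]^_{}~'
TO_SPACES = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))

def top_words(text, n=10):
    """Return the n most frequent whitespace-separated words in text."""
    words = text.lower().translate(TO_SPACES).split()
    return Counter(words).most_common(n)

# Illustrative input; in practice pass the contents of hamlet.txt
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print(word, count, sep=":")
```

`Counter.most_common` replaces the manual `items.sort(...)` step, and `str.translate` avoids rebuilding the string once per punctuation character.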
    
    import jieba
    with open("threekingdoms.txt","r",encoding="utf-8") as f:
        txt=f.read()
    # Frequent words that are not character names
    excludes={"将军","却说","二人","荆州","不可","不能","如此","商议","左右","引兵","如何","主公"}
    # jieba.lcut segments the Chinese text and returns a list of words
    words=jieba.lcut(txt)
    counts={}
    for word in words:
        # Skip single-character tokens; merge aliases of the same person
        if len(word)==1:
            continue
        elif word=="诸葛亮" or word=="孔明曰":
            rword="孔明"
        elif word=="关公" or word=="云长":
            rword="关羽"
        elif word=="玄德" or word=="玄德曰":
            rword="刘备"
        elif word=="孟德" or word=="丞相":
            rword="曹操"
        else:
            rword=word
        counts[rword]=counts.get(rword,0)+1
    # pop with a default avoids a KeyError if an excluded word never occurred
    for word in excludes:
        counts.pop(word,None)
    items=list(counts.items())
    items.sort(key=lambda x:x[1],reverse=True)
    for i in range(10):
        word,count=items[i]
        print(word,end=":")
        print(count)


    Output:

    曹操:1451
    孔明:1383
    刘备:1252
    关羽:784
    张飞:358
    军士:317
    吕布:300
    军马:293
    赵云:278
    次日:271
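
The output above still contains 军士, 军马 and 次日, which are not character names; adding them to the exclude set removes them from the ranking. The sketch below shows one way to do that while replacing the `if`/`elif` chain with a lookup table. It is an illustration under assumptions: `ALIASES`, `EXCLUDES` and `count_names` are hypothetical names, the exclude set is shortened, and a small hand-made token list stands in for the result of `jieba.lcut(txt)`.

```python
from collections import Counter

# Alias table: every variant maps to one canonical name (same pairs as the script above)
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
# Shortened exclude set, extended with the non-name words seen in the output
EXCLUDES = {"将军", "却说", "军士", "军马", "次日"}

def count_names(tokens, top_n=10):
    """Count multi-character tokens after merging aliases and dropping excludes."""
    counts = Counter()
    for token in tokens:
        if len(token) == 1:
            continue  # single characters are never names here
        name = ALIASES.get(token, token)
        if name not in EXCLUDES:
            counts[name] += 1
    return counts.most_common(top_n)

# Tokens would normally come from jieba.lcut(txt); this list is made up
tokens = ["玄德", "曰", "孔明", "诸葛亮", "军士", "刘备", "玄德曰", "云长"]
print(count_names(tokens))
```

Keeping the aliases in a dictionary means a new alias is one extra entry instead of another `elif` branch.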
    
  • Original post: https://www.cnblogs.com/837634902why/p/13721355.html