  • Simple word-frequency counting in Python with jieba

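    def getText():
        # Read the play, lowercase it, and replace punctuation with spaces
        with open("hamlet.txt", "r") as f:
            txt = f.read().lower()
        # Characters missing from this set ('!', '=', the apostrophe) stay
        # attached to words, so "hamlet!" is counted separately from "hamlet"
        for ch in '"#$%&()*+,-./:;<>?@[\\]^_{|}~':
            txt = txt.replace(ch, " ")
        return txt

    hamletTxt = getText()
    words = hamletTxt.split()

    # Count the occurrences of each word
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1

    items = list(counts.items())

    # Sort by the second element (the count), from largest to smallest
    items.sort(key=lambda x: x[1], reverse=True)

    # Print the ten most frequent words (fewer if the text is very short)
    for word, count in items[:10]:
        print(f"{word}:{count}")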


    Output:

    the:1138
    and:965
    to:754
    of:668
    you:549
    a:542
    i:540
    my:514
    hamlet:456
    in:436
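
The same count-and-sort steps can also be written with the standard library's `collections.Counter`. This is only a sketch, not the code that produced the listing above: the function name `top_words` and the sample sentence are made up for illustration, and because `string.punctuation` covers more characters than the hand-written set in the post, counts on `hamlet.txt` would likely differ slightly.

```python
from collections import Counter
import string


def top_words(text, n=10):
    """Return the n most frequent whitespace-separated words in text."""
    # Map every ASCII punctuation character to a space in a single pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    # most_common sorts by count, descending, and simply returns fewer
    # pairs when the text has fewer than n distinct words.
    return Counter(words).most_common(n)


# Illustrative sample text (not taken from the post).
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print(f"{word}:{count}")
```

`Counter.most_common(n)` replaces the manual `list(counts.items())` plus `sort` step, and it avoids an `IndexError` on short inputs.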
    
    import jieba

    with open("threekingdoms.txt", "r", encoding="utf-8") as f:
        txt = f.read()
    # Frequent words that are not personal names
    excludes = {"将军", "却说", "二人", "荆州", "不可", "不能", "如此", "商议", "左右", "引兵", "如何", "主公"}
    # Segment the Chinese text into a list of words
    words = jieba.lcut(txt)
    counts = {}
    for word in words:
        # Skip single-character tokens; merge different names for the same person
        if len(word) == 1:
            continue
        elif word == "诸葛亮" or word == "孔明曰":
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        else:
            rword = word
        counts[rword] = counts.get(rword, 0) + 1
    # Drop the excluded words; pop() avoids a KeyError if a word never occurred
    for word in excludes:
        counts.pop(word, None)
    items = list(counts.items())
    # Sort by count, from largest to smallest
    items.sort(key=lambda x: x[1], reverse=True)
    for word, count in items[:10]:
        print(f"{word}:{count}")


    Output:

    曹操:1451
    孔明:1383
    刘备:1252
    关羽:784
    张飞:358
    军士:317
    吕布:300
    军马:293
    赵云:278
    次日:271
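
The chain of `elif` branches that merges aliases can be expressed as a lookup table, which is easier to extend when more non-name words such as 军士 or 次日 show up in the output. The sketch below is one possible arrangement, not the post's code: `ALIASES`, `EXCLUDES`, and `count_names` are hypothetical names, the exclude set is shortened, and the function takes an already segmented token list so it runs without jieba installed.

```python
from collections import Counter

# Hypothetical alias table: each variant maps to one canonical name.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}

# Shortened, illustrative set of frequent words that are not names.
EXCLUDES = {"将军", "却说", "二人"}


def count_names(tokens, aliases=ALIASES, excludes=EXCLUDES):
    """Count multi-character tokens, merging aliases and skipping excludes."""
    counts = Counter()
    for token in tokens:
        if len(token) == 1:
            continue  # single characters are rarely names
        name = aliases.get(token, token)  # fall back to the token itself
        if name not in excludes:
            counts[name] += 1
    return counts


# Hand-written token list standing in for the result of jieba.lcut(txt).
tokens = ["孔明", "诸葛亮", "曰", "将军", "玄德曰", "孔明曰"]
for name, count in count_names(tokens).most_common(10):
    print(f"{name}:{count}")
```

With jieba installed, `tokens` would instead come from `jieba.lcut(txt)` as in the post; filtering excluded words while counting gives the same totals as deleting them afterwards.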
    
  • Original article: https://www.cnblogs.com/837634902why/p/13721355.html