zoukankan      html  css  js  c++  java
  • 【Python】词频统计

    • 需求:一篇文章,出现了哪些词?哪些词出现得最多?

    英文文本词频统计

    英文文本:Hamlet 分析词频

    统计英文词频分为两步:

    • 文本去噪及归一化
    • 使用字典表达词频

    代码:

    #CalHamletV1.py
    def getText():
        txt = open("hamlet.txt", "r").read()
        txt = txt.lower()
        for ch in '!"#$%&()*+,-./:;<=>?@[\]^_‘{|}~':
            txt = txt.replace(ch, " ")   #将文本中特殊字符替换为空格
        return txt
     
    hamletTxt = getText()
    words  = hamletTxt.split()
    counts = {}
    for word in words:           
        counts[word] = counts.get(word,0) + 1
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True) 
    for i in range(10):
        word, count = items[i]
        print ("{0:<10}{1:>5}".format(word, count))

    运行结果:

    the        1138
    and         965
    to          754
    of          669
    you         550
    i           542
    a           542
    my          514
    hamlet      462
    in          436

    中文文本词频统计

    中文文本:《三国演义》分析人物

    统计中文词频分为两步:

    • 中文文本分词
    • 使用字典表达词频
    #CalThreeKingdomsV1.py
    import jieba
    txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
    words  = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        else:
            counts[word] = counts.get(word,0) + 1
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True) 
    for i in range(15):
        word, count = items[i]
        print ("{0:<10}{1:>5}".format(word, count))

    运行结果:

    曹操      953
    孔明  836
    将军  772
    却说  656
    玄德  585
    关公  510
    丞相  491
    二人  469
    不可  440
    荆州  425
    玄德曰     390
    孔明曰     390
    不能  384
    如此  378
    张飞  358

    能很明显的看到有一些不相关或重复的信息

    优化版本

    统计中文词频分为三步:

    • 中文文本分词
    • 使用字典表达词频
    • 扩展程序解决问题

    我们将不相关或重复的信息放在 excludes 集合里面进行排除。

    #CalThreeKingdomsV2.py
    import jieba
    excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
    txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
    words  = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        elif word == "诸葛亮" or word == "孔明曰":
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        else:
            rword = word
        counts[rword] = counts.get(rword,0) + 1
    for word in excludes:
        del counts[word]
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True) 
    for i in range(10):
        word, count = items[i]
        print ("{0:<10}{1:>5}".format(word, count))

    考研英语词频统计

    将词频统计应用到考研英语中,我们可以统计出出现次数较多的关键单词。
    文本链接: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA 密码: fw3r

    # CalHamletV1.py
    def getText():
        txt = open("86_17_1_2.txt", "r").read()
        txt = txt.lower()
        for ch in '!"#$%&()*+,-./:;<=>?@[\]^_‘{|}~':
            txt = txt.replace(ch, " ")   #将文本中特殊字符替换为空格
        return txt
    
    pyTxt = getText()   #获得没有任何标点的txt文件
    words  = pyTxt.split()  #获得单词
    counts = {} #字典,键值对
    excludes = {"the", "a", "of", "to", "and", "in", "b", "c", "d", "is",
                "was", "are", "have", "were", "had", "that", "for", "it",
                "on", "be", "as", "with", "by", "not", "their", "they",
                "from", "more", "but", "or", "you", "at", "has", "we", "an",
                "this", "can", "which", "will", "your", "one", "he", "his", "all", "people", "should", "than", "points", "there", "i", "what", "about", "new", "if", "”",
                "its", "been", "part", "so", "who", "would", "answer", "some", "our", "may", "most", "do", "when", "1", "text", "section", "2", "many", "time", "into", 
                "10", "no", "other", "up", "following", "【答案】", "only", "out", "each", "much", "them", "such", "world", "these", "sheet", "life", "how", "because", "3", "even", 
                "work", "directions", "use", "could", "now", "first", "make", "years", "way", "20", "those", "over", "also", "best", "two", "well", "15", "us", "write", "4", "5", "being", "social", "read", "like", "according", "just", "take", "paragraph", "any", "english", "good", "after", "own", "year", "must", "american", "less", "her", "between", "then", "children", "before", "very", "human", "long", "while", "often", "my", "too", 
                "40", "four", "research", "author", "questions", "still", "last", "business", "education", "need", "information", "public", "says", "passage", "reading", "through", "women", "she", "health", "example", "help", "get", "different", "him", "mark", "might", "off", "job", "30", "writing", "choose", "words", "economic", "become", "science", "society", "without", "made", "high", "students", "few", "better", "since", "6", "rather", "however", "great", "where", "culture", "come", 
                "both", "three", "same", "government", "old", "find", "number", "means", "study", "put", "8", "change", "does", "today", "think", "future", "school", "yet", "man", "things", "far", "line", "7", "13", "50", "used", "states", "down", "12", "14", "16", "end", "11", "making", "9", "another", "young", "system", "important", "letter", "17", "chinese", "every", "see", "s", "test", "word", "century", "language", "little", 
                "give", "said", "25", "state", "problems", "sentence", "food", "translation", "given", "child", "18", "longer", "question", "back", "don’t", "19", "against", "always", "answers", "know", "having", "among", "instead", "comprehension", "large", "35", "want", "likely", "keep", "family", "go", "why", "41", "home", "law", "place", "look", "day", "men", "22", "26", "45", "it’s", "others", "companies", "countries", "once", "money", "24", "though", 
                "27", "29", "31", "say", "national", "ii", "23", "based", "found", "28", "32", "past", "living", "university", "scientific", "–", "36", "38", "working", "around", "data", "right", "21", "jobs", "33", "34", "possible", "feel", "process", "effect", "growth", "probably", "seems", "fact", "below", "37", "39", "history", "technology", "never", "sentences", "47", "true", "scientists", "power", "thought", "during", "48", "early", "parents", 
                "something", "market", "times", "46", "certain", "whether", "000", "did", "enough", "problem", "least", "federal", "age", "idea", "learn", "common", "political", "pay", "view", "going", "attention", "happiness", "moral", "show", "live", "until", "52", "49", "ago", "percent", "stress", "43", "44", "42", "meaning", "51", "e", "iii", "u", "60", "anything", "53", "55", "cultural", "nothing", "short", "100", "water", "car", "56", "58", "【解析】", "54", "59", "57", "v", "。","63", "64", "65", "61", "62", "66", "70", "75", "f", "【考点分析】", "67", "here", "68",  "71", "72", "69", "73", "74", "选项a", "ourselves", "teachers", "helps", "参考范文", "gdp", "yourself", "gone", "150"}
    for word in words:
        if word not in excludes:
            counts[word] = counts.get(word,0) + 1
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True) 
    for i in range(10):
        word, count = items[i]
        print ("{0:<10}{1:>5}".format(word, count))
    
    x = len(counts)
    print(x)
    
    r = 0
    
    next = eval(input("1继续"))
    
    while next == 1:
        r += 100
        for i in range(r, r+100):
            word, count = items[i]
            print (""{}"".format(word), end = ", ")
        next = eval(input("1继续"))
  • 相关阅读:
    使用Google浏览器做真机页面调试
    JavaScript从作用域到闭包
    用canvas实现一个colorpicker
    你还在为移动端选择器picker插件而捉急吗?
    我是这样写文字轮播的
    高性能JS-DOM
    ExtJs4学习(四):Extjs 中id与itemId的差别
    MongoDB 安装与启动
    UML应用:业务内涵的分析抽象&amp;表达
    MySQL 错误日志(Error Log)
  • 原文地址:https://www.cnblogs.com/blknemo/p/12996830.html
Copyright © 2011-2022 走看看