  • [Python] Word Frequency Statistics

    • Requirement: given an article, which words appear in it, and which words appear most often?

    Word frequency statistics for English text

    English text: Hamlet, analyzed for word frequency

    Counting English word frequencies takes two steps:

    • Clean and normalize the text
    • Use a dictionary to record word frequencies

    Code:

    #CalHamletV1.py
    def getText():
        txt = open("hamlet.txt", "r").read()
        txt = txt.lower()
        for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
            txt = txt.replace(ch, " ")   # replace special characters in the text with spaces
        return txt
     
    hamletTxt = getText()
    words  = hamletTxt.split()                      # split the normalized text into words
    counts = {}
    for word in words:
        counts[word] = counts.get(word,0) + 1       # count occurrences of each word
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True)     # sort by frequency, descending
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))  # left-align the word, right-align the count

    Output:

    the        1138
    and         965
    to          754
    of          669
    you         550
    i           542
    a           542
    my          514
    hamlet      462
    in          436
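
    As an aside, the standard library's collections.Counter can replace the hand-rolled dictionary. A minimal sketch, assuming hamlet.txt has already been normalized by the getText() function above:

    from collections import Counter

    words = getText().split()
    for word, count in Counter(words).most_common(10):   # the ten most frequent words
        print("{0:<10}{1:>5}".format(word, count))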

    Word frequency statistics for Chinese text

    Chinese text: 《三国演义》 (Romance of the Three Kingdoms), analyzed for its characters

    Counting Chinese word frequencies takes two steps:

    • Segment the Chinese text into words
    • Use a dictionary to record word frequencies

    #CalThreeKingdomsV1.py
    import jieba
    txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
    words  = jieba.lcut(txt)        # jieba precise-mode segmentation returns a list of words
    counts = {}
    for word in words:
        if len(word) == 1:          # skip single-character tokens (mostly particles and punctuation)
            continue
        else:
            counts[word] = counts.get(word,0) + 1
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True)   # sort by frequency, descending
    for i in range(15):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

    Output:

    曹操      953
    孔明  836
    将军  772
    却说  656
    玄德  585
    关公  510
    丞相  491
    二人  469
    不可  440
    荆州  425
    玄德曰     390
    孔明曰     390
    不能  384
    如此  378
    张飞  358

    It is easy to see that some of these results are irrelevant, and some refer to the same person under different names.

    Improved version

    The improved Chinese word-frequency count takes three steps:

    • Segment the Chinese text into words
    • Use a dictionary to record word frequencies
    • Extend the program to fix the problems above

    We collect the irrelevant words in the excludes set and remove them from the counts, and we merge different names for the same character (for example, 玄德 and 玄德曰 both map to 刘备) into a single key.

    #CalThreeKingdomsV2.py
    import jieba
    excludes = {"将军","却说","荆州","二人","不可","不能","如此"}   # irrelevant high-frequency words to drop
    txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
    words  = jieba.lcut(txt)
    counts = {}
    for word in words:
        if len(word) == 1:              # skip single-character tokens
            continue
        elif word == "诸葛亮" or word == "孔明曰":   # merge aliases of the same character
            rword = "孔明"
        elif word == "关公" or word == "云长":
            rword = "关羽"
        elif word == "玄德" or word == "玄德曰":
            rword = "刘备"
        elif word == "孟德" or word == "丞相":
            rword = "曹操"
        else:
            rword = word
        counts[rword] = counts.get(rword,0) + 1
    for word in excludes:
        del counts[word]                # drop the irrelevant high-frequency words
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True)
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

    Word frequency statistics for postgraduate entrance exam English

    Applying word frequency statistics to postgraduate entrance exam (考研) English texts, we can extract the key words that appear most often.
    Text link: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA  Password: fw3r

    # adapted from CalHamletV1.py
    def getText():
        txt = open("86_17_1_2.txt", "r").read()
        txt = txt.lower()
        for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
            txt = txt.replace(ch, " ")   # replace special characters in the text with spaces
        return txt
    
    pyTxt = getText()       # lower-cased text with punctuation replaced by spaces
    words  = pyTxt.split()  # split into words
    counts = {}             # dictionary mapping word -> count
    excludes = {"the", "a", "of", "to", "and", "in", "b", "c", "d", "is",
                "was", "are", "have", "were", "had", "that", "for", "it",
                "on", "be", "as", "with", "by", "not", "their", "they",
                "from", "more", "but", "or", "you", "at", "has", "we", "an",
                "this", "can", "which", "will", "your", "one", "he", "his", "all", "people", "should", "than", "points", "there", "i", "what", "about", "new", "if", "”",
                "its", "been", "part", "so", "who", "would", "answer", "some", "our", "may", "most", "do", "when", "1", "text", "section", "2", "many", "time", "into", 
                "10", "no", "other", "up", "following", "【答案】", "only", "out", "each", "much", "them", "such", "world", "these", "sheet", "life", "how", "because", "3", "even", 
                "work", "directions", "use", "could", "now", "first", "make", "years", "way", "20", "those", "over", "also", "best", "two", "well", "15", "us", "write", "4", "5", "being", "social", "read", "like", "according", "just", "take", "paragraph", "any", "english", "good", "after", "own", "year", "must", "american", "less", "her", "between", "then", "children", "before", "very", "human", "long", "while", "often", "my", "too", 
                "40", "four", "research", "author", "questions", "still", "last", "business", "education", "need", "information", "public", "says", "passage", "reading", "through", "women", "she", "health", "example", "help", "get", "different", "him", "mark", "might", "off", "job", "30", "writing", "choose", "words", "economic", "become", "science", "society", "without", "made", "high", "students", "few", "better", "since", "6", "rather", "however", "great", "where", "culture", "come", 
                "both", "three", "same", "government", "old", "find", "number", "means", "study", "put", "8", "change", "does", "today", "think", "future", "school", "yet", "man", "things", "far", "line", "7", "13", "50", "used", "states", "down", "12", "14", "16", "end", "11", "making", "9", "another", "young", "system", "important", "letter", "17", "chinese", "every", "see", "s", "test", "word", "century", "language", "little", 
                "give", "said", "25", "state", "problems", "sentence", "food", "translation", "given", "child", "18", "longer", "question", "back", "don’t", "19", "against", "always", "answers", "know", "having", "among", "instead", "comprehension", "large", "35", "want", "likely", "keep", "family", "go", "why", "41", "home", "law", "place", "look", "day", "men", "22", "26", "45", "it’s", "others", "companies", "countries", "once", "money", "24", "though", 
                "27", "29", "31", "say", "national", "ii", "23", "based", "found", "28", "32", "past", "living", "university", "scientific", "–", "36", "38", "working", "around", "data", "right", "21", "jobs", "33", "34", "possible", "feel", "process", "effect", "growth", "probably", "seems", "fact", "below", "37", "39", "history", "technology", "never", "sentences", "47", "true", "scientists", "power", "thought", "during", "48", "early", "parents", 
                "something", "market", "times", "46", "certain", "whether", "000", "did", "enough", "problem", "least", "federal", "age", "idea", "learn", "common", "political", "pay", "view", "going", "attention", "happiness", "moral", "show", "live", "until", "52", "49", "ago", "percent", "stress", "43", "44", "42", "meaning", "51", "e", "iii", "u", "60", "anything", "53", "55", "cultural", "nothing", "short", "100", "water", "car", "56", "58", "【解析】", "54", "59", "57", "v", "。","63", "64", "65", "61", "62", "66", "70", "75", "f", "【考点分析】", "67", "here", "68",  "71", "72", "69", "73", "74", "选项a", "ourselves", "teachers", "helps", "参考范文", "gdp", "yourself", "gone", "150"}
    for word in words:
        if word not in excludes:                    # skip common words and exam boilerplate
            counts[word] = counts.get(word,0) + 1
    items = list(counts.items())
    items.sort(key=lambda x:x[1], reverse=True)     # sort by frequency, descending
    for i in range(10):
        word, count = items[i]
        print("{0:<10}{1:>5}".format(word, count))

    x = len(counts)
    print(x)                                        # number of distinct words counted
    
    r = 0

    nxt = eval(input("Enter 1 to see the next 100 words: "))

    while nxt == 1:
        r += 100
        for i in range(r, min(r + 100, len(items))):   # next page of up to 100 words
            word, count = items[i]
            print('"{}"'.format(word), end=", ")
        nxt = eval(input("Enter 1 to see the next 100 words: "))
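
    The hardcoded excludes set above gets unwieldy as it grows. One alternative, sketched here with a hypothetical stopwords.txt (one excluded word per line, UTF-8), is to load the exclusion list from a file:

    # "stopwords.txt" is a hypothetical stopword file, one word per line
    with open("stopwords.txt", "r", encoding="utf-8") as f:
        excludes = {line.strip() for line in f if line.strip()}
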
  • Original article: https://www.cnblogs.com/blknemo/p/12996830.html