  • Counting word frequency in text

    Word frequency: the number of times each word occurs in a text.
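
As a quick illustration of the definition, here is a minimal sketch that counts the words of one short sentence. The sample sentence is made up for this example, and `collections.Counter` from the standard library is used only as a shortcut for the dictionary-based counting shown in the article.

```python
from collections import Counter

# Made-up sample sentence, not taken from the original article.
sentence = "to be or not to be"

# Counter maps each word to the number of times it occurs.
freq = Counter(sentence.split())

# most_common(n) returns the n (word, count) pairs with the highest counts.
print(freq.most_common(2))  # [('to', 2), ('be', 2)]
```

Here "to" and "be" each have a frequency of 2, while "or" and "not" each have a frequency of 1.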

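    # Count English words:
    # Adjust the path to wherever your copy of the text is stored.
    with open(r'F:\实习\python\hamlet', 'r', encoding='utf8') as f:
        data = f.read().lower()    # lowercase so "The" and "the" count as one word
    data_split = data.split()      # split on any whitespace, including newlines

    count_dict = {}
    for word in data_split:
        if word not in count_dict:
            count_dict[word] = 1
        else:
            count_dict[word] += 1

    lt = list(count_dict.items())
    lt.sort(key=lambda item: item[1], reverse=True)    # sort by count, largest first

    # Print the ten most frequent words, each column centered
    for word, count in lt[0:10]:
        print(f'{word:^7}{count:^5}')
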
    
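    # Count Chinese words:
    import jieba            # third-party library for Chinese word segmentation
    # Adjust the path to wherever your copy of the text is stored.
    with open(r'F:\实习\python\719\threekingdoms', 'r', encoding='utf8') as f:
        data = f.read()
    data_jieba = jieba.lcut(data)    # segment the text into a list of words

    count_dict = {}
    for word in data_jieba:
        # Strip "曰" ("said") first so a token such as "孔明曰" counts as "孔明"
        if '曰' in word:
            word = word.replace('曰', '')
        # Skip single characters, punctuation, and anything left empty
        if len(word) <= 1:
            continue
        # Skip frequent words that are not of interest
        if word in {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}:
            continue
        if word not in count_dict:
            count_dict[word] = 1
        else:
            count_dict[word] += 1

    data_list = list(count_dict.items())
    data_list.sort(key=lambda item: item[1], reverse=True)    # sort by count, largest first
    print(data_list)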
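
The two scripts above share the same count, filter, and sort steps, so they can be folded into one helper. The following is only a sketch under stated assumptions: `tokenize_english` and `top_words` are hypothetical names introduced here, the regular expression is one possible way to strip punctuation (splitting on whitespace alone leaves "be," and "be" as different words), and the Chinese token list is hand-written to stand in for what `jieba.lcut(data)` would return.

```python
import re
from collections import Counter


def tokenize_english(text):
    """Lowercase the text and keep runs of letters/apostrophes as words.

    Hypothetical helper: punctuation is dropped instead of staying
    attached to the neighbouring word.
    """
    return re.findall(r"[a-z']+", text.lower())


def top_words(tokens, n=10, stopwords=frozenset(), min_len=1):
    """Return the n most frequent tokens as (token, count) pairs.

    Tokens shorter than min_len or listed in stopwords are ignored.
    """
    counts = Counter(
        tok for tok in tokens
        if len(tok) >= min_len and tok not in stopwords
    )
    return counts.most_common(n)


# English: made-up sample text, punctuation no longer creates extra entries.
sample = "To be, or not to be: that is the question."
english_top = top_words(tokenize_english(sample), n=2)
print(english_top)  # [('to', 2), ('be', 2)]

# Chinese: this hand-written list stands in for the output of jieba.lcut(data).
tokens = ["曹操", "却说", "孔明", "曹操", "曰", "孔明", "曹操"]
chinese_top = top_words(tokens, n=2, stopwords={"却说"}, min_len=2)
print(chinese_top)  # [('曹操', 3), ('孔明', 2)]
```

With real data, the token list passed to `top_words` would come from `tokenize_english(data)` or `jieba.lcut(data)`, and the stopword set would be the one used in the script above.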
    
  • Original article: https://www.cnblogs.com/yushan1/p/11213414.html