  • Text word frequency statistics

    Word frequency: the number of times each word appears in a text.
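
As a quick illustration of the definition, the sketch below counts the words of one short sentence with a plain dictionary. The sample sentence and the name `freq` are assumptions made for this example and do not come from the original post.

```python
# Hypothetical sample text, chosen only to illustrate the definition.
sample = "to be or not to be"

freq = {}
for word in sample.split():
    # dict.get returns 0 the first time a word is seen
    freq[word] = freq.get(word, 0) + 1

print(freq)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```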

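    # Count English words:
    # The path is kept as printed in the source; point it at your own copy of the text.
    with open('F:实习pythonhamlet', 'r', encoding='utf8') as f:
        data = f.read().lower()
    data_split = data.split()    # split on any whitespace, not only single spaces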
    
    count_dict = {}
    for word in data_split:
        if word not in count_dict:
            count_dict[word] = 1
        else:
            count_dict[word] += 1
    
    def func(i):
        return i[1]
    lt = list(count_dict.items())
    lt.sort(key = func)
    
    lt.reverse()            # order the results from largest to smallest
    for i in lt[0:10]:
        print(f'{i[0]:^7}{i[1]:^5}')
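
Splitting on whitespace alone leaves punctuation attached to words, so "king," and "king" would be counted separately. One possible variation, sketched here with only the standard library, extracts words with a regular expression and lets `collections.Counter` do the counting and ranking. The function name `top_words` and the sample sentence are hypothetical and only stand in for the contents of the hamlet file.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase, then keep runs of letters and apostrophes so that
    # "King!" and "king" are counted as the same word.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)

# Hypothetical sample text standing in for the contents of the hamlet file.
sample = "The king is dead. Long live the King!"
for word, count in top_words(sample, 3):
    print(f'{word:^7}{count:^5}')
```

`Counter.most_common(n)` already returns the pairs sorted from the highest count down, so the separate sort and reverse steps are not needed in this version.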
    
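    # Count Chinese words:
    import jieba            # jieba is a third-party library used here for Chinese word segmentation
    # The path is kept as printed in the source (some separators were lost); point it at your own copy.
    with open(r'F:实习python719\threekingdoms', 'r', encoding='utf8') as f:
        data = f.read()
    data_jieba = jieba.lcut(data)   # lcut returns the segmented words as a list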
    
    count_dict = {}
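    for word in data_jieba:
        # Strip the character '曰' ("said") first, so '孔明曰' is counted as '孔明'
        word = word.replace('曰', '')
        # Skip empty strings, single characters and punctuation
        if len(word) <= 1:
            continue
        # Skip frequent words that are not of interest
        if word in {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}:
            continue
        if word not in count_dict:
            count_dict[word] = 1
        else:
            count_dict[word] += 1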
    
    def func(i):
        return i[1]
    data_list = list(count_dict.items())
    data_list.sort(key = func)
    
    data_list.reverse()
    print(data_list)
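
The filtering and counting steps can also be wrapped in a small reusable function. The sketch below is one way to do that, not the original author's code: the name `count_tokens` and its parameters are hypothetical, and the token list is made up to stand in for the output of `jieba.lcut(data)` so the example runs without jieba installed. It assumes the '曰' character should be stripped before the length and stop-word checks.

```python
from collections import Counter

def count_tokens(tokens, stopwords=frozenset(), strip_chars='曰', min_len=2):
    """Count tokens after stripping marker characters and filtering."""
    counts = Counter()
    for token in tokens:
        # Remove marker characters such as '曰' before filtering.
        for ch in strip_chars:
            token = token.replace(ch, '')
        # Drop tokens that are too short or listed as stop words.
        if len(token) < min_len or token in stopwords:
            continue
        counts[token] += 1
    return counts

# Hypothetical token list standing in for the output of jieba.lcut(data).
tokens = ['孔明', '曰', '将军', '孔明曰', '曹操', '，', '曹操', '孔明']
stopwords = {'将军', '却说', '荆州', '二人', '不可', '不能', '如此'}
counts = count_tokens(tokens, stopwords)
for word, n in counts.most_common(2):
    print(word, n)
```

Passing the token list in as an argument keeps the counting logic independent of the segmenter, so the same function could be fed the real `jieba.lcut` result.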
    
  • Original article: https://www.cnblogs.com/yushan1/p/11213414.html