  • Segmenting hot words with jieba, counting word frequencies, and filtering stop words

    import jieba
    from collections import Counter
    
    if __name__ == '__main__':
        filehandle = open("boke.txt", "r", encoding='utf-8', errors='ignore')
        mystr = filehandle.read()
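        seg_list = jieba.cut(mystr)  # precise mode is the default
        print(seg_list)  # jieba.cut returns a generator, so this prints the generator object, not the words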
        # all_words = cut_words.split()
        # print(all_words)
        # Load the stop words into a set; the with-block closes the file afterwards.
        with open("stop.txt", "r", encoding='utf-8', errors='ignore') as stopfile:
            stopwords = {line.rstrip() for line in stopfile}
        c = Counter()
        for x in seg_list:
            if x not in stopwords:
                if len(x) > 1 and x != '\r\n':  # skip single characters and line-break tokens
                    c[x] += 1
        print('\nWord frequency results:')
        for (k, v) in c.most_common(50):  # print the 50 most frequent words
            print("%s:%d" % (k, v))
    
        # print(mystr)
        filehandle.close()
        # seg2 = jieba.cut("好好学学python,有用。", cut_all=False)
        # print("Precise mode (also the default):", ' '.join(seg2))
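
The counting step above can also be pulled into a small helper that is easy to check without a corpus file. This is only a minimal sketch, assuming the tokens have already been produced (for example by `jieba.cut`); the name `top_words` and its parameters `n` and `min_len` are hypothetical and not part of jieba.

```python
from collections import Counter


def top_words(tokens, stopwords, n=50, min_len=2):
    """Count tokens, skipping stop words, short tokens and whitespace.

    `top_words`, `n` and `min_len` are illustrative names, not jieba APIs.
    """
    counts = Counter(
        tok for tok in tokens
        if len(tok) >= min_len and tok.strip() and tok not in stopwords
    )
    return counts.most_common(n)


# Pre-segmented sample; in the script this would be jieba.cut(mystr).
sample = ["学习", "python", "的", "学习", "有用", "\r\n", "python", "学习"]
result = top_words(sample, stopwords={"有用"}, n=2)
print(result)
```

Because the helper accepts any iterable of strings, the generator returned by `jieba.cut` could be passed in directly. Testing `tok.strip()` instead of comparing against one specific newline string drops any whitespace-only token.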
    

      I based this on Long Ge's (龙哥) code; my own version kept running into encoding problems.
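
Encoding problems of this kind often come from reading a file with the wrong codec. One possible approach is to try a few candidate encodings in order rather than silently dropping bytes with `errors='ignore'`. The sketch below assumes the text is either UTF-8 or GBK, which is a guess about the input and not something the post states. The helper name `read_text` and the candidate list are hypothetical.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8-sig", "gbk")):
    """Return the file's text, decoded with the first encoding that works.

    `read_text` and the default candidate list are illustrative assumptions.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark undecodable bytes visibly.
    return raw.decode("utf-8", errors="replace")


# Demo: a GBK-encoded file is not valid UTF-8, so the GBK fallback is used.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("结巴分词".encode("gbk"))
text = read_text(tmp.name)
os.remove(tmp.name)
print(text)
```

In the script above, `mystr = read_text("boke.txt")` could stand in for the `open(...)` and `read()` pair. The order of the candidates matters: GBK accepts many byte sequences, so it is tried after the stricter UTF-8.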

  • Original article: https://www.cnblogs.com/1061321925wu/p/12293488.html