zoukankan      html  css  js  c++  java
  • jieba分词讲解2

    3. 关键词提取

    基于 TF-IDF 算法的关键词抽取

    import jieba.analyse

    • jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
      • sentence 为待提取的文本
      • topK 为返回几个 TF/IDF 权重最大的关键词,默认值为 20
      • withWeight 为是否一并返回关键词权重值,默认值为 False
      • allowPOS 仅包括指定词性的词,默认值为空,即不筛选
    • jieba.analyse.TFIDF(idf_path=None) 新建 TFIDF 实例,idf_path 为 IDF 频率文件

    代码示例 (关键词提取):https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py(代码如下)

    复制代码
    复制代码
    复制代码
    import sys
    sys.path.append('../')
    
    import jieba
    import jieba.analyse
    from optparse import OptionParser
    
    USAGE = "usage:    python extract_tags.py [file name] -k [top k]"
    
    parser = OptionParser(USAGE)
    parser.add_option("-k", dest="topK")
    opt, args = parser.parse_args()
    
    
    if len(args) < 1:
        print(USAGE)
        sys.exit(1)
    
    file_name = args[0]
    
    if opt.topK is None:
        topK = 10
    else:
        topK = int(opt.topK)
    
    content = open(file_name, 'rb').read()
    
    tags = jieba.analyse.extract_tags(content, topK=topK)
    
    print(",".join(tags))
    复制代码
    复制代码
    复制代码

    关键词提取所使用逆向文件频率(IDF)文本语料库可以切换成自定义语料库的路径

    复制代码
    复制代码
    复制代码
    import sys
    sys.path.append('../')
    
    import jieba
    import jieba.analyse
    from optparse import OptionParser
    
    USAGE = "usage:    python extract_tags_idfpath.py [file name] -k [top k]"
    
    parser = OptionParser(USAGE)
    parser.add_option("-k", dest="topK")
    opt, args = parser.parse_args()
    
    
    if len(args) < 1:
        print(USAGE)
        sys.exit(1)
    
    file_name = args[0]
    
    if opt.topK is None:
        topK = 10
    else:
        topK = int(opt.topK)
    
    content = open(file_name, 'rb').read()
    
    jieba.analyse.set_idf_path("../extra_dict/idf.txt.big");#与extract_tags相比多了这一句
    
    tags = jieba.analyse.extract_tags(content, topK=topK)
    
    print(",".join(tags))
    复制代码
    复制代码
    复制代码

    关键词提取所使用停止词(Stop Words)文本语料库可以切换成自定义语料库的路径

    复制代码
    复制代码
    复制代码
    import sys
    sys.path.append('../')
    
    import jieba
    import jieba.analyse
    from optparse import OptionParser
    
    USAGE = "usage:    python extract_tags_stop_words.py [file name] -k [top k]"
    
    parser = OptionParser(USAGE)
    parser.add_option("-k", dest="topK")
    opt, args = parser.parse_args()
    
    
    if len(args) < 1:
        print(USAGE)
        sys.exit(1)
    
    file_name = args[0]
    
    if opt.topK is None:
        topK = 10
    else:
        topK = int(opt.topK)
    
    content = open(file_name, 'rb').read()
    
    jieba.analyse.set_stop_words("../extra_dict/stop_words.txt")#停用词
    jieba.analyse.set_idf_path("../extra_dict/idf.txt.big");#idf词频
    
    tags = jieba.analyse.extract_tags(content, topK=topK)
    
    print(",".join(tags))
    复制代码
    复制代码
    复制代码

    关键词一并返回关键词权重值示例

    复制代码
    复制代码
    复制代码
    import sys
    sys.path.append('../')
    
    import jieba
    import jieba.analyse
    from optparse import OptionParser
    
    USAGE = "usage:    python extract_tags_with_weight.py [file name] -k [top k] -w [with weight=1 or 0]"
    
    parser = OptionParser(USAGE)
    parser.add_option("-k", dest="topK")
    parser.add_option("-w", dest="withWeight")
    opt, args = parser.parse_args()
    
    
    if len(args) < 1:
        print(USAGE)
        sys.exit(1)
    
    file_name = args[0]
    
    if opt.topK is None:
        topK = 10
    else:
        topK = int(opt.topK)
    
    if opt.withWeight is None:
        withWeight = False
    else:
        if int(opt.withWeight) is 1:
            withWeight = True
        else:
            withWeight = False
    
    content = open(file_name, 'rb').read()
    
    tags = jieba.analyse.extract_tags(content, topK=topK, withWeight=withWeight)
    
    if withWeight is True:
        for tag in tags:
            print("tag: %s\t\t weight: %f" % (tag[0],tag[1]))
    else:
        print(",".join(tags))
    复制代码
    复制代码
    复制代码
  • 相关阅读:
    Android自定义之仿360Root大师水纹效果
    Android之TextView的Span样式源码剖析
    Android之TextView的样式类Span的使用详解
    随着ScrollView的滑动,渐渐的执行动画View
    仿微信主界面导航栏图标字体颜色的变化
    android自定义之 5.0 风格progressBar
    Android性能优化之内存篇
    Android性能优化之运算篇
    How to install Zabbix5.0 LTS version with Yum on the CentOS 7.8 system?
    How to install Zabbix4.0 LTS version with Yum on the Oracle Linux 7.3 system?
  • 原文地址:https://www.cnblogs.com/mjhjl/p/15665106.html
Copyright © 2011-2022 走看看