  • jieba Word Segmentation Explained (Part 2)

    3. Keyword extraction

    Keyword extraction based on the TF-IDF algorithm

    import jieba.analyse

    • jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
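      • sentence: the text to extract keywords from
      • topK: how many of the keywords with the highest TF-IDF weight to return; defaults to 20
      • withWeight: whether to return each keyword's weight along with it; defaults to False
      • allowPOS: only include words with the specified parts of speech; defaults to empty, i.e. no filtering
    • jieba.analyse.TFIDF(idf_path=None) creates a new TFIDF instance; idf_path is the path to the IDF frequency file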
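
To make the topK and withWeight parameters concrete, here is a minimal pure-Python sketch of TF-IDF ranking. It is a toy stand-in, not jieba's implementation: the function name `toy_extract_tags`, the pre-segmented token list, and the IDF values are all hypothetical, and the median-IDF fallback for unknown words is an assumption of this sketch.

```python
from collections import Counter


def toy_extract_tags(tokens, idf, top_k=20, with_weight=False):
    """Rank tokens by TF-IDF weight (illustrative only, not jieba's code)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Assumption of this sketch: words missing from the IDF table
    # fall back to the median IDF value.
    default_idf = sorted(idf.values())[len(idf) // 2]
    weights = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # with_weight=True returns (word, weight) pairs, otherwise just the words.
    return ranked if with_weight else [word for word, _ in ranked]


# Hypothetical pre-segmented text and IDF table.
tokens = ["keyword", "extraction", "with", "jieba", "keyword", "with", "with"]
idf = {"keyword": 3.0, "extraction": 4.0, "jieba": 7.0, "with": 0.1}

top_two = toy_extract_tags(tokens, idf, top_k=2)
top_two_weighted = toy_extract_tags(tokens, idf, top_k=2, with_weight=True)
print(top_two)
print(top_two_weighted)
```

Note that "with" is the most frequent token but ranks last, because its low IDF marks it as a common word.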

    Code example (keyword extraction): https://github.com/fxsjy/jieba/blob/master/test/extract_tags.py (code reproduced below)

    import sys
    sys.path.append('../')
    
    import jieba
    import jieba.analyse
    from optparse import OptionParser
    
    USAGE = "usage:    python extract_tags.py [file name] -k [top k]"
    
    parser = OptionParser(USAGE)
    parser.add_option("-k", dest="topK")
    opt, args = parser.parse_args()
    
    
    if len(args) < 1:
        print(USAGE)
        sys.exit(1)
    
    file_name = args[0]
    
    if opt.topK is None:
        topK = 10
    else:
        topK = int(opt.topK)
    
    content = open(file_name, 'rb').read()
    
    tags = jieba.analyse.extract_tags(content, topK=topK)
    
    print(",".join(tags))
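
The command-line handling in the script above could also be sketched with argparse, which converts -k to an integer and supplies the default of 10 in one place. The helper name `parse_cli` and the sample file name are hypothetical; the option names mirror the script.

```python
import argparse


def parse_cli(argv):
    """Parse '[file name] -k [top k]' the way the script above expects."""
    parser = argparse.ArgumentParser(description="Extract keywords from a text file.")
    parser.add_argument("file_name", help="path to the input text file")
    parser.add_argument("-k", dest="topK", type=int, default=10,
                        help="number of keywords to return (default: 10)")
    return parser.parse_args(argv)


# Hypothetical invocation: python extract_tags.py article.txt -k 5
opts = parse_cli(["article.txt", "-k", "5"])
print(opts.file_name, opts.topK)
```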

    The inverse document frequency (IDF) corpus used for keyword extraction can be switched to the path of a custom corpus

    import sys
    sys.path.append('../')
    
    import jieba
    import jieba.analyse
    from optparse import OptionParser
    
    USAGE = "usage:    python extract_tags_idfpath.py [file name] -k [top k]"
    
    parser = OptionParser(USAGE)
    parser.add_option("-k", dest="topK")
    opt, args = parser.parse_args()
    
    
    if len(args) < 1:
        print(USAGE)
        sys.exit(1)
    
    file_name = args[0]
    
    if opt.topK is None:
        topK = 10
    else:
        topK = int(opt.topK)
    
    content = open(file_name, 'rb').read()
    
    jieba.analyse.set_idf_path("../extra_dict/idf.txt.big")  # the only line added compared with extract_tags.py
    
    tags = jieba.analyse.extract_tags(content, topK=topK)
    
    print(",".join(tags))
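
As a rough illustration of what a custom IDF corpus might contain, the sketch below builds a small IDF file from pre-segmented documents. It assumes a layout of one word and its IDF value per line, separated by a space, and uses the plain formula log(N / df). Both the formula and the helper name `write_idf_file` are assumptions of this sketch; the source does not say how the bundled idf.txt.big was produced.

```python
import math
import os
import tempfile


def write_idf_file(documents, path):
    """Write 'word idf' lines for a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # Count each word once per document.
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    with open(path, "w", encoding="utf-8") as f:
        for word in sorted(doc_freq):
            idf = math.log(n_docs / doc_freq[word])  # assumed formula
            f.write(f"{word} {idf:.6f}\n")


# Hypothetical, already-segmented documents.
documents = [["jieba", "分词"], ["jieba", "关键词"], ["停用词"]]
idf_path = os.path.join(tempfile.mkdtemp(), "my_idf.txt")
write_idf_file(documents, idf_path)

# Read the file back to check its contents.
idf_table = {}
with open(idf_path, encoding="utf-8") as f:
    for line in f:
        word, value = line.split()
        idf_table[word] = float(value)
print(idf_table)
```

A file produced this way could then be passed to jieba.analyse.set_idf_path in place of "../extra_dict/idf.txt.big", provided the layout assumption holds.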

    The stop words corpus used for keyword extraction can also be switched to the path of a custom corpus

    import sys
    sys.path.append('../')
    
    import jieba
    import jieba.analyse
    from optparse import OptionParser
    
    USAGE = "usage:    python extract_tags_stop_words.py [file name] -k [top k]"
    
    parser = OptionParser(USAGE)
    parser.add_option("-k", dest="topK")
    opt, args = parser.parse_args()
    
    
    if len(args) < 1:
        print(USAGE)
        sys.exit(1)
    
    file_name = args[0]
    
    if opt.topK is None:
        topK = 10
    else:
        topK = int(opt.topK)
    
    content = open(file_name, 'rb').read()
    
    jieba.analyse.set_stop_words("../extra_dict/stop_words.txt")  # custom stop words file
    jieba.analyse.set_idf_path("../extra_dict/idf.txt.big")  # custom IDF file
    
    tags = jieba.analyse.extract_tags(content, topK=topK)
    
    print(",".join(tags))
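
The effect of a stop words file can be sketched in plain Python: candidate words found in the stop list are dropped before ranking. The sketch assumes one stop word per line and a case-insensitive match. The helper names and the sample words are hypothetical, and this is not jieba's own code.

```python
import os
import tempfile


def load_stop_words(path):
    """Load one stop word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def drop_stop_words(tokens, stop_words):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]


# Hypothetical stop words file.
stop_path = os.path.join(tempfile.mkdtemp(), "my_stop_words.txt")
with open(stop_path, "w", encoding="utf-8") as f:
    f.write("the\nof\n的\n")

stop_words = load_stop_words(stop_path)
kept = drop_stop_words(["The", "weight", "of", "关键词", "的", "提取"], stop_words)
print(kept)
```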

    Example that returns each keyword's weight along with the keyword

    import sys
    sys.path.append('../')
    
    import jieba
    import jieba.analyse
    from optparse import OptionParser
    
    USAGE = "usage:    python extract_tags_with_weight.py [file name] -k [top k] -w [with weight=1 or 0]"
    
    parser = OptionParser(USAGE)
    parser.add_option("-k", dest="topK")
    parser.add_option("-w", dest="withWeight")
    opt, args = parser.parse_args()
    
    
    if len(args) < 1:
        print(USAGE)
        sys.exit(1)
    
    file_name = args[0]
    
    if opt.topK is None:
        topK = 10
    else:
        topK = int(opt.topK)
    
    # "-w 1" turns weights on; a missing flag or any other value leaves them off.
    # Compare with ==, not "is": identity checks on integers are unreliable.
    withWeight = opt.withWeight is not None and int(opt.withWeight) == 1
    
    content = open(file_name, 'rb').read()
    
    tags = jieba.analyse.extract_tags(content, topK=topK, withWeight=withWeight)
    
    if withWeight:
        for tag in tags:
            print("tag: %s\t\t weight: %f" % (tag[0],tag[1]))
    else:
        print(",".join(tags))
  • Original article: https://www.cnblogs.com/mjhjl/p/15665106.html