  • Python Data Mining: Keyword Extraction

    jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())

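    sentence: the text to extract keywords from
    topK: how many of the keywords with the highest TF-IDF weight to return; defaults to 20
    withWeight: whether to return each keyword's weight along with it; defaults to False
    allowPOS: keep only words with the listed part-of-speech tags; defaults to empty, meaning no filtering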
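
    To make the "TF-IDF weight" behind topK and withWeight concrete, here is a minimal pure-Python sketch of TF-IDF ranking. It is not jieba's implementation: the function name tfidf_top_k, the smoothed IDF formula, and the toy corpus are all assumptions for illustration, and jieba takes its IDF values from a precomputed table rather than from your own documents.

```python
import math
from collections import Counter


def tfidf_top_k(doc_tokens, corpus, top_k=20, with_weight=False):
    """Rank the tokens of one document by TF-IDF against a small corpus.

    Hypothetical helper for illustration only; not part of jieba.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total  # term frequency within this document
        df = sum(1 for doc in corpus if word in doc)  # documents containing the word
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed inverse document frequency
        scores[word] = tf * idf
    # Highest weight first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return ranked if with_weight else [word for word, _ in ranked]


# Toy corpus of already-tokenized documents (made-up data).
corpus = [
    ["python", "data", "mining", "python"],
    ["data", "cleaning"],
    ["data", "visualization"],
]
top_words = tfidf_top_k(corpus[0], corpus, top_k=2)
top_pairs = tfidf_top_k(corpus[0], corpus, top_k=2, with_weight=True)
```

    In this toy corpus "data" occurs in every document, so it ranks below "mining" even though both appear once in the first document. The with_weight flag mirrors withWeight: it switches the return value from a list of words to a list of (word, weight) pairs.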

    Modules used: os, codecs, pandas, jieba

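    import os
    import codecs
    import pandas
    import jieba
    import jieba.analyse

    # Number of keywords to keep per document.
    TOP_K = 5

    filePaths = []
    contents = []
    tag1s = []
    tag2s = []
    tag3s = []
    tag4s = []
    tag5s = []

    # Raw string, so the backslashes in the Windows path are not read as escapes.
    for root, dirs, files in os.walk(
        r"D:\PDM\2.6\SogouC.mini\Sample"
    ):
        for name in files:
            # os.path.join inserts the correct path separator.
            filePath = os.path.join(root, name)
            # Read the whole file as UTF-8; the with block closes it afterwards.
            with codecs.open(filePath, 'r', 'utf-8') as f:
                content = f.read().strip()
            tags = jieba.analyse.extract_tags(content, topK=TOP_K)
            # Short documents can yield fewer than TOP_K keywords;
            # pad with empty strings so the indexing below never fails.
            tags = tags + [''] * (TOP_K - len(tags))
            filePaths.append(filePath)
            contents.append(content)
            tag1s.append(tags[0])
            tag2s.append(tags[1])
            tag3s.append(tags[2])
            tag4s.append(tags[3])
            tag5s.append(tags[4])

    # One row per file: its path, its full text, and its top five keywords.
    tagDF = pandas.DataFrame({
        'filePath': filePaths,
        'content': contents,
        'tag1': tag1s,
        'tag2': tag2s,
        'tag3': tag3s,
        'tag4': tag4s,
        'tag5': tag5s
    })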
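
    The same per-file loop can also be written so that each file produces one row (a dict) instead of entries in seven parallel lists. The sketch below uses only the standard library. The names pad_tags, collect_rows, and first_words are hypothetical helpers, not part of jieba or pandas, and first_words is a deliberately naive stand-in extractor so that the example runs without jieba installed.

```python
import os
import tempfile


def pad_tags(tags, k=5, fill=""):
    """Return exactly k entries: truncate long lists, pad short ones with `fill`."""
    tags = list(tags)[:k]
    return tags + [fill] * (k - len(tags))


def collect_rows(folder, extract, k=5):
    """Walk `folder`, call `extract(text, k)` on each file, return one dict per file."""
    rows = []
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "r", encoding="utf-8") as f:
                content = f.read().strip()
            row = {"filePath": path, "content": content}
            # Columns tag1..tagk, padded so every row has the same keys.
            for i, tag in enumerate(pad_tags(extract(content, k), k), start=1):
                row["tag%d" % i] = tag
            rows.append(row)
    return rows


def first_words(text, k):
    """Stand-in extractor (hypothetical): the first k distinct whitespace-separated words."""
    seen = []
    for word in text.split():
        if word not in seen:
            seen.append(word)
    return seen[:k]


# Demo on a throwaway directory holding one short file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as f:
        f.write("data mining with python")
    demo_rows = collect_rows(tmp, first_words, k=5)
```

    With jieba installed, the extractor could instead be lambda text, k: jieba.analyse.extract_tags(text, topK=k), and the resulting list of dicts can be passed directly to pandas.DataFrame(rows).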
  • Original article: https://www.cnblogs.com/U940634/p/9736347.html