zoukankan html css js c++ java

NLP（二）：jieba高频词提取

高频词提取（TF，Term Frequency）,高频词指在文档中出现频率较高并且有用的词。

所以我们要做的工作有：加载数据，去除停用词，用字典统计高频词，输出top10的高频词。

import glob
import random
import jieba

def getContent(path):
    with open(path, encoding='utf-8', errors='ignore') as f:
        content = ''
        for line in f:
           #去除空行
            line = line.strip()
            content += line
        return content
    
def get_TF(words, topK=10):
    tf_dic = {}
    #遍历words中的每个词，如果这个词在tf_dic中出现过，则令其加一。
    for w in words:
        tf_dic[w] = tf_dic.get(w, 0) + 1
        #将字典tf_dic排序后取出前topK.
    return sorted(tf_dic.items(), key = lambda x: x[1], reverse=True)[:topK]

def stop_words(path):
    with open(path,encoding='utf-8') as f:
        return [l.strip() for l in f]
    
#修改cut函数,path是你的停用词表所放的位置
def cut(content,path):
    split_words = [x for x in jieba.cut(content) if x not in stop_words(path)]
    return split_words 


def main():
    files=glob.glob('C:/Users/Administrator/Desktop/stop_words/news/*.txt')
    corpus=[getContent(x) for x in files]
    sample_inx=random.randint(0,len(corpus))
    split_words=cut(corpus[sample_inx],'C:/Users/Administrator/Desktop/stop_words/stop_words.txt')
    print('样本之一：'+corpus[sample_inx])
    print('样本的分词效果：'+'/'.join(split_words))
    print('样本的topk10词为：'+str(get_TF(split_words)))
    
main()

运行结果如下：

这个代码需注意的地方有：将新闻复制粘贴到txt文件中注意需用utf8编码，然后在代码中体现为open函数中需要加‘encoding='utf-8'’；输出的结果是一个列表，列表中有许多元组，由词和词频构成。

在默认情况下，jieba采用常规切词来提取高频词汇，但是在特定背景，诸如医学，娱乐，体育，科学类文本下，需要该领域自己的特定词典，jieba分词允许我们加载自定义词典，代码如下：

jieba.load_userdict('./data/user_dict.utf8')

查看全文

相关阅读:
每日一题20180325
Linux下MySQL表名区分大小写
 CentOS删除编译安装的Python3
HTTPS配置
 测试 js 方法运行时间（转）
使用dbutils进行批处理
 oracle生成主键
 JDBC学习笔记(10)——调用函数&存储过程（转）
easyui Draggable
blob

原文地址：https://www.cnblogs.com/liuxiangyan/p/12458369.html