TF-IDF: Principle and Implementation

1. Principle

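\[
\begin{aligned}
\text{TF-IDF} &= tf_{t,d} \times idf_{t} \\
tf_{t,d} &= \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d} \\
idf_{t} &= \log\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)
\end{aligned}
\]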
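
To make the formulas concrete, here is a minimal worked example. The three-document corpus and the names corpus, term, df and scores are made up for illustration; the natural logarithm is assumed for log.

```python
import math

# Hypothetical three-document corpus, already split into terms.
corpus = {
    "d1": ["people", "like", "cats"],
    "d2": ["people", "like", "people"],
    "d3": ["dogs", "chase", "cats"],
}
term = "people"

# tf_{t,d}: occurrences of the term divided by the document's total term count.
tf = {d: words.count(term) / len(words) for d, words in corpus.items()}

# idf_t: log of (number of documents / number of documents containing the term).
df = sum(1 for words in corpus.values() if term in words)
idf = math.log(len(corpus) / df)

# TF-IDF is the product of the two factors, one score per document.
scores = {d: tf[d] * idf for d in corpus}
for d, s in scores.items():
    print(d, round(s, 4))
```

Here d2 scores highest because the term makes up two of its three tokens, while d3 scores 0 because it never contains the term.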

2. Pseudocode

3. Implementation

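The script expects a documents folder in the directory it is run from (the code resolves the path against os.getcwd()); place the document collection in that folder.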
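
For a quick trial, a documents folder can be created with a few sample files. This is only a sketch: the file names and texts below are invented, and the content is kept ASCII-only so that it reads the same under most encodings.

```python
import os

# Hypothetical sample texts; replace them with your own corpus.
samples = {
    "a.txt": "people like cats\npeople like dogs",
    "b.txt": "dogs chase cats",
    "c.txt": "the people of the city",
}

# Create ./documents relative to the current working directory.
doc_dir = os.path.join(os.getcwd(), "documents")
os.makedirs(doc_dir, exist_ok=True)

# Write one file per sample document.
for name, text in samples.items():
    with open(os.path.join(doc_dir, name), "w", encoding="utf-8") as f:
        f.write(text)

print(sorted(os.listdir(doc_dir)))
```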

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import math


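def set_doc():
    """Read every file in ./documents into a list of whitespace-separated terms."""
    docs = dict()
    doc_dir = os.path.join(os.getcwd(), "documents")
    for d in os.listdir(doc_dir):
        # Assumption: the documents are UTF-8 text; change the encoding to match your corpus.
        with open(os.path.join(doc_dir, d), encoding="utf-8") as f:
            # split() without arguments drops the empty strings that blank
            # lines and repeated spaces would otherwise add to the term count.
            docs[d] = f.read().split()
    return docs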


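def tf(docs, keyword):
    """Return {doc: term frequency}, i.e. matching terms / total terms in doc."""
    tfs = dict()
    for doc in docs:
        # Substring match: "people" also counts tokens such as "people," or "peoples".
        count = sum(1 for word in docs[doc] if keyword in word)
        # An empty document has no terms, so its term frequency is 0.
        tfs[doc] = count / len(docs[doc]) if docs[doc] else 0.0
    return tfs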


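def idf(docs, keyword):
    """Return log(N / df), where df is the number of documents containing keyword."""
    doc_with_keyword = set()
    for doc in docs:
        if any(keyword in word for word in docs[doc]):
            doc_with_keyword.add(doc)
    if not doc_with_keyword:
        # No document contains the keyword, so every tf is 0 as well;
        # return 0 instead of dividing by zero.
        return 0.0
    return math.log(len(docs) / len(doc_with_keyword))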


def tf_idf(tfs, term_idf):
    term_tf_idf = dict()
    for doc in tfs:
        term_tf_idf[doc] = tfs[doc] * term_idf
    return term_tf_idf


if __name__ == "__main__":
    keyword = "people"
    docs = set_doc()
    tfs = tf(docs, keyword)
    term_idf = idf(docs, keyword)
    term_tf_idf = tf_idf(tfs, term_idf)
    term_tf_idf = sorted(term_tf_idf.items(), key=lambda d:d[1], reverse=True)
    print(term_tf_idf)
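
The substring test `keyword in word` used in tf and idf also matches longer tokens such as "peoples". Where exact token matching is wanted, the ranking could be sketched as follows. This is an alternative for illustration, not the implementation above: rank_exact and the sample corpus are hypothetical names and data.

```python
import math
from collections import Counter


def rank_exact(docs, keyword):
    """Rank documents by tf-idf for keyword, counting only exact token matches."""
    counts = {name: Counter(words) for name, words in docs.items()}
    # Document frequency: number of documents that contain the exact token.
    df = sum(1 for c in counts.values() if c[keyword] > 0)
    if df == 0:
        return []  # the keyword occurs nowhere, so there is nothing to rank
    idf = math.log(len(docs) / df)
    scores = {
        name: (counts[name][keyword] / len(words) if words else 0.0) * idf
        for name, words in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical in-memory corpus in the same {name: [terms]} shape that set_doc() returns.
docs = {
    "a.txt": ["people", "like", "cats"],
    "b.txt": ["peoples", "republic"],
    "c.txt": ["dogs", "chase", "cats"],
}
ranking = rank_exact(docs, "people")
print(ranking)
```

With exact matching, b.txt no longer counts as containing "people", so only a.txt receives a non-zero score.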


Original article: https://www.cnblogs.com/fengyubo/p/7069443.html