  • TF-IDF Term Weight Calculation

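    I. TF-IDF

    Term frequency:

    tf: term frequency. How often a term occurs in a document. The larger tf is, the more important the term is to that document.

    Document frequency:

    df: document frequency. The number of documents that contain the term. The larger df is, the less important the term is.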

    Term weight formula:

    W(t,d) = tf(t,d) * log(N / df(t))
    • W(t,d):the weight of the term in document d
    • tf(t,d):the frequency of term t in document d
    • N:the number of documents
    • df(t):the number of documents that contain term t
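To make the formula above concrete, here is a minimal sketch that plugs numbers into it. It assumes, as the Java implementation in this post does, that tf is the term count divided by the document length and that log is the natural logarithm. The class name `TfIdfWorkedExample` and the method `weight` are hypothetical names used only for this illustration.

```java
public class TfIdfWorkedExample {

    /**
     * W(t,d) = tf(t,d) * log(N / df(t)).
     * Assumption: tf is the raw count divided by the document length,
     * and log is the natural logarithm (Math.log).
     */
    static double weight(int termCount, int docLength, int totalDocs, int docFreq) {
        double tf = termCount / (double) docLength;
        double idf = Math.log(totalDocs / (double) docFreq);
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term that occurs once in a 7-word doc and appears in 2 of 4 docs:
        // (1/7) * ln(4/2), roughly 0.099.
        System.out.println(weight(1, 7, 4, 2));

        // The same term count, but the term appears in only 1 of 4 docs:
        // (1/7) * ln(4/1), roughly 0.198, so the rarer term weighs more.
        System.out.println(weight(1, 7, 4, 1));

        // A term that appears in every doc: ln(4/4) = 0, so the weight is 0.
        System.out.println(weight(1, 7, 4, 4));
    }
}
```

With tf held fixed, the weight shrinks as df grows and reaches zero once the term appears in every document, which is the behavior the definitions of tf and df describe.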

    II. Java Implementation

    package com.javacore.algorithm;
    
    import java.util.Arrays;
    import java.util.List;
    
    /**
     * Created by bee on 17/3/13.
     * @version 1.0
     * @author blog.csdn.net/napoay
     */
    public class TfIdfCal {
    
    
    
        /**
         * Calculates the term frequency, normalized by document length.
         * @param doc word vector of a doc
         * @param term  a word
         * @return occurrences of the term in the doc divided by the doc's word count
         */
        public double tf(List<String> doc, String term) {
    
            // An empty doc would otherwise yield 0.0 / 0 = NaN.
            if (doc.isEmpty()) {
                return 0;
            }
            double termFrequency = 0;
            for (String str : doc) {
                if (str.equalsIgnoreCase(term)) {
                    termFrequency++;
                }
            }
            return termFrequency / doc.size();
        }
    
    
        /**
         *calculate the document frequency
         * @param docs the set of all docs
         * @param term a word
         * @return the number of docs which contain the word
         */
    
        public int df(List<List<String>> docs, String term) {
            int n = 0;
            if (term != null && !term.isEmpty()) {
    
                for (List<String> doc : docs) {
                    for (String word : doc) {
                        if (term.equalsIgnoreCase(word)) {
                            n++;
                            break;
                        }
                    }
                }
            } else {
                    System.out.println("term must not be null or an empty string");
            }
    
            return n;
        }
    
    
        /**
         *calculate the inverse document frequency
         * @param docs  the set of all docs
         * @param term  a word
         * @return  idf
         */
    
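        public double idf(List<List<String>> docs, String term) {
    
            // Compute df once and reuse it for the log output and the result.
            int documentFrequency = df(docs, term);
            System.out.println("N:" + docs.size());
            System.out.println("DF:" + documentFrequency);
            // A term found in no doc would give log(N/0) = Infinity,
            // and tfIdf would then compute 0 * Infinity = NaN.
            if (documentFrequency == 0) {
                return 0;
            }
            return Math.log(docs.size() / (double) documentFrequency);
        }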
    
    
        /**
         * calculate tf-idf
         * @param doc a doc
         * @param docs document set
         * @param term a word
         * @return the tf-idf weight of the term in the doc
         */
        public double tfIdf(List<String> doc, List<List<String>> docs, String term) {
    
            return tf(doc, term) * idf(docs, term);
        }
    
    
        public static void main(String[] args) {
    
            List<String> doc1 = Arrays.asList("人工", "智能", "成为", "互联网", "大会", "焦点");
            List<String> doc2 = Arrays.asList("谷歌", "推出", "开源", "人工", "智能", "系统", "工具");
            List<String> doc3 = Arrays.asList("互联网", "的", "未来", "在", "人工", "智能");
            List<String> doc4 = Arrays.asList("谷歌", "开源", "机器", "学习", "工具");
            List<List<String>> documents = Arrays.asList(doc1, doc2, doc3,doc4);
    
    
            TfIdfCal calculator = new TfIdfCal();
    
            System.out.println(calculator.tf(doc2, "开源"));
            System.out.println(calculator.df(documents, "开源"));
            double tfidf = calculator.tfIdf(doc2, documents, "谷歌");
            System.out.println("TF-IDF (谷歌) = " + tfidf);
            // Manual check of the value above: tf = 1/7, idf = ln(4/2).
            System.out.println(Math.log(4.0 / 2) / 7);
    
        }
    
    
    }
    

    Output:

    0.14285714285714285
    2
    N:4
    DF:2
    TF-IDF (谷歌) = 0.09902102579427789
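The run above weights a single term. As a hedged extension, the sketch below weights every term of one document and sorts them, using the same tf (count divided by document length) and natural-log idf as the listing. Several things here are assumptions for illustration and not part of the original code: the class `TfIdfRanking`, the helpers `documentFrequencies` and `rank`, exact (case-sensitive) token matching, the requirement that `doc` is one of `docs`, and the English glosses used in place of the post's four sample documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TfIdfRanking {

    /** Counts, for each term, how many docs contain it, in one pass over the corpus. */
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate first so a doc adds at most 1 to each term's count.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Weights every distinct term of doc (assumed to be one of docs), highest first. */
    static Map<String, Double> rank(List<String> doc, List<List<String>> docs) {
        Map<String, Integer> df = documentFrequencies(docs);

        // LinkedHashMap keeps first-occurrence order, so ties stay in document order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : doc) {
            counts.merge(term, 1, Integer::sum);
        }

        List<Map.Entry<String, Double>> weighted = new ArrayList<>();
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int termCount = e.getValue();
            int docFreq = df.get(e.getKey()); // present because doc is one of docs
            double tf = termCount / (double) doc.size();
            double idf = Math.log(docs.size() / (double) docFreq);
            weights.put(e.getKey(), tf * idf);
        }
        weighted.addAll(weights.entrySet());

        // List.sort is stable, so equal weights keep their document order.
        weighted.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        Map<String, Double> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : weighted) {
            ranked.put(e.getKey(), e.getValue());
        }
        return ranked;
    }

    public static void main(String[] args) {
        // English glosses of the four sample documents (an assumption for readability).
        List<String> doc1 = Arrays.asList("artificial", "intelligence", "becomes", "internet", "conference", "focus");
        List<String> doc2 = Arrays.asList("google", "releases", "open-source", "artificial", "intelligence", "system", "tools");
        List<String> doc3 = Arrays.asList("internet", "of", "future", "in", "artificial", "intelligence");
        List<String> doc4 = Arrays.asList("google", "open-source", "machine", "learning", "tools");
        List<List<String>> documents = Arrays.asList(doc1, doc2, doc3, doc4);

        rank(doc2, documents).forEach((term, w) -> System.out.println(term + " = " + w));
    }
}
```

For the second document, the terms that occur in no other document come out on top, and the terms shared by three of the four documents come out last. Counting document frequencies once up front also avoids rescanning the whole corpus for every term, which the per-term `df` method in the listing does.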
  • Original article: https://www.cnblogs.com/liguangsunls/p/7356357.html