  • TF-IDF: The Term Frequency-Inverse Document Frequency Algorithm

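    I. Introduction

      1. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining.

      2. TF-IDF is a statistical measure of how important a word is to one document within a document collection or corpus.

      3. A word's importance increases with the number of times it appears in the document, but decreases as the word becomes more frequent across the corpus.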

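    II. Term Frequency

      Term frequency (TF) is the number of times a given word appears in a given document. This count is usually normalized to avoid a bias toward long documents (the same word is likely to have a higher raw count in a long document than in a short one, regardless of whether the word is important).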

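      Formula:

        $$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

      Here $n_{i,j}$ is the number of times the term $t_i$ appears in document $d_j$, and the denominator is the total number of occurrences of all terms in document $d_j$.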
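
      The normalization described above can be sketched in plain Scala, without Spark. The object name, the function name `termFrequency`, and the sample sentence are assumptions made for illustration only; they are not part of the original program.

```scala
object TermFrequencyExample {
  /** Normalized term frequency: occurrences of each term divided by the total number of terms. */
  def termFrequency(tokens: Seq[String]): Map[String, Double] = {
    val total = tokens.size.toDouble
    tokens.groupBy(identity).map { case (term, occurrences) =>
      term -> occurrences.size / total
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical document, chosen only for illustration
    val doc = "spark is fast and spark is general".split(" ").toSeq
    // Print terms from most to least frequent
    termFrequency(doc).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

      Because every count is divided by the same total, the frequencies of one document always sum to 1, which is what makes documents of different lengths comparable.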

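    III. Inverse Document Frequency

      Inverse document frequency (IDF) measures the general importance of a word. The IDF of a given term is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of the quotient.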

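      Formula:

        $$\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

      $|D|$: the total number of documents in the corpus

      $|\{j : t_i \in d_j\}|$: the number of documents that contain the term $t_i$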
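
      As a minimal sketch of this definition, the following plain Scala function computes the IDF of every term in a small in-memory corpus using the natural logarithm. The object name, `inverseDocumentFrequency`, and the three sample documents are hypothetical and serve only as an illustration.

```scala
object InverseDocumentFrequencyExample {
  /** idf(t) = ln(|D| / number of documents containing t), for every term seen in the corpus. */
  def inverseDocumentFrequency(corpus: Seq[Seq[String]]): Map[String, Double] = {
    val numDocs = corpus.size.toDouble
    corpus
      .flatMap(_.distinct) // count each term at most once per document
      .groupBy(identity)
      .map { case (term, hits) => term -> math.log(numDocs / hits.size) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical three-document corpus
    val corpus = Seq(
      Seq("spark", "is", "fast"),
      Seq("spark", "runs", "on", "yarn"),
      Seq("spark", "sql")
    )
    inverseDocumentFrequency(corpus).toSeq.sortBy(_._2).foreach(println)
  }
}
```

      A term that occurs in every document gets ln(1) = 0, while a term found in a single document gets the largest value, ln(|D|).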

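    IV. TF-IDF

      Formula: TF-IDF = TF * IDF

      Characteristics: a high term frequency within a particular document, combined with a low document frequency of that term across the whole corpus, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.

      Idea: if a word or phrase appears with a high frequency (TF) in one article and rarely in other articles, it is considered to discriminate well between categories and to be well suited for classification.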
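
      The filtering effect can be illustrated with a small plain Scala sketch. It assumes the scored document is itself part of the corpus, so every term has a document frequency of at least 1; the object name, `tfIdf`, and the sample data are hypothetical.

```scala
object TfIdfExample {
  /** TF-IDF of each term in `doc`; assumes `doc` is one of the documents in `corpus`. */
  def tfIdf(doc: Seq[String], corpus: Seq[Seq[String]]): Map[String, Double] = {
    val total = doc.size.toDouble
    val numDocs = corpus.size.toDouble
    doc.groupBy(identity).map { case (term, occurrences) =>
      val tf = occurrences.size / total       // normalized term frequency
      val df = corpus.count(_.contains(term)) // number of documents containing the term
      val idf = math.log(numDocs / df)        // natural-log inverse document frequency
      term -> tf * idf
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical corpus: "spark" occurs in every document, "sql" in only one
    val corpus = Seq(
      Seq("spark", "sql", "spark"),
      Seq("spark", "streaming"),
      Seq("spark", "mllib")
    )
    tfIdf(corpus.head, corpus).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

      In this sample, "spark" has the highest term frequency in the first document but appears in all three documents, so its weight is zero; "sql" is rarer in the document yet unique to it, so it receives the higher weight.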

    V. Code Implementation

    package big.data.analyse.tfidf

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.sql.SparkSession

    /**
      * Created by zhen on 2019/05/28.
      */
    object TF_IDF {
      // Set the log level to keep Spark's console output short
      Logger.getLogger("org").setLevel(Level.WARN)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession
          .builder()
          .appName("TF_IDF")
          .master("local[2]")
          .config("spark.sql.warehouse.dir", "file:///D://warehouse")
          .getOrCreate()
        val sc = spark.sparkContext

        // Compute TF: count every word of the target document
        val tf = sc.textFile("src/big/data/analyse/tfidf/TF.txt")
          .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " ")) // data cleaning
          .flatMap(row => row.split(" ")) // split into words
          .map(row => (row, 1.0))
          .reduceByKey(_ + _)

        val tfSize = tf.map(row => row._2).sum() // total number of words

        val tfed = tf.map(row => (row._1, row._2 / tfSize)) // normalized term frequency
        println("TF:")
        tfed.foreach(println)

        // Compute IDF: mark each distinct word of every document with 1.0
        val idf_0 = tf.map(row => (row._1, 1.0))
        println("Loading IDF1 file data...")
        val idf_1 = sc.textFile("src/big/data/analyse/tfidf/IDF1.txt")
          .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " "))
          .flatMap(row => row.split(" "))
          .map(row => (row, 1.0))
          .reduceByKey(_ + _)
          .map(row => (row._1, 1.0))

        println("Loading IDF2 file data...")
        val idf_2 = sc.textFile("src/big/data/analyse/tfidf/IDF2.txt")
          .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " "))
          .flatMap(row => row.split(" "))
          .map(row => (row, 1.0))
          .reduceByKey(_ + _)
          .map(row => (row._1, 1.0))

        // Merge the corpus: summing the 1.0 markers gives each word's document frequency.
        // Note: this uses the raw ratio N / df for the three documents. The formula in
        // section III additionally takes the logarithm (math.log); the results in
        // section VI were produced with the raw ratio.
        val numDocs = 3.0
        val idf = idf_0.union(idf_1).union(idf_2)
          .reduceByKey(_ + _)
          .map(row => (row._1, numDocs / row._2))
        println("IDF:")
        idf.foreach(println)

        // Join TF with IDF and compute TF-IDF, formatted to four decimal places
        println("TF-IDF:")
        tfed.join(idf).map(row => (row._1, "%.4f".format(row._2._1 * row._2._2)))
          .foreach(println)

        spark.stop()
      }
    }
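
      The cleaning step in the listing replaces only commas and periods and splits on a single space, so empty strings and tokens with attached parentheses can end up as keys. One possible stricter tokenizer is sketched below in plain Scala; it is an assumption about how the cleaning could be tightened, not the author's method, and the names `TokenizeExample` and `tokenize` are hypothetical.

```scala
object TokenizeExample {
  /** Splits on any run of characters that is not a letter, digit, hyphen, or underscore. */
  def tokenize(line: String): Seq[String] =
    line.split("[^\\p{L}\\p{N}_-]+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Hypothetical input line with doubled spaces, punctuation, and parentheses
    tokenize("Spark SQL,  MLlib (RDD). general-purpose").foreach(println)
  }
}
```

      In the Spark program such a function could be passed to `flatMap` in place of the `replace` and `split(" ")` calls. Doing so changes which tokens are counted, so the numbers in the results section would no longer be reproduced exactly.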

    VI. Results

    TF:
    (GraphX,0.011494252873563218)
    (are,0.011494252873563218)
    (learning,0.011494252873563218)
    (Python,0.011494252873563218)
    (provides,0.011494252873563218)
    (is,0.022988505747126436)
    (Please,0.011494252873563218)
    (higher-level,0.011494252873563218)
    (general,0.011494252873563218)
    (Security,0.034482758620689655)
    (R,0.011494252873563218)
    (fast,0.011494252873563218)
    (SQL,0.022988505747126436)
    (Apache,0.011494252873563218)
    (Java,0.011494252873563218)
    (data,0.011494252873563218)
    (attack,0.011494252873563218)
    (This,0.011494252873563218)
    (cluster,0.011494252873563218)
    (graph,0.011494252873563218)
    (execution,0.011494252873563218)
    (MLlib,0.011494252873563218)
    (Scala,0.011494252873563218)
    (computing,0.011494252873563218)
    (downloading,0.011494252873563218)
    (Streaming,0.011494252873563218)
    (supports,0.022988505747126436)
    (engine,0.011494252873563218)
    (set,0.011494252873563218)
    (running,0.011494252873563218)
    (Spark,0.08045977011494253)
    (you,0.011494252873563218)
    (Overview,0.011494252873563218)
    (general-purpose,0.011494252873563218)
    (rich,0.011494252873563218)
    (APIs,0.011494252873563218)
    (vulnerable,0.011494252873563218)
    (that,0.011494252873563218)
    (a,0.022988505747126436)
    (high-level,0.011494252873563218)
    (processing,0.022988505747126436)
    (OFF,0.011494252873563218)
    (before,0.011494252873563218)
    (including,0.011494252873563218)
    (could,0.011494252873563218)
    (optimized,0.011494252873563218)
    (in,0.022988505747126436)
    (to,0.011494252873563218)
    (see,0.011494252873563218)
    (graphs,0.011494252873563218)
    (of,0.011494252873563218)
    (also,0.011494252873563218)
    (by,0.022988505747126436)
    (structured,0.011494252873563218)
    (tools,0.011494252873563218)
    (It,0.022988505747126436)
    (for,0.034482758620689655)
    (mean,0.011494252873563218)
    (an,0.011494252873563218)
    (machine,0.011494252873563218)
    (and,0.06896551724137931)
    (system,0.011494252873563218)
    (default,0.022988505747126436)
    Loading IDF1 file data...
    Loading IDF2 file data...
    IDF:
    (running,1.5)
    (For,3.0)
    (visit,3.0)
    (The,3.0)
    (you,1.0)
    (website,1.5)
    (than,3.0)
    (7,3.0)
    (PATH,3.0)
    (that,1.0)
    (was,1.5)
    (a,1.0)
    (main,3.0)
    (old,3.0)
    (high-level,1.5)
    (be,1.5)
    (quick,3.0)
    (processing,1.5)
    (could,1.5)
    (all,3.0)
    (augmenting,3.0)
    (optimized,1.5)
    (Downloads,3.0)
    (follow,3.0)
    (applications,3.0)
    (classpath,3.0)
    (structured,1.5)
    (like,1.5)
    (along,3.0)
    (support,3.0)
    (Spark’s,1.5)
    (If,3.0)
    (but,3.0)
    (and,1.0)
    (reference,3.0)
    (1,3.0)
    (g,3.0)
    (system,1.5)
    (your,3.0)
    (10,3.0)
    (It’s,3.0)
    (are,1.0)
    (learning,1.5)
    (download,1.5)
    (its,3.0)
    (After,3.0)
    (Building,3.0)
    (can,1.5)
    (Security,1.5)
    (have,3.0)
    (runs,3.0)
    (6,3.0)
    (build,3.0)
    (0,1.5)
    (SQL,1.0)
    (with,1.5)
    (locally,3.0)
    (projects,3.0)
    (their,3.0)
    (Get,3.0)
    (UNIX-like,3.0)
    (This,1.0)
    (,1.5)
    (first,3.0)
    (documentation,3.0)
    (Since,3.0)
    (still,3.0)
    (Downloading,3.0)
    (packaged,3.0)
    (better,3.0)
    (However,3.0)
    (switch,3.0)
    (hood,3.0)
    (Linux,3.0)
    (Streaming,1.5)
    (supports,1.5)
    (PyPI,3.0)
    ((2,3.0)
    (vulnerable,1.5)
    (RDD,3.0)
    (Dataset,3.0)
    (package,3.0)
    (this,3.0)
    (under,3.0)
    (Python,1.0)
    (provides,1.0)
    (API,1.5)
    (higher-level,1.5)
    (introduction,3.0)
    (Apache,1.5)
    (will,1.5)
    (Java,1.0)
    (2,1.5)
    (data,1.5)
    (as,3.0)
    (YARN,3.0)
    (installed,3.0)
    (pointing,3.0)
    (optimizations,3.0)
    (get,3.0)
    (cluster,1.5)
    (tutorial,3.0)
    (graph,1.5)
    (easy,3.0)
    (execution,1.5)
    (MLlib,1.5)
    (We,3.0)
    (you’d,3.0)
    (supported,3.0)
    (downloading,1.5)
    (shell,3.0)
    (handful,3.0)
    (1+,3.0)
    (Users,3.0)
    (engine,1.5)
    (version,1.5)
    (11,3.0)
    (set,1.5)
    (performance,3.0)
    (rich,1.5)
    (systems,3.0)
    (replaced,3.0)
    (Spark,1.0)
    (project,3.0)
    (Overview,1.5)
    (APIs,1.5)
    (Mac,3.0)
    (or,1.5)
    (popular,3.0)
    (Support,3.0)
    (richer,3.0)
    (downloads,3.0)
    (OFF,1.5)
    (future,3.0)
    (detailed,3.0)
    (GraphX,1.5)
    (removed,3.0)
    (4,3.0)
    (installation,3.0)
    (Please,1.5)
    (is,1.0)
    (guide,3.0)
    (recommend,3.0)
    (R,1.5)
    (general,1.5)
    (JAVA_HOME,3.0)
    (fast,1.5)
    (include,3.0)
    (need,3.0)
    (one,3.0)
    (attack,1.5)
    (how,3.0)
    (uses,3.0)
    (compatible,3.0)
    (information,3.0)
    (we,3.0)
    (interactive,3.0)
    (—,3.0)
    (using,1.5)
    (Note,1.5)
    (7+/3,3.0)
    (java,3.0)
    (pre-packaged,3.0)
    (Scala,1.0)
    (any,1.5)
    (computing,1.5)
    (variable,3.0)
    (users,3.0)
    (from,1.5)
    (has,3.0)
    (won’t,3.0)
    (through,3.0)
    (at,3.0)
    (more,3.0)
    (3,3.0)
    (versions,3.0)
    (of,1.0)
    (tools,1.5)
    (8+,3.0)
    (by,1.0)
    (mean,1.5)
    (RDDs,3.0)
    ((e,3.0)
    (It,1.5)
    (for,1.0)
    (To,3.0)
    (were,3.0)
    (both,3.0)
    (an,1.0)
    (12,3.0)
    (which,3.0)
    (machine,1.5)
    (libraries,3.0)
    (introduce,3.0)
    (environment,3.0)
    ((in,3.0)
    (programming,3.0)
    (See,3.0)
    (use,1.5)
    (default,1.5)
    (the,1.5)
    (write,3.0)
    (highly,3.0)
    (release,3.0)
    (Resilient,3.0)
    (interface,3.0)
    (strongly-typed,3.0)
    (about,3.0)
    (run,3.0)
    (general-purpose,1.5)
    (5,3.0)
    (Distributed,3.0)
    (on,3.0)
    (You,3.0)
    (source,3.0)
    (Scala),3.0)
    (show,3.0)
    (then,3.0)
    (before,1.0)
    (including,1.5)
    (to,1.0)
    (in,1.0)
    (client,3.0)
    (see,1.5)
    (HDFS,1.5)
    (graphs,1.5)
    (Hadoop’s,3.0)
    (also,1.5)
    (“Hadoop,3.0)
    (binary,3.0)
    (x),3.0)
    (free”,3.0)
    (Maven,3.0)
    (coordinates,3.0)
    (Windows,3.0)
    (deprecated,3.0)
    (install,3.0)
    ((RDD),3.0)
    (4+,3.0)
    (page,3.0)
    (OS),3.0)
    (Hadoop,1.5)
    TF-IDF:
    (you,0.0115)
    (that,0.0115)
    (a,0.0230)
    (high-level,0.0172)
    (processing,0.0345)
    (could,0.0172)
    (optimized,0.0172)
    (structured,0.0172)
    (and,0.0690)
    (system,0.0172)
    (are,0.0115)
    (learning,0.0172)
    (Security,0.0517)
    (SQL,0.0230)
    (This,0.0115)
    (Streaming,0.0172)
    (supports,0.0345)
    (vulnerable,0.0172)
    (Spark,0.0805)
    (Overview,0.0172)
    (APIs,0.0172)
    (OFF,0.0172)
    (of,0.0115)
    (tools,0.0172)
    (by,0.0230)
    (mean,0.0172)
    (It,0.0345)
    (for,0.0345)
    (an,0.0115)
    (machine,0.0172)
    (default,0.0345)
    (Python,0.0115)
    (provides,0.0115)
    (higher-level,0.0172)
    (Apache,0.0172)
    (GraphX,0.0172)
    (Please,0.0172)
    (is,0.0230)
    (R,0.0172)
    (general,0.0172)
    (fast,0.0172)
    (attack,0.0172)
    (Java,0.0115)
    (Scala,0.0115)
    (computing,0.0172)
    (data,0.0172)
    (cluster,0.0172)
    (graph,0.0172)
    (execution,0.0172)
    (MLlib,0.0172)
    (downloading,0.0172)
    (engine,0.0172)
    (set,0.0172)
    (rich,0.0172)
    (general-purpose,0.0172)
    (before,0.0115)
    (including,0.0172)
    (to,0.0115)
    (in,0.0230)
    (see,0.0172)
    (graphs,0.0172)
    (also,0.0172)
    
    Process finished with exit code 0
  • Original article: https://www.cnblogs.com/yszd/p/10939583.html