  • Computing TF-IDF and Cosine Similarity in Java (Learning Version)

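    A note up front: since this is a learning version, it is not a production-ready implementation. The whole codebase does its matching with Lists, so you can imagine how inefficient it is.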

    [Reposted from]: http://computergodzilla.blogspot.com/2013/07/how-to-calculate-tf-idf-of-document.html, with some bugs fixed.

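    P.S.: Unless you are forced to keep everything in one language, try not to use this project to compute TF-IDF. For 20,000 short texts, a Matlab implementation takes only a few seconds, while this Java project takes a long time: half an hour, maybe even longer. This program is therefore a learning version and is not suitable for production use. Engineering trial version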

    Beginners working on a text mining project are often troubled by terms such as:

    • TF-IDF
    • COSINE SIMILARITY
    • CLUSTERING
    • DOCUMENT VECTORS

    In an earlier post I explained what Cosine Similarity is. I will not go over it again here; instead I will show a short piece of Java code that calculates it.

    Many of you are probably familiar with TF-IDF (Term Frequency-Inverse Document Frequency).
    For those who are not, here is a brief explanation.

    Term Frequency:
    Suppose a document titled "Tf-Idf Brief Introduction" contains 60000 words in total and the word "Term-Frequency" occurs 60 times.
    Then, mathematically, its Term Frequency is TF = 60/60000 = 0.001.

    Inverse Document Frequency:
    Suppose you bought the whole Harry Potter series. Suppose there are 7 books in the series and the word "AbraKaDabra" appears in 2 of them.
    Then, mathematically, its Inverse Document Frequency is IDF = 1 + log(7/2) = ……. (calculate it, guys, don't be lazy; I am the lazy one, not you.) Here log is the natural logarithm, matching the Math.log call used in the code.

    And finally, TFIDF = TF * IDF.

    Having seen the math, I assume you now have an intuitive feel for what it means.
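
The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and multiplying the TF from the first example by the IDF from the second is done purely to show the formula, since the two numbers come from different toy scenarios.

```java
// Worked example of the TF, IDF and TF-IDF arithmetic from the text.
// Class and method names are illustrative; they are not part of the original code.
final class TfIdfArithmetic {

    /** TF = occurrences of the term / total number of words in the document. */
    static double tf(int termCount, int totalWords) {
        return (double) termCount / totalWords;
    }

    /** IDF = 1 + ln(total documents / documents containing the term). */
    static double idf(int totalDocs, int docsWithTerm) {
        return 1 + Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        double tf = tf(60, 60000);  // 60 occurrences in 60000 words
        double idf = idf(7, 2);     // term appears in 2 of 7 books
        System.out.println("TF     = " + tf);
        System.out.println("IDF    = " + idf);
        System.out.println("TF-IDF = " + tf * idf);
    }
}
```

The casts to double matter: with plain int operands, 60 / 60000 and 7 / 2 would be truncated by integer division before the logarithm is taken.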

    Document Vector:
    There are various ways to calculate document vectors; I am just giving one example. Suppose I calculate the TF-IDF of every term for a document A and store the scores in an ordered container (an array, a list, a matrix ... you guys are geniuses, you know how to create a vector). Then I get a Document Vector of TF-IDF scores for document A.
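
As a rough illustration of that idea, the sketch below builds one TF-IDF vector for a document over a fixed vocabulary. The two-document corpus, the vocabulary and every name in it are hypothetical; the point is only that slot i of the vector always refers to vocabulary term i, so vectors of different documents can be compared position by position.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: build a TF-IDF document vector over a fixed, ordered vocabulary.
// The toy corpus, vocabulary and all names are hypothetical.
final class DocumentVectorSketch {

    /** Fraction of the document's words that equal the term (case-insensitive). */
    static double tf(String[] doc, String term) {
        int count = 0;
        for (String w : doc) {
            if (w.equalsIgnoreCase(term)) count++;
        }
        return (double) count / doc.length;
    }

    /** 1 + ln(number of documents / number of documents containing the term). */
    static double idf(List<String[]> docs, String term) {
        int docsWithTerm = 0;
        for (String[] doc : docs) {
            for (String w : doc) {
                if (w.equalsIgnoreCase(term)) { docsWithTerm++; break; }
            }
        }
        return 1 + Math.log((double) docs.size() / docsWithTerm);
    }

    /** One slot per vocabulary term, in vocabulary order. */
    static double[] vectorFor(String[] doc, List<String[]> docs, List<String> vocabulary) {
        double[] vector = new double[vocabulary.size()];
        for (int i = 0; i < vector.length; i++) {
            String term = vocabulary.get(i);
            vector[i] = tf(doc, term) * idf(docs, term);
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"java", "text", "mining"},
                new String[] {"java", "code"});
        List<String> vocabulary = Arrays.asList("java", "text", "mining", "code");
        System.out.println(Arrays.toString(vectorFor(docs.get(0), docs, vocabulary)));
    }
}
```

A term that never occurs in the document gets a score of 0 in its slot, which is why every vector has the same length no matter how short the document is.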

    The class shown below calculates the Term Frequency(TF) and Inverse Document Frequency(IDF).

    // TfIdf.java
    package com.computergodzilla.tfidf;

    import java.util.List;

    /**
     * Class to calculate the TF-IDF of a term.
     * @author Mubin Shrestha
     */
    public class TfIdf {

        /**
         * Calculates the tf of term termToCheck.
         * @param totalterms : array of all the words in the document being processed
         * @param termToCheck : term whose tf is to be calculated
         * @return tf (term frequency) of term termToCheck
         */
        public double tfCalculator(String[] totalterms, String termToCheck) {
            double count = 0;  // occurrences of termToCheck in the document
            for (String s : totalterms) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                }
            }
            return count / totalterms.length;
        }

        /**
         * Calculates the idf of term termToCheck.
         * @param allTerms : the terms of every document, one array per document
         * @param termToCheck : term whose idf is to be calculated
         * @return idf (inverse document frequency) score
         */
        public double idfCalculator(List<String[]> allTerms, String termToCheck) {
            double count = 0;  // number of documents that contain termToCheck
            for (String[] ss : allTerms) {
                for (String s : ss) {
                    if (s.equalsIgnoreCase(termToCheck)) {
                        count++;
                        break;  // count each document at most once
                    }
                }
            }
            // count is a double, so a term found in no document yields Infinity rather than an exception
            return 1 + Math.log(allTerms.size() / count);
        }
    }

    The class shown below parses the text documents and splits them into tokens. It calls the TfIdf.java class to calculate TF-IDF, and it calls the CosineSimilarity.java class to calculate the similarity between the parsed documents.

    // DocumentParser.java

    package com.computergodzilla.tfidf;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    /**
     * Class to read documents
     *
     * @author Mubin Shrestha
     */
    public class DocumentParser {

        // Holds the terms of each document, one array per document.
        private List<String[]> termsDocsArray = new ArrayList<String[]>();
        private List<String> allTerms = new ArrayList<String>(); // all distinct terms
        private List<double[]> tfidfDocsVector = new ArrayList<double[]>();

        /**
         * Method to read files and store their terms.
         * @param filePath : path of the folder that holds the source .txt files
         * @throws FileNotFoundException
         * @throws IOException
         */
        public void parseFiles(String filePath) throws FileNotFoundException, IOException {
            File[] allfiles = new File(filePath).listFiles();
            if (allfiles == null) {  // listFiles() returns null when the path is not a readable directory
                throw new FileNotFoundException("Not a readable directory: " + filePath);
            }
            for (File f : allfiles) {
                if (f.getName().endsWith(".txt")) {
                    StringBuilder sb = new StringBuilder();
                    // try-with-resources closes the reader even if reading fails
                    try (BufferedReader in = new BufferedReader(new FileReader(f))) {
                        String s;
                        while ((s = in.readLine()) != null) {
                            sb.append(s).append(' ');  // keep words on adjacent lines apart
                        }
                    }
                    // strip punctuation, then split on non-word characters to get individual terms
                    String[] tokenizedTerms = sb.toString().trim().replaceAll("[\\W&&[^\\s]]", "").split("\\W+");
                    for (String term : tokenizedTerms) {
                        if (!allTerms.contains(term)) {  // avoid duplicate entries
                            allTerms.add(term);
                        }
                    }
                    termsDocsArray.add(tokenizedTerms);
                }
            }
        }

        /**
         * Method to create a term vector for each document from its tfidf scores.
         */
        public void tfIdfCalculator() {
            TfIdf calculator = new TfIdf();  // one instance is enough for every term
            double tf;    // term frequency
            double idf;   // inverse document frequency
            double tfidf; // term frequency inverse document frequency
            for (String[] docTermsArray : termsDocsArray) {
                double[] tfidfvectors = new double[allTerms.size()];
                int count = 0;
                for (String terms : allTerms) {
                    tf = calculator.tfCalculator(docTermsArray, terms);
                    idf = calculator.idfCalculator(termsDocsArray, terms);
                    tfidf = tf * idf;
                    tfidfvectors[count] = tfidf;
                    count++;
                }
                tfidfDocsVector.add(tfidfvectors);  // store the document vector
            }
        }

        /**
         * Method to calculate cosine similarity between all the documents.
         */
        public void getCosineSimilarity() {
            CosineSimilarity similarity = new CosineSimilarity();
            for (int i = 0; i < tfidfDocsVector.size(); i++) {
                for (int j = 0; j < tfidfDocsVector.size(); j++) {
                    System.out.println("between " + i + " and " + j + "  =  "
                            + similarity.cosineSimilarity(tfidfDocsVector.get(i), tfidfDocsVector.get(j)));
                }
            }
        }
    }
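
To see what the tokenization step in parseFiles does, here is a small stand-alone sketch of just that expression. The sample sentences and the names TokenizeSketch and tokenize are made up for illustration.

```java
import java.util.Arrays;

// Sketch of the tokenization step only; the sample sentence is made up.
final class TokenizeSketch {

    /** Removes punctuation (non-word, non-whitespace characters), then splits on non-word runs. */
    static String[] tokenize(String text) {
        return text.replaceAll("[\\W&&[^\\s]]", "").split("\\W+");
    }

    public static void main(String[] args) {
        // prints [Happy, textmining, everyone]
        System.out.println(Arrays.toString(tokenize("Happy text-mining, everyone!")));
    }
}
```

Two consequences are worth knowing. Punctuation is deleted rather than replaced by a space, so a hyphenated word such as "text-mining" collapses into the single token "textmining". Also, if the text starts with whitespace, split returns an empty string as its first token, so trimming the input first is a reasonable precaution.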

    This is the class that calculates Cosine Similarity:

    // CosineSimilarity.java
    package com.computergodzilla.tfidf;

    /**
     * Cosine similarity calculator class
     * @author Mubin Shrestha
     */
    public class CosineSimilarity {

        /**
         * Method to calculate cosine similarity between two documents.
         * @param docVector1 : document vector 1 (a)
         * @param docVector2 : document vector 2 (b)
         * @return cosine similarity of a and b, or 0.0 if either vector has zero magnitude
         */
        public double cosineSimilarity(double[] docVector1, double[] docVector2) {
            double dotProduct = 0.0;
            double magnitude1 = 0.0;
            double magnitude2 = 0.0;
            double cosineSimilarity = 0.0;

            for (int i = 0; i < docVector1.length; i++) // docVector1 and docVector2 must be of same length
            {
                dotProduct += docVector1[i] * docVector2[i];  // a.b
                magnitude1 += Math.pow(docVector1[i], 2);  // sum of a^2
                magnitude2 += Math.pow(docVector2[i], 2);  // sum of b^2
            }

            magnitude1 = Math.sqrt(magnitude1); // |a|
            magnitude2 = Math.sqrt(magnitude2); // |b|

            // both magnitudes must be non-zero, otherwise the division yields NaN
            if (magnitude1 != 0.0 && magnitude2 != 0.0) {
                cosineSimilarity = dotProduct / (magnitude1 * magnitude2);
            } else {
                return 0.0;
            }
            return cosineSimilarity;
        }
    }
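
A quick way to build confidence in a cosine similarity routine is to feed it vectors whose answer is known in advance. The sketch below is a compact stand-alone version written for that purpose; the class name, method name and test vectors are all illustrative.

```java
// Sketch: cosine similarity on small hand-made vectors; all names are illustrative.
final class CosineSketch {

    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {  // a and b must have the same length
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;  // a zero vector has no direction; report "not similar" instead of NaN
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 0};
        double[] b = {2, 4, 0};     // same direction as a
        double[] c = {0, 0, 3};     // orthogonal to a
        double[] zero = {0, 0, 0};  // e.g. a document sharing no vocabulary terms
        System.out.println(cosine(a, b));     // very close to 1.0
        System.out.println(cosine(a, c));     // 0.0
        System.out.println(cosine(a, zero));  // 0.0
    }
}
```

Because TF-IDF scores are never negative, similarities between document vectors fall between 0 (no shared terms) and 1 (same direction).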

    Here’s the main class to run the code:

    // TfIdfMain.java
    package com.computergodzilla.tfidf;

    import java.io.FileNotFoundException;
    import java.io.IOException;

    /**
     * Entry point that runs the whole pipeline.
     * @author Mubin Shrestha
     */
    public class TfIdfMain {

        /**
         * Main method
         * @param args
         * @throws FileNotFoundException
         * @throws IOException
         */
        public static void main(String[] args) throws FileNotFoundException, IOException
        {
            DocumentParser dp = new DocumentParser();
            dp.parseFiles("D:\\FolderToCalculateCosineSimilarityOf"); // give the location of the source folder
            dp.tfIdfCalculator(); // calculates tfidf
            dp.getCosineSimilarity(); // calculates cosine similarity
        }
    }

    You can also download the whole source code from here: Download. (Google Drive)

    Overall, what I did is this: I first calculate the TF-IDF scores of all the terms for all the documents, which gives a document vector for each document. Then I use those document vectors to calculate cosine similarity.
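
Since the List-based matching is the main reason this version is slow, here is one possible direction for speeding up the counting, offered as a sketch rather than a drop-in replacement. It counts term occurrences and document frequencies once with hash maps and then looks them up, instead of rescanning every document for every term. All names and the toy corpus are hypothetical, and terms are lower-cased on insertion as a substitute for equalsIgnoreCase.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Sketch: hash-map based counting for TF-IDF. Names and data are hypothetical.
final class MapBasedCounts {

    /** Occurrences of each lower-cased term within one document. */
    static Map<String, Integer> termCounts(String[] doc) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : doc) {
            counts.merge(w.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    /** Number of documents that contain each lower-cased term. */
    static Map<String, Integer> documentFrequencies(List<String[]> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : docs) {
            Set<String> seen = new HashSet<>();
            for (String w : doc) {
                String key = w.toLowerCase(Locale.ROOT);
                if (seen.add(key)) {  // true only the first time this document yields the term
                    df.merge(key, 1, Integer::sum);
                }
            }
        }
        return df;
    }

    /** TF-IDF of a term from precomputed counts; 0.0 if the term is absent. */
    static double tfIdf(String term, Map<String, Integer> counts, int docLength,
                        Map<String, Integer> df, int totalDocs) {
        String key = term.toLowerCase(Locale.ROOT);
        int count = counts.getOrDefault(key, 0);
        int docsWithTerm = df.getOrDefault(key, 0);
        if (count == 0 || docsWithTerm == 0) {
            return 0.0;
        }
        double tf = (double) count / docLength;
        double idf = 1 + Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"Java", "text", "java"},
                new String[] {"code", "java"});
        Map<String, Integer> df = documentFrequencies(docs);
        Map<String, Integer> counts = termCounts(docs.get(0));
        System.out.println(tfIdf("java", counts, docs.get(0).length, df, docs.size()));
    }
}
```

Each map lookup takes roughly constant time, so the total work grows with the number of tokens plus the number of scores computed, not with tokens multiplied by vocabulary size.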

    If you think this explanation is not clear enough, let me know.
    Happy Text-Mining!!

    from: http://jacoxu.com/?p=1619

  • Original URL: https://www.cnblogs.com/GarfieldEr007/p/5342720.html