  • Four tools for extracting text from Word and PDF files in Java

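    1. Using jacob.
        jacob is a bridge: middleware that connects Java to COM or Win32 functions. It cannot extract Word, Excel, or similar files by itself; the native side needs a DLL. You do not have to write that DLL yourself, because the jacob author ships one with the library.
        jacob download: http://www.matrix.org.cn/down_view.asp?id=13
        After downloading jacob, put the DLL on the PATH and the jar file on the classpath. You can then write your own extraction program. Here is an example: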
    import com.jacob.activeX.ActiveXComponent;
    import com.jacob.com.Dispatch;
    import com.jacob.com.Variant;

    public class FileExtracter {

        public static void main(String[] args) {
            // Start Word through COM; needs Word installed and jacob's DLL on the PATH
            ActiveXComponent app = new ActiveXComponent("Word.Application");
            String inFile = "c:\\test.doc";   // source document
            String tpFile = "c:\\temp.htm";   // converted output
            try {
                app.setProperty("Visible", new Variant(false));
                Dispatch docs = app.getProperty("Documents").toDispatch();
                // Open(FileName, ConfirmConversions=false, ReadOnly=true)
                Dispatch doc = Dispatch.invoke(docs, "Open", Dispatch.Method,
                        new Object[] { inFile, new Variant(false), new Variant(true) },
                        new int[1]).toDispatch();
                // SaveAs(FileName, FileFormat); 8 is Word's HTML format
                Dispatch.invoke(doc, "SaveAs", Dispatch.Method,
                        new Object[] { tpFile, new Variant(8) }, new int[1]);
                // Close without saving changes to the source document
                Dispatch.call(doc, "Close", new Variant(false));
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                // Always quit, otherwise the Word process keeps running in the background
                app.invoke("Quit", new Variant[] {});
            }
        }
    }
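
The 8 passed to SaveAs in the example above is Word's WdSaveFormat value for HTML. When plain text is the goal, one of the text formats may be more convenient than HTML. A minimal sketch that names a few of these values and picks one from the output file's extension follows; the class WordSaveFormats and the method forFile are hypothetical helpers introduced here, not part of jacob, and the numbers are the WdSaveFormat values from Word's object model.

```java
import java.util.Locale;

/** Hypothetical helper: names for a few Word WdSaveFormat values used with SaveAs. */
public class WordSaveFormats {
    public static final int WD_FORMAT_TEXT = 2;          // plain text
    public static final int WD_FORMAT_RTF = 6;           // rich text format
    public static final int WD_FORMAT_UNICODE_TEXT = 7;  // Unicode plain text
    public static final int WD_FORMAT_HTML = 8;          // HTML, as in the example above

    /** Picks a save format from the output file's extension. */
    public static int forFile(String outFile) {
        String name = outFile.toLowerCase(Locale.ROOT);
        if (name.endsWith(".htm") || name.endsWith(".html")) {
            return WD_FORMAT_HTML;
        }
        if (name.endsWith(".rtf")) {
            return WD_FORMAT_RTF;
        }
        if (name.endsWith(".txt")) {
            return WD_FORMAT_TEXT;
        }
        throw new IllegalArgumentException("No save format known for " + outFile);
    }

    public static void main(String[] args) {
        System.out.println(forFile("c:\\temp.htm"));
        System.out.println(forFile("c:\\temp.txt"));
    }
}
```

With jacob, the result could be passed as new Variant(WordSaveFormats.forFile(tpFile)) in place of new Variant(8).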
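        2. Using Apache POI to extract Word and Excel.
        POI is an Apache project. Using POI directly can still feel tedious, so a simpler wrapper interface is provided here:
        Download the wrapped POI package: http://www.matrix.org.cn/down_view.asp?id=14
        After downloading, put it on your classpath. Here is an example of how to use it: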
    import java.io.FileInputStream;
    import org.textmining.text.extraction.WordExtractor;
    /**
    * <p>Title: word extraction</p>
    * <p>Description: email:chris@matrix.org.cn</p>
    * <p>Copyright: Matrix Copyright (c) 2003</p>
    * <p>Company: Matrix.org.cn</p>
    * @author chris
    * @version 1.0, please retain this notice if you use this example
    */

    public class WordTextExtractor {
        public static void main(String[] args) throws Exception {
            FileInputStream in = new FileInputStream("c:\\a.doc");
            try {
                // The wrapper reads the .doc stream and returns its text
                WordExtractor extractor = new WordExtractor();
                String str = extractor.extractText(in);
                System.out.println("the result length is " + str.length());
                System.out.println("the result is " + str);
            } finally {
                // Release the file handle even if extraction fails
                in.close();
            }
        }
    }
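
Text pulled out of a binary .doc file may still carry Word's internal control characters, for example carriage returns used as paragraph marks and the bell character (U+0007) used as a table cell mark. A minimal cleanup sketch, assuming the extractor returns such raw text; WordTextCleaner and clean are hypothetical names introduced here, not part of POI or the wrapper package.

```java
/** Hypothetical helper: tidies raw text extracted from a Word document. */
public class WordTextCleaner {

    public static String clean(String raw) {
        // Paragraph marks: normalize CRLF and bare CR to LF
        String s = raw.replace("\r\n", "\n").replace('\r', '\n');
        // Table cell marks (U+0007): turn into tabs so columns stay separated
        s = s.replace('\u0007', '\t');
        // Drop any other ASCII control character, keeping newline and tab
        s = s.replaceAll("[\\p{Cntrl}&&[^\n\t]]", "");
        // Remove leading and trailing whitespace
        return s.trim();
    }

    public static void main(String[] args) {
        String raw = "Title\rcell1\u0007cell2\u0007\r\u0001";
        System.out.println(clean(raw));
    }
}
```

In the example above, this would be applied to the string returned by extractText before printing or indexing it.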
        
       3. pdfbox, for extracting PDF files
       Note that pdfbox does not yet handle Chinese well. First download pdfbox: http://www.matrix.org.cn/down_view.asp?id=12
    Here is an example of extracting a PDF file with pdfbox:
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.util.PDFTextStripper;
    import java.io.*;
    /**
    * <p>Title: pdf extraction</p>
    * <p>Description: email:chris@matrix.org.cn</p>
    * <p>Copyright: Matrix Copyright (c) 2003</p>
    * <p>Company: Matrix.org.cn</p>
    * @author chris
    * @version 1.0, please retain this notice if you use this example
    */

    public class PdfExtracter {

        public String GetTextFromPdf(String filename) throws Exception {
            PDDocument pdfdocument = null;
            FileInputStream is = new FileInputStream(filename);
            try {
                // Parse the file into pdfbox's document model
                PDFParser parser = new PDFParser(is);
                parser.parse();
                pdfdocument = parser.getPDDocument();
                // Write the extracted text through an explicit UTF-8 encoder
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                OutputStreamWriter writer = new OutputStreamWriter(out, "UTF-8");
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.writeText(pdfdocument.getDocument(), writer);
                writer.close();
                // Decode with the same charset that was used to encode
                String ts = new String(out.toByteArray(), "UTF-8");
                System.out.println("the string length is " + ts.length() + "\n");
                return ts;
            } finally {
                // Release the document and the file handle in every case
                if (pdfdocument != null) {
                    pdfdocument.close();
                }
                is.close();
            }
        }

        public static void main(String[] args) {
            PdfExtracter pf = new PdfExtracter();
            try {
                String ts = pf.GetTextFromPdf("c:\\a.pdf");
                System.out.println(ts);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
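
The example above sends the extracted text through a byte array on its way to a String. That step is lossless only when the charset can represent every character and the same charset is used in both directions, which matters for Chinese text. A minimal sketch of the effect using only the standard library; CharsetRoundTrip and roundTrip are hypothetical names introduced for illustration, and this does not address any limits inside pdfbox itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: encodes text to bytes and decodes it again with one charset. */
public class CharsetRoundTrip {

    public static String roundTrip(String text, Charset charset) {
        // Encode: characters the charset cannot represent are replaced
        byte[] bytes = text.getBytes(charset);
        // Decode with the same charset
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587";  // the two characters of the word for "Chinese"
        // UTF-8 can represent the text, so it survives the round trip
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese));
        // US-ASCII cannot, so the text comes back damaged
        System.out.println(roundTrip(chinese, StandardCharsets.US_ASCII).equals(chinese));
    }
}
```

A platform default charset that cannot represent Chinese would behave like the US-ASCII case here, so naming the charset explicitly on both the writer and the String constructor is the safer choice.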

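         4. Extracting PDF files that contain Chinese: xpdf
    xpdf is an open source project. We can call its native program from Java to extract Chinese PDF files.
    Download the xpdf package: http://www.matrix.org.cn/down_view.asp?id=15
    You also need to download the Chinese support patch: http://www.matrix.org.cn/down_view.asp?id=16
    After installing the Chinese patch as described in its readme, you can write the Java program that calls the native tool.
    Here is an example of how to call it: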
    import java.io.*;
    /**
    * <p>Title: pdf extraction</p>
    * <p>Description: email:chris@matrix.org.cn</p>
    * <p>Copyright: Matrix Copyright (c) 2003</p>
    * <p>Company: Matrix.org.cn</p>
    * @author chris
    * @version 1.0, please retain this notice if you use this example
    */

    public class PdfWin {
        public static void main(String[] args) throws Exception {
            String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
            String filename = "c:\\a.pdf";
            // "-enc UTF-8" sets the output encoding, "-q" suppresses messages,
            // and "-" as the output file sends the text to stdout
            String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-" };
            Process p = Runtime.getRuntime().exec(cmd);
            BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
            InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
            StringWriter out = new StringWriter();
            char[] buf = new char[10000];
            int len;
            // Append every chunk; buf alone only holds the most recent read
            while ((len = reader.read(buf)) >= 0) {
                out.write(buf, 0, len);
                System.out.println("the length is " + len);
            }
            reader.close();
            p.waitFor();
            String ts = out.toString();
            System.out.println("the str is " + ts);
        }
    }
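
Draining the child process's output chunk by chunk is the core step of the example above: every chunk has to be appended to an accumulator, because the read buffer only ever holds the latest chunk. A minimal sketch of that loop as a reusable helper; ProcessText, readAll and run are hypothetical names introduced here, and run assumes the command writes UTF-8 text to stdout, as pdftotext does when called with -enc UTF-8.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: collects all text a reader or a child process produces. */
public class ProcessText {

    /** Reads until end of stream, appending each chunk to the result. */
    public static String readAll(Reader reader) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int len;
            while ((len = reader.read(buf)) >= 0) {
                out.append(buf, 0, len);  // keep only the len chars just read
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs a command and returns its stdout decoded as UTF-8 (an assumption). */
    public static String run(String... cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        // Pass stderr through to the parent so it is not mixed into the text
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        try (Reader r = new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8)) {
            String text = readAll(r);
            p.waitFor();
            return text;
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll(new StringReader("chunk one, chunk two")));
    }
}
```

With the command array from the example above, the call would look like ProcessText.run(cmd).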
  • Original article: https://www.cnblogs.com/lee/p/770997.html