  • Processing PDFs in Java with PDFBox

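    At first I thought reading a PDF in Java would be as simple as reading a txt file. Too young, too simple! The output came out as garbled text!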

    Game Starts

    References

      1) http://pdfbox.apache.org/cookbook/documentcreation.html

    Dependency JARs

      1) pdfbox-app-1.8.6.jar http://pdfbox.apache.org/downloads.html#recent

    What's Up

    How does Lucene index a PDF? By converting it to txt first?

    Lucene Integration

    Document luceneDocument = LucenePDFDocument.getDocument( ... );

    Always Be Coding

    Create a blank PDF

    This small sample shows how to create a new PDF document using PDFBox.

    // Create a new empty document
    PDDocument document = new PDDocument();

    // Create a new blank page and add it to the document
    PDPage blankPage = new PDPage();
    document.addPage( blankPage );

    // Save the newly created document
    document.save("BlankPage.pdf");

    // finally make sure that the document is properly
    // closed.
    document.close();

    Hello World using a PDF base font

    This small sample shows how to create a new document and print the text "Hello World" using one of the PDF base fonts.

    // Create a document and add a page to it
    PDDocument document = new PDDocument();
    PDPage page = new PDPage();
    document.addPage( page );
    
    // Create a new font object selecting one of the PDF base fonts
    PDFont font = PDType1Font.HELVETICA_BOLD;
    
    // Start a new content stream which will "hold" the to be created content
    PDPageContentStream contentStream = new PDPageContentStream(document, page);
    
    // Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
    contentStream.beginText();
    contentStream.setFont( font, 12 );
    contentStream.moveTextPositionByAmount( 100, 700 ); // note the coordinates: (0,0) is the lower-left corner of the page
    contentStream.drawString( "Hello World" );
    contentStream.endText();
    
    // Make sure that the content stream is closed:
    contentStream.close();
    
    // Save the results and ensure that the document is properly closed:
    document.save( "Hello World.pdf");
    document.close();
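
    Because (0,0) is the lower-left corner of the page, y grows upward, which is easy to get wrong when thinking in "distance from the top". The sketch below shows the conversion. It assumes a US Letter page that is 792 points high; for a real document, read the actual height from the page's media box instead. PageCoords and yFromTop are hypothetical names used only for illustration and are not part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox: converts a distance measured
// from the top edge of the page into a PDF y coordinate.
final class PageCoords {

    // Height of a US Letter page in points (1 point = 1/72 inch).
    // Assumption: the page is Letter-sized; use the real media box height otherwise.
    static final float LETTER_HEIGHT = 792f;

    /** The PDF origin is the lower-left corner, so y is the page height minus the distance from the top. */
    static float yFromTop(float pageHeight, float distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args) {
        // A text baseline 92 points below the top edge of a Letter page.
        float y = yFromTop(LETTER_HEIGHT, 92f);
        System.out.println("y = " + y); // prints "y = 700.0"
    }
}
```

    Under that assumption, the y value of 700 used in the Hello World sample places the text baseline 92 points (about 1.3 inches) below the top edge of the page.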

    Read PDF

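    Below is my own attempt, based on code I found online; the official site has no concrete example.
    The whole process is simply: load the Document (the PDF), then write its text to a TXT file through an I/O stream.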

    package tools;

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PDFHandler {
        public static void readPDF(String pdfFile) {
            String txtFile = null;
            PDDocument doc = null;
            FileWriter writer = null;
            URL url = null;
            try {
                url = new URL(pdfFile);
            } catch (MalformedURLException e) {
                // Not a valid URL, so treat the argument as a file system path
                url = null;
            }

            try {
                if (url != null) { // URL handling
                    String fileName = url.getFile();
                    if (!fileName.endsWith(".pdf")) {
                        return; // check the extension before loading, so nothing is left open
                    }
                    // Derive the output name by swapping only the trailing ".pdf" for ".txt";
                    // getName() drops the URL path, so the file lands in the working directory
                    File outFile = new File(fileName.substring(0, fileName.length() - 4) + ".txt");
                    txtFile = outFile.getName();
                    doc = PDDocument.load(url); // load the document
                } else { // file system handling
                    if (!pdfFile.endsWith(".pdf")) {
                        return;
                    }
                    txtFile = pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
                    doc = PDDocument.load(pdfFile);
                }

                writer = new FileWriter(txtFile);
                PDFTextStripper textStripper = new PDFTextStripper(); // extracts the text of a PDF
                // false (the default) keeps the text in content-stream order;
                // true sorts it by position on the page, which costs extra work
                textStripper.setSortByPosition(false);
                textStripper.setStartPage(1); // first page to extract; defaults to the first page
                textStripper.setEndPage(2);   // last page to extract; defaults to the last page
                textStripper.writeText(doc, writer); // the key step: write the text to the txt file
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                if (doc != null) {
                    try {
                        doc.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                if (writer != null) {
                    try {
                        writer.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }

        public static void main(String[] args) {
            readPDF("resource/正则表达式.pdf");
        }
    }
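
    One possible source of the garbled text mentioned at the start is that FileWriter encodes with the platform default charset, which may not cover Chinese text. A minimal sketch of one way around it is to write through an OutputStreamWriter with an explicit UTF-8 charset. Utf8TextFile and its methods are hypothetical names, and the sample text is made up for illustration; the sketch only uses the JDK, so it says nothing about how PDFBox itself decodes a particular PDF.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper, not part of PDFBox.
final class Utf8TextFile {

    /** Writes text to the file as UTF-8, regardless of the platform default charset. */
    static void write(File file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    /** Reads the whole file back, decoding it as UTF-8. */
    static String read(File file) throws IOException {
        return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
    }

    /** Writes text to a temporary file, reads it back, and deletes the file. */
    static String roundTrip(String text) {
        try {
            File file = File.createTempFile("pdfbox-demo", ".txt");
            try {
                write(file, text);
                return read(file);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes for the Chinese words "正则表达式", followed by ASCII text
        String sample = "\u6b63\u5219\u8868\u8fbe\u5f0f Hello";
        System.out.println(sample.equals(roundTrip(sample))); // prints "true"
    }
}
```

    Since writeText takes a java.io.Writer, a writer built this way could be passed in place of the FileWriter in the listing above. Whether that removes the garbling in a given case depends on the PDF's fonts and encodings, so treat it as something to try rather than a guaranteed fix.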

     TO BE CONTINUED……

  • Original article: https://www.cnblogs.com/erbin/p/3893450.html