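At first I assumed that reading a PDF in Java would be as simple as reading a txt file. Too young, too simple! All I got was garbled text!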
Game Starts
Reference documentation
1) http://pdfbox.apache.org/cookbook/documentcreation.html
Required JAR
1) pdfbox-app-1.8.6.jar http://pdfbox.apache.org/downloads.html#recent
What's Up
How does Lucene index a PDF? Does the PDF have to be converted to txt first?
Lucene Integration
Document luceneDocument = LucenePDFDocument.getDocument( ... );
Always Be Coding
Create a blank PDF
This small sample shows how to create a new PDF document using PDFBox.
// Create a new empty document
PDDocument document = new PDDocument();

// Create a new blank page and add it to the document
PDPage blankPage = new PDPage();
document.addPage( blankPage );

// Save the newly created document
document.save("BlankPage.pdf");

// finally make sure that the document is properly
// closed.
document.close();
Hello World using a PDF base font
This small sample shows how to create a new document and print the text "Hello World" using one of the PDF base fonts.
// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );

// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;

// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);

// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 ); // Note the coordinates: (0,0) is the bottom-left corner of the page
contentStream.drawString( "Hello World" );
contentStream.endText();

// Make sure that the content stream is closed:
contentStream.close();

// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();
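Because (0,0) is the bottom-left corner, a position measured from the top of the page has to be flipped before it is passed to moveTextPositionByAmount. The sketch below shows that arithmetic only. PageCoordinates and yFromTop are hypothetical names, not part of PDFBox, and the 612 x 792 point US Letter size is an assumption about the page created by new PDPage().

```java
// Hypothetical helper, not part of PDFBox: converts a distance measured from
// the top edge of the page into the bottom-left-based y coordinate PDF uses.
final class PageCoordinates {

    // Assumed page size: US Letter, 612 x 792 points (1 point = 1/72 inch).
    static final float LETTER_WIDTH = 612f;
    static final float LETTER_HEIGHT = 792f;

    // y grows upward in PDF, so subtract the top offset from the page height.
    static float yFromTop(float pageHeight, float distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args) {
        // Text placed 92 points below the top edge of a Letter page
        float y = yFromTop(LETTER_HEIGHT, 92f);
        System.out.println("y = " + y); // prints "y = 700.0"
    }
}
```

Under these assumptions, the y value of 700 used in the Hello World sample sits 92 points below the top edge of the page.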
Read PDF
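The code below is my own attempt, based on code I found online; the official site has no concrete example for this.
The whole process is simply: load the Document (the PDF file), then write its text to a TXT file through an I/O stream.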
package tools;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFHandler {

    public static void readPDF(String pdfFile) {
        String txtFile = null;
        PDDocument doc = null;
        Writer writer = null;
        URL url = null;
        try {
            url = new URL(pdfFile);
        } catch (MalformedURLException e) {
            // Not a valid URL, so treat the argument as a file-system path
            url = null;
        }

        try {
            if (url != null) { // URL handling
                String fileName = url.getFile();
                if (!fileName.endsWith(".pdf")) {
                    return;
                }
                // Derive the name of the output file
                txtFile = new File(fileName.replace(".pdf", ".txt")).getName();
                doc = PDDocument.load(url); // load the document
            } else { // file-system handling
                if (!pdfFile.endsWith(".pdf")) {
                    return;
                }
                txtFile = pdfFile.replace(".pdf", ".txt");
                doc = PDDocument.load(pdfFile);
            }

            // Write with an explicit charset; FileWriter uses the platform default,
            // which can garble non-ASCII text
            writer = new OutputStreamWriter(new FileOutputStream(txtFile), "UTF-8");
            PDFTextStripper textStripper = new PDFTextStripper(); // extracts the PDF text
            // If true, text is sorted by its position on the page instead of its order
            // in the content stream; false is the default and is faster
            textStripper.setSortByPosition(false);
            textStripper.setStartPage(1); // first page to extract; defaults to the first page
            textStripper.setEndPage(2);   // last page to extract; defaults to the last page
            textStripper.writeText(doc, writer); // the key step: write the text to the txt file
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (doc != null) {
                try {
                    doc.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (writer != null) {
                try {
                    writer.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        readPDF("resource/正则表达式.pdf");
    }
}
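The two decisions readPDF makes before it touches PDFBox (is the argument a URL or a path, and what should the output file be called) can be tried out on their own. This is a minimal sketch using only the JDK. PdfPaths, isUrl and toTxtName are hypothetical names introduced here for illustration. Unlike the code above, toTxtName rewrites only the trailing extension and ignores case, which is a design choice of this sketch.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helpers for illustration; they do not depend on PDFBox.
final class PdfPaths {

    // Same test as in readPDF: if java.net.URL cannot parse the string,
    // treat it as a file-system path.
    static boolean isUrl(String location) {
        try {
            new URL(location);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // Returns the name with a trailing ".pdf" (any case) replaced by ".txt",
    // or null when the name does not end in ".pdf".
    static String toTxtName(String pdfName) {
        String suffix = ".pdf";
        int start = pdfName.length() - suffix.length();
        if (!pdfName.regionMatches(true, start, suffix, 0, suffix.length())) {
            return null;
        }
        return pdfName.substring(0, start) + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(isUrl("http://example.com/docs/manual.pdf"));
        System.out.println(isUrl("resource/manual.pdf"));
        System.out.println(toTxtName("resource/manual.pdf"));
        System.out.println(toTxtName("archive.pdf.old.PDF"));
    }
}
```

Checking the name before loading also means nothing has to be closed when the input is rejected.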