Converting PDF to Text in C#

Parsing PDF files in .NET using PDFBox and IKVM.NET (managed code).
Download source files - 82 kB [codeproject.com]

Download full project including all dependencies [squarepdf.net]

Update

April 20, 2015: The article and the Visual Studio project are updated and work with the latest PDFBox version (1.8.9). It's also possible to download the project with all dependencies (resolving the dependencies proved to be a bit tricky).

February 27, 2014: This article originally described parsing PDF files using PDFBox. It has been extended to include samples for IFilter and iTextSharp.

How to Parse PDF Files

There are three main ways to extract text from PDF files in .NET:

Microsoft IFilter interface and Adobe IFilter implementation.

iTextSharp

PDFBox

None of these PDF parsing solutions is perfect. We will discuss all these methods below.

1. Parsing PDF using Adobe PDF IFilter

To parse PDF files using the IFilter interface, you need the following:

Windows 2000 or later

Adobe Acrobat or Reader 7.0.5+ (or the standalone Adobe PDF IFilter [adobe.com])

IFilter COM wrapper class [dotlucene.net]

Sample code:

using IFilter;

// ...

public static string ExtractTextFromPdf(string path)
{
    // DefaultParser is provided by the IFilter COM wrapper class.
    return DefaultParser.Extract(path);
}

Download a sample project:

Parsing PDF Files using IFilter [squarepdf.net]

If you use the PDF IFilter that ships with Adobe Acrobat Reader, your application's executable must be named "filtdump.exe"; otherwise the IFilter interface returns the E_NOTIMPL error code. See more at Parsing PDF Files using IFilter [squarepdf.net].

Disadvantages:

It relies on COM interop to handle the IFilter interface, which is unreliable (the combination of the IFilter COM wrapper and Adobe PDF IFilter is especially troublesome).

It requires a separate installation of Adobe IFilter on the target system. This can be painful if you need to distribute your indexing solution to someone else.

Your application's executable has to be named "filtdump.exe" to work with the latest PDF IFilter implementation that comes with Acrobat Reader.
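The file name restriction can be checked at startup. The following is a minimal sketch, assuming the requirement is exactly as described above; the class and method names (IFilterHostCheck, IsExpectedProcessName, IsRunningAsFiltdump) are hypothetical and not part of the IFilter wrapper.

```csharp
using System;
using System.Diagnostics;

public static class IFilterHostCheck
{
    // Executable name (without extension) that the Adobe PDF IFilter
    // is described as requiring in the text above.
    private const string ExpectedProcessName = "filtdump";

    // Pure comparison, kept separate so it can be checked without a real process.
    public static bool IsExpectedProcessName(string processName)
    {
        return string.Equals(processName, ExpectedProcessName,
                             StringComparison.OrdinalIgnoreCase);
    }

    // True when the current executable is named filtdump.exe.
    public static bool IsRunningAsFiltdump()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            // Process.ProcessName does not include the ".exe" extension.
            return IsExpectedProcessName(current.ProcessName);
        }
    }
}
```

Calling IFilterHostCheck.IsRunningAsFiltdump() before extraction lets the application log a clear warning instead of failing later with E_NOTIMPL.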

2. Parsing PDF using iTextSharp

iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It focuses primarily on creating PDFs rather than reading them, but it also supports extracting text from PDF files.

Sample code:

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// ...

public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();

        // Page numbers in iTextSharp are 1-based.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }

        return text.ToString();
    }
}

Credit: Member 10364982

Download a sample project:

Parsing PDF Files using iTextSharp [squarepdf.net]

You may consider using LocationTextExtractionStrategy to get better precision.

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// ...

public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Create a fresh strategy for each page; a reused strategy
            // keeps the text of the pages processed before.
            ITextExtractionStrategy its = new LocationTextExtractionStrategy();
            string thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);

            // Split the page into lines and append them one by one.
            string[] theLines = thePage.Split('\n');
            foreach (var theLine in theLines)
            {
                text.AppendLine(theLine);
            }
        }

        return text.ToString();
    }
}

Credit: Member 10140900

Disadvantages of iTextSharp:

Licensing, if the AGPL license does not suit your project

3. Parsing PDF using PDFBox

PDFBox is another Java PDF library. It is also ready to be used with the original Java Lucene (see LucenePDFDocument).

Fortunately, there is a .NET version of PDFBox that is created using IKVM.NET (just download the PDFBox package).

Using PDFBox in .NET requires adding references to:

IKVM.OpenJDK.Core.dll

IKVM.OpenJDK.SwingAWT.dll

pdfbox-1.8.9.dll

and copying the following files to the bin directory:

commons-logging.dll

fontbox-1.8.9.dll

IKVM.OpenJDK.Text.dll

IKVM.OpenJDK.Util.dll

IKVM.Runtime.dll
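Because a missing assembly typically only surfaces at run time, it can help to verify the deployment up front. The sketch below assumes the file names listed above (the PDFBox 1.8.9 build); PdfBoxDependencyCheck and FindMissing are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PdfBoxDependencyCheck
{
    // File names taken from the two lists above (PDFBox 1.8.9 build).
    private static readonly string[] RequiredFiles =
    {
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.8.9.dll",
        "commons-logging.dll",
        "fontbox-1.8.9.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll"
    };

    // Returns the required files that are not present in the given directory.
    public static List<string> FindMissing(string directory)
    {
        var missing = new List<string>();
        foreach (string file in RequiredFiles)
        {
            if (!File.Exists(Path.Combine(directory, file)))
            {
                missing.Add(file);
            }
        }
        return missing;
    }
}
```

A typical call would pass AppDomain.CurrentDomain.BaseDirectory and report any returned names before the first PDF is parsed.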

Using PDFBox to parse PDFs is fairly easy:

using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// ...

private static string ExtractTextFromPdf(string path)
{
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load(path);
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
    finally
    {
        // Always close the document, even if extraction fails.
        if (doc != null)
        {
            doc.close();
        }
    }
}

Download a sample project:

How to convert PDF files to text in C# (.NET) [squarepdf.net]

How to convert PDF file to text in VB (.NET) [squarepdf.net]

The size of the required assemblies adds up to about 18 MB:

IKVM.OpenJDK.Core.dll (4 MB)

IKVM.OpenJDK.SwingAWT.dll (6 MB)

pdfbox-1.8.9.dll (4 MB)

commons-logging.dll (82 kB)

fontbox-1.8.9.dll (180 kB)

IKVM.OpenJDK.Text.dll (800 kB)

IKVM.OpenJDK.Util.dll (2 MB)

IKVM.Runtime.dll (1 MB)
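As a quick check of the total, the listed figures can simply be summed. This is only a sketch using the rounded sizes above, converting kB to MB at 1 MB = 1000 kB (an assumption); DependencySizes is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DependencySizes
{
    // Approximate sizes in MB, copied from the list above.
    public static readonly Dictionary<string, double> SizesInMb =
        new Dictionary<string, double>
        {
            { "IKVM.OpenJDK.Core.dll", 4.0 },
            { "IKVM.OpenJDK.SwingAWT.dll", 6.0 },
            { "pdfbox-1.8.9.dll", 4.0 },
            { "commons-logging.dll", 0.082 },
            { "fontbox-1.8.9.dll", 0.18 },
            { "IKVM.OpenJDK.Text.dll", 0.8 },
            { "IKVM.OpenJDK.Util.dll", 2.0 },
            { "IKVM.Runtime.dll", 1.0 }
        };

    // Sum of all listed sizes, in MB.
    public static double TotalMb()
    {
        return SizesInMb.Values.Sum();
    }
}
```

With these rounded figures, the sum comes to roughly 18.06 MB.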

The speed is acceptable: parsing the U.S. Copyright Act PDF (5.1 MB) took about 13 seconds.
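A measurement like this can be reproduced with a small timing helper. The sketch below is an assumption about how such a test might be set up, not the method used for the figure above; ExtractionTimer and TimeExtraction are hypothetical names, and the extractor is passed in as a delegate so that any of the ExtractTextFromPdf methods can be measured.

```csharp
using System;
using System.Diagnostics;

public static class ExtractionTimer
{
    // Runs the extractor once and reports the elapsed wall-clock time.
    public static string TimeExtraction(
        Func<string, string> extractor, string path, out TimeSpan elapsed)
    {
        Stopwatch watch = Stopwatch.StartNew();
        string text = extractor(path);
        watch.Stop();

        elapsed = watch.Elapsed;
        return text;
    }
}
```

The first call in a process also pays the IKVM.NET warm-up cost, so timing a second call separately gives a better idea of steady-state speed.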

Thanks to bobrien100 for improvement suggestions.

Disadvantages:

IKVM.NET Dependencies (18 MB)

Speed (especially the IKVM.NET warm-up time)

Related information

See this article (with future updates) at SquarePDF.NET.

History

April 20, 2015 - Updated to work with the latest PDFBox release (1.8.9)

November 27, 2014 - Updated to work with the latest PDFBox release (1.8.7)

March 10, 2014 - IFilter file name limitations added, iTextSharp sample extended

February 27, 2014 - Samples for IFilter and iTextSharp added.

February 24, 2014 - Updated to work with the latest PDFBox release (1.8.4)

June 20, 2012 - Updated to work with the latest PDFBox release (1.7.0)
Original article: https://www.cnblogs.com/Javi/p/9116779.html
