  • Exception when extracting PDF information with Tika

    org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:305)
    at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:398)
    at org.apache.pdfbox.util.PDFTextStripper.writeString(PDFTextStripper.java:866)
    at org.apache.pdfbox.util.PDFTextStripper.writeLine(PDFTextStripper.java:1896)
    at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:744)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:461)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
    

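    The error above was thrown while extracting information from a PDF with Apache Tika. According to the message, the extracted text exceeded the requested write limit of 100,000 characters.

    My code was as follows: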

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    Parser parser = new PDFParser();
    // The no-argument constructor applies the default write limit (100,000 characters).
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    // try-with-resources closes the stream, and avoids a NullPointerException
    // during cleanup if the file could not be opened.
    try (InputStream stream = new FileInputStream(new File("1.pdf"))) {
        parser.parse(stream, handler, metadata, new ParseContext());

        // Print every metadata field found in the document.
        for (String name : metadata.names()) {
            System.out.println(name + ":\t" + metadata.get(name));
        }
    } catch (IOException | SAXException | TikaException e) {
        e.printStackTrace();
    }
    

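      As for the character limit, I had probably left out a maximum in some constructor and ended up with the default of 100,000 characters. Looking over the code above, I noticed

    the constructor of BodyContentHandler: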
    org.apache.tika.sax.BodyContentHandler.BodyContentHandler(int writeLimit)
    

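      That looks relevant. I changed the constructor call to new BodyContentHandler(10*1024*1024). The limit counts characters of extracted text, so choose it according to the size of the PDF document; passing -1 disables the limit entirely.

    After rerunning the program, the PDF metadata was printed as follows: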
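
To show why a handler stops once its limit is reached, here is a minimal sketch of a write-limited SAX handler. It is not Tika's actual implementation: `WriteLimitSketch`, `LimitedTextHandler`, and `feed` are hypothetical names made up for this illustration, and only the `org.xml.sax` classes shipped with the JDK are used. The sketch assumes the same convention as the constructor above, where the limit is a character count and -1 means "no limit".

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical handler: collects character data up to writeLimit characters. */
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int writeLimit; // -1 means "no limit" in this sketch

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit == -1 || text.length() + length <= writeLimit) {
                text.append(ch, start, length);
                return;
            }
            // Keep the part that still fits, then signal that the limit was hit.
            text.append(ch, start, writeLimit - text.length());
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds the document to the handler in 4-character chunks; returns true if the limit was hit. */
    static boolean feed(LimitedTextHandler handler, String document) {
        char[] chars = document.toCharArray();
        try {
            for (int i = 0; i < chars.length; i += 4) {
                handler.characters(chars, i, Math.min(4, chars.length - i));
            }
            return false;
        } catch (SAXException e) {
            return true; // text up to the limit is still available from the handler
        }
    }

    public static void main(String[] args) {
        String document = "0123456789";

        LimitedTextHandler small = new LimitedTextHandler(6);
        System.out.println(feed(small, document) + " " + small);

        LimitedTextHandler unlimited = new LimitedTextHandler(-1);
        System.out.println(feed(unlimited, document) + " " + unlimited);
    }
}
```

Running `main` prints `true 012345` and then `false 0123456789`: with a limit of 6 the handler keeps the first six characters and aborts, which mirrors the "text up to the limit is however available" note in the exception message.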

      

    dc:subject:	
    meta:save-date:	2014-07-22T21:02:38Z
    subject:	PostgreSQL 9.3 Documentation
    Author:	The PostgreSQL Global Development Group
    dcterms:created:	2014-07-22T20:55:33Z
    date:	2014-07-22T21:02:38Z
    creator:	The PostgreSQL Global Development Group
    Creation-Date:	2014-07-22T20:55:33Z
    title:	PostgreSQL 9.3 Documentation
    trapped:	False
    meta:author:	The PostgreSQL Global Development Group
    created:	Wed Jul 23 04:55:33 CST 2014
    meta:keyword:	
    cp:subject:	PostgreSQL 9.3 Documentation
    dc:format:	application/pdf; version=1.4
    PTEX.Fullbanner:	This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012/Debian) kpathsea version 6.1.0
    xmp:CreatorTool:	LaTeX with hyperref package
    Keywords:	
    dc:title:	PostgreSQL 9.3 Documentation
    Last-Save-Date:	2014-07-22T21:02:38Z
    meta:creation-date:	2014-07-22T20:55:33Z
    dcterms:modified:	2014-07-22T21:02:38Z
    dc:creator:	The PostgreSQL Global Development Group
    pdf:PDFVersion:	1.4
    Last-Modified:	2014-07-22T21:02:38Z
    modified:	2014-07-22T21:02:38Z
    xmpTPg:NPages:	2861
    pdf:encrypted:	false
    producer:	pdfTeX-1.40.13; modified using iText® 5.1.3 ©2000-2011 1T3XT BVBA
    Content-Type:	application/pdf
    

      

  • Original article: https://www.cnblogs.com/likehua/p/4082830.html