  • [Apache Tika] Extracting file content with Apache Tika (compared with FileUtils)

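      Tika supports several functions:

          Document type detection, content extraction, metadata extraction, language detection

    Key features:

    • Unified parser interface: Tika wraps third-party parser libraries behind a single parser interface. This relieves users of the burden of choosing a suitable parser library for each file type they encounter.

    • Low memory usage: Tika consumes few memory resources, so it is easy to embed in Java applications. It can also run in applications on resource-constrained platforms such as mobile PDAs.

    • Fast processing: content detection and extraction from within an application can be expected to be quick.

    • Flexible metadata: Tika understands all the metadata models used to describe files.

    • Parser integration: Tika can use the various parser libraries available for each file type within a single application.

    • MIME type detection: Tika can detect, and extract content from, all the media types included in the MIME standards.

    • Language detection: Tika includes a language identification feature, so it can be used to process documents by language type on a multilingual website.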

    Content extraction using the Parser interface

    CompositeParser

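      The class diagram shows Tika's general-purpose parser classes, CompositeParser and AutoDetectParser. Because CompositeParser follows the composite design pattern, a group of parser instances can be used as a single parser. CompositeParser also gives access to all the classes that implement the Parser interface.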
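    The composite pattern described above can be illustrated without Tika at all. The following is a minimal sketch using only the JDK; `SimpleParser`, `SimpleCompositeParser`, and the media type `text/upper` are hypothetical names invented for this illustration and are not part of Tika's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical one-method parser interface, standing in for the idea of Tika's Parser.
interface SimpleParser {
    String parse(String mediaType, String input);
}

// The composite: it holds several parsers and is itself a SimpleParser,
// so callers can treat the whole group as one parser.
class SimpleCompositeParser implements SimpleParser {
    private final Map<String, SimpleParser> byType = new LinkedHashMap<>();

    void register(String mediaType, SimpleParser parser) {
        byType.put(mediaType, parser);
    }

    @Override
    public String parse(String mediaType, String input) {
        SimpleParser delegate = byType.get(mediaType);
        if (delegate == null) {
            throw new IllegalArgumentException("No parser for " + mediaType);
        }
        // Delegate to the parser registered for this media type.
        return delegate.parse(mediaType, input);
    }
}

class CompositeDemo {
    static SimpleCompositeParser build() {
        SimpleCompositeParser composite = new SimpleCompositeParser();
        composite.register("text/plain", (type, in) -> in.trim());
        composite.register("text/upper", (type, in) -> in.toUpperCase()); // made-up type
        return composite;
    }

    public static void main(String[] args) {
        SimpleParser parser = build();
        System.out.println(parser.parse("text/plain", "  hello  "));
    }
}
```

    The caller only ever sees one `SimpleParser`; which concrete parser handles the input is decided inside the composite.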

    AutoDetectParser

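       This is a subclass of CompositeParser that adds automatic type detection. With this feature, AutoDetectParser automatically dispatches each incoming document to the appropriate parser class using the composite approach.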

     The parse() method

      Besides parseToString(), you can also use the parse() method of the Parser interface. Its prototype is shown below.

    void parse(
    InputStream stream, 
    ContentHandler handler, 
    Metadata metadata, 
    ParseContext context) 
    throws IOException, SAXException, TikaException

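    A brief explanation of the method parameters:

    stream: an InputStream instance created from the document to be parsed
    handler: a ContentHandler object that receives the sequence of XHTML SAX events parsed from the input document; it is responsible for processing the events and exporting the result in a particular form
    metadata: a Metadata object used to pass metadata properties into and out of the parser
    context: a ParseContext instance carrying context-specific information, used to customize the parsing process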
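    To make the role of the handler parameter more concrete, here is a small JDK-only sketch of a SAX handler that receives XHTML events and keeps only the text inside the body element. `BodyTextHandler` is a hypothetical class written for this illustration; it only mimics the general idea of a body-text handler and is not how Tika's own handlers are implemented.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects character data that appears between <body> and </body>.
class BodyTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            inBody = false;
        } else if (inBody && "p".equals(qName)) {
            text.append('\n'); // end each paragraph with a line break
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inBody) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Parses an XHTML string with the JDK SAX parser and returns the body text.
    static String extract(String xhtml) {
        try {
            BodyTextHandler handler = new BodyTextHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(extract(
                "<html><head><title>T</title></head><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```

    The title text is dropped because it arrives outside the body element, which is the same reason a body-oriented handler returns only the document text.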


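    If reading from the input stream fails, parse() throws an IOException; if the document taken from the stream cannot be parsed, it throws a TikaException; and if the handler cannot process an event, it throws a SAXException.

    When parsing documents, Tika tries to reuse existing parser libraries such as Apache POI or PDFBox as much as possible. As a result, most Parser implementation classes are just adapters to these external libraries. Below, we look at how to use the handler and metadata parameters to extract the content and metadata of a document. For convenience, we can use Tika's facade class to call the parser API.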

    0. Maven dependency for Tika:

            <!-- Tika parsers for extracting text content -->
            <dependency>
                <groupId>org.apache.tika</groupId>
                <artifactId>tika-parsers</artifactId>
                <version>1.17</version>
            </dependency>

    1. Basic usage

    1.1 Detecting the file type

      Tika supports all the Internet media types defined in MIME.

        /**
         * Detecting the file type
         */
        public static void test1(){
            File file = new File("G:/tikatest/test.mp4");
    
            Tika tika = new Tika();
            String filetype = null;
            try {
                filetype = tika.detect(file);
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println(filetype);
        }

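    Result:

    video/mp4

      If we remove the extension and rename the file to test, the same result is detected, because Tika detects the type from both the file extension and the file content.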
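    Content-based detection generally works by looking for well-known "magic" byte sequences at the start of a file. The sketch below shows the idea with the JDK only. `MagicSniffer` is a hypothetical helper, it knows just four signatures, and mapping every `ftyp` box to video/mp4 is a deliberate simplification; Tika's real detector relies on a much larger set of rules.

```java
import java.nio.charset.StandardCharsets;

class MagicSniffer {
    // True if data contains the magic bytes starting at the given offset.
    static boolean startsWith(byte[] data, int offset, byte[] magic) {
        if (data.length < offset + magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[offset + i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Guesses a media type from the first bytes of a file, ignoring its name.
    static String detect(byte[] head) {
        if (startsWith(head, 0, ascii("%PDF-"))) {
            return "application/pdf";
        }
        if (startsWith(head, 0, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) {
            return "image/jpeg";
        }
        if (startsWith(head, 0, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        // MP4-family files carry the ASCII tag "ftyp" at byte offset 4.
        if (startsWith(head, 4, ascii("ftyp"))) {
            return "video/mp4";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] mp4Head = {0, 0, 0, 0x18, 'f', 't', 'y', 'p', 'm', 'p', '4', '2'};
        System.out.println(detect(mp4Head));
    }
}
```

    Because only the bytes are inspected, renaming the file has no effect on the result, which matches the behaviour observed above.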

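    1.2 Extracting the text content of a txt file

    To parse a file, the parseToString() method of the Tika facade class is generally used.

        /**
         * Reading the content of a txt file
         */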
        public static void test2(){
            File file = new File("G:/tikatest/test.txt");
    
            Tika tika = new Tika();
            String filecontent = null;
            try {
                filecontent = tika.parseToString(file);
            } catch (IOException e) {
                e.printStackTrace();
            } catch (TikaException e) {
                e.printStackTrace();
            }
            System.out.println("Extracted Content: " + filecontent);
        }

    Result:

    Extracted Content: 111
    222
    333
    444
    555
    666

    Supplement: the equivalent implementation with FileUtils (from the commons-io package)

        public static void test3(){
            File file = new File("G:/tikatest/test.txt");
            String s = null;
            try {
                s = FileUtils.readFileToString(file);
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println(s);
        }

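     1.3 Extracting metadata

       Metadata is the additional information supplied with a file. For an audio file, for example, the artist name, album name, and title are all metadata.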

        public static void test4(){
            File file=new File("G:/tikatest/test.mp4");
    
            Parser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
    
            // try-with-resources closes the stream; a missing file is reported
            // as an IOException instead of passing a null stream to parse()
            try (FileInputStream inputstream = new FileInputStream(file)) {
                parser.parse(inputstream, handler, metadata, context);
            } catch (IOException e) {
                e.printStackTrace();
            } catch (SAXException e) {
                e.printStackTrace();
            } catch (TikaException e) {
                e.printStackTrace();
            }
            System.out.println(handler.toString());
    
            //getting the list of all meta data elements
            String[] metadataNames = metadata.names();
    
            for(String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }
        }

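     Result (the output reports Content-Type: image/jpeg and EXIF fields, so it appears to come from a JPEG photo rather than from test.mp4):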

    Software: OnePlus3-user 7.1.1 NMF26F 76 dev-keys
    GPS Altitude Ref: Unknown (2)
    Metering Mode: Center weighted average
    Model: ONEPLUS A3010
    meta:save-date: 2017-09-02T16:32:15
    File Name: apache-tika-4154811460990247864.tmp
    Exposure Mode: Auto exposure
    Exif Version: 2.20
    Sensing Method: One-chip color area sensor
    tiff:ImageLength: 540
    exif:Flash: false
    Creation-Date: 2017-09-02T16:32:15
    Interoperability Version: 1.00
    ISO Speed Ratings: 640
    X Resolution: 72 dots per inch
    Shutter Speed Value: 1/20 sec
    tiff:ImageWidth: 720
    Thumbnail Width Pixels: 0
    tiff:XResolution: 72.0
    Image Width: 720 pixels
    Last-Save-Date: 2017-09-02T16:32:15
    exif:FNumber: 2.0
    Number of Tables: 4 Huffman tables
    F-Number: f/2.0
    Color Space: sRGB
    meta:creation-date: 2017-09-02T16:32:15
    Resolution Units: inch
    Data Precision: 8 bits
    File Modified Date: 星期二 十月 16 22:15:54 +08:00 2018
    tiff:BitsPerSample: 8
    Last-Modified: 2017-09-02T16:32:15
    tiff:YResolution: 72.0
    YCbCr Positioning: Center of pixel array
    Compression Type: Baseline
    Components Configuration: YCbCr
    exif:IsoSpeedRatings: 640
    X-Parsed-By: org.apache.tika.parser.DefaultParser
    Focal Length 35: 28 mm
    modified: 2017-09-02T16:32:15
    Brightness Value: 0
    Thumbnail Offset: 874 bytes
    Exif Image Height: 3480 pixels
    Focal Length: 4.3 mm
    Thumbnail Length: 14211 bytes
    White Balance Mode: Auto white balance
    Content-Type: image/jpeg
    Make: OnePlus
    tiff:Make: OnePlus
    Date/Time Original: 2017:09:02 08:32:15
    Scene Capture Type: Standard
    Exif Image Width: 4640 pixels
    Makernote: [26 values]
    dcterms:created: 2017-09-02T16:32:15
    exif:ExposureTime: 0.05
    date: 2017-09-02T16:32:15
    Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
    Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
    Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
    tiff:ResolutionUnit: Inch
    Interoperability Index: Recommended Exif Interoperability Rules (ExifR98)
    Flash: Flash did not fire, auto
    Date/Time Digitized: 2017:09:02 08:32:15
    File Size: 50158 bytes
    Thumbnail Height Pixels: 0
    Resolution Unit: Inch
    Sub-Sec Time Original: 994455
    XMP Value Count: 4
    tiff:Software: OnePlus3-user 7.1.1 NMF26F 76 dev-keys
    Aperture Value: f/2.0
    Number of Components: 3
    dcterms:modified: 2017-09-02T16:32:15
    tiff:Model: ONEPLUS A3010
    Image Height: 540 pixels
    Sub-Sec Time Digitized: 994455
    Sub-Sec Time: 994455
    Scene Type: Directly photographed image
    Exposure Time: 0.05 sec
    exif:DateTimeOriginal: 2017-09-02T16:32:15
    exif:FocalLength: 4.26
    Compression: JPEG (old-style)
    FlashPix Version: 1.00
    Date/Time: 2017:09:02 08:32:15
    Exposure Program: Unknown (0)
    Y Resolution: 72 dots per inch

    1.4 Language detection

    Tika can detect 18 languages. The example below identifies the language of a text file:

        public static void test6(){
            //Instantiating a file object
            File file = new File("G:/tikatest/test.txt");
    
            //Parser method parameters
            Parser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
    
            //Parsing the given document; try-with-resources closes the stream
            //and a missing file is reported as an IOException
            try (FileInputStream content = new FileInputStream(file)) {
                parser.parse(content, handler, metadata, new ParseContext());
            } catch (IOException e) {
                e.printStackTrace();
            } catch (SAXException e) {
                e.printStackTrace();
            } catch (TikaException e) {
                e.printStackTrace();
            }
    
            //Identifying the language of the extracted text
            LanguageIdentifier object = new LanguageIdentifier(handler.toString());
            System.out.println("Language name :" + object.getLanguage());
        }

    Result:

    Language name :lt

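     1.5 Extracting PDF content

       Tika is powerful enough to extract even the links and small punctuation marks inside a PDF. Both the content and the metadata of the PDF can be obtained.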

        public static void test7() throws IOException, TikaException, SAXException {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/4.pdf"));
            ParseContext pcontext = new ParseContext();
    
            //parsing the document using PDF parser
            PDFParser pdfparser = new PDFParser();
            pdfparser.parse(inputstream, handler, metadata,pcontext);
    
            //getting the content of the document
            System.out.println("Contents of the PDF :" + handler.toString());
    
            //getting metadata of the document
            System.out.println("Metadata of the PDF:");
            String[] metadataNames = metadata.names();
    
            for(String name : metadataNames) {
                System.out.println(name+ " : " + metadata.get(name));
            }
        }

    Result:

    Contents of the PDF :
    个人简历 
    ...............................
    
    
    Metadata of the PDF:
    access_permission:extract_for_accessibility : true
    pdf:docinfo:title : 个人简历
    meta:save-date : 2018-06-12T07:41:54Z
    pdf:docinfo:modified : 2018-06-12T07:41:54Z
    dcterms:created : 2018-06-12T07:41:54Z
    Author : liqiang qiao
    date : 2018-06-12T07:41:54Z
    access_permission:can_modify : true
    access_permission:modify_annotations : true
    creator : liqiang qiao
    Creation-Date : 2018-06-12T07:41:54Z
    title : 个人简历
    meta:author : liqiang qiao
    access_permission:fill_in_form : true
    created : Tue Jun 12 15:41:54 CST 2018
    pdf:docinfo:producer : Microsoft® Word 2013
    dc:format : application/pdf; version=1.5
    access_permission:can_print : true
    pdf:docinfo:created : 2018-06-12T07:41:54Z
    xmp:CreatorTool : Microsoft® Word 2013
    Last-Save-Date : 2018-06-12T07:41:54Z
    dc:title : 个人简历
    access_permission:assemble_document : true
    dcterms:modified : 2018-06-12T07:41:54Z
    meta:creation-date : 2018-06-12T07:41:54Z
    pdf:docinfo:creator : liqiang qiao
    dc:creator : liqiang qiao
    pdf:PDFVersion : 1.5
    Last-Modified : 2018-06-12T07:41:54Z
    modified : 2018-06-12T07:41:54Z
    xmpTPg:NPages : 2
    access_permission:can_print_degraded : true
    pdf:encrypted : false
    access_permission:extract_content : true
    producer : Microsoft® Word 2013
    pdf:docinfo:creator_tool : Microsoft® Word 2013
    Content-Type : application/pdf

    1.6 Extracting MS Office documents (reading Word and Excel)

        Extract the content and metadata from Microsoft Office documents.

        public static void test8() throws TikaException, SAXException, IOException {
            //handler and metadata objects that receive the parse results
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/test.docx"));
            ParseContext pcontext = new ParseContext();
    
            //OOXml parser
            OOXMLParser msofficeparser = new OOXMLParser ();
            msofficeparser.parse(inputstream, handler, metadata,pcontext);
            System.out.println("Contents of the document:" + handler.toString());
            System.out.println("Metadata of the document:");
            String[] metadataNames = metadata.names();
    
            for(String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }
        }

     Result:

    Contents of the document: -Xms5200M -Xmx5200M -XX:PermSize=512M -XX:MaxPermSize=512M


    http_load使用教程: https://www.cnblogs.com/shijingjing07/p/6539179.html
    1.默认配置;
    内存

    线程数量:


    1.只修改JVM参数
    内存

    2.并发


    2.修改JVM与并发
    JVM

    并发


    Metadata of the document:
    cp:revision: 19
    meta:last-author: liqiang qiao
    Last-Author: liqiang qiao
    meta:save-date: 2017-12-14T10:25:00Z
    Application-Name: Microsoft Office Word
    Author: liqiang qiao
    dcterms:created: 2017-12-14T09:28:00Z
    Application-Version: 15.0000
    Character-Count-With-Spaces: 195
    date: 2017-12-14T10:25:00Z
    Total-Time: 57
    extended-properties:Template: Normal.dotm
    meta:line-count: 1
    creator: liqiang qiao
    publisher:
    Word-Count: 29
    meta:paragraph-count: 1
    Creation-Date: 2017-12-14T09:28:00Z
    extended-properties:AppVersion: 15.0000
    meta:author: liqiang qiao
    Line-Count: 1
    extended-properties:Application: Microsoft Office Word
    Paragraph-Count: 1
    Last-Save-Date: 2017-12-14T10:25:00Z
    Revision-Number: 19
    dcterms:modified: 2017-12-14T10:25:00Z
    meta:creation-date: 2017-12-14T09:28:00Z
    Template: Normal.dotm
    Page-Count: 1
    meta:character-count: 167
    dc:creator: liqiang qiao
    meta:word-count: 29
    Last-Modified: 2017-12-14T10:25:00Z
    extended-properties:Company:
    modified: 2017-12-14T10:25:00Z
    xmpTPg:NPages: 1
    extended-properties:TotalTime: 57
    dc:publisher:
    Character Count: 167
    meta:page-count: 1
    meta:character-count-with-spaces: 195
    Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document

    Supplement: reading the content of an Excel file with Tika

    For example, reading the content of an Excel file:

        public static void test8() throws TikaException, SAXException, IOException {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/user.xlsx"));
            ParseContext pcontext = new ParseContext();
    
            OOXMLParser msofficeparser = new OOXMLParser ();
            msofficeparser.parse(inputstream, handler, metadata,pcontext);
            System.out.println("Contents of the document:" + handler.toString());
        } 

    Result:

    Contents of the document:Sheet1
    序号 用户名字 用户电话 用户邮箱 用户账户 用户类型 密码
    1 rrrrrr 15888585954 954318308@qq.com root111 管理员 111222
    2 001 15898569856 qiao_liqiang@163.com 001 普通用户 111222
    3 超级管理员 15898569856 5555@qq.com root8 管理员 111222
    4 qqq 15898569856 qiao_liqiang@163.com 1231 普通用户 111222
    5 张三 18558458569 33335658@qq.com 333 普通用户 111222
    6 李四 15898569856 qiao_liqiang@163.com 4444 普通用户 111222
    7 超级管理员 15898569856 5555@qq.com root5 管理员 111222
    8 张三 18434391711 qiao_liqiang@163.com root7 管理员 111222
    9 张三 18434391711 qiao_liqiang@163.com root3 管理员 111222
    10 超管 15898569856 qiao_liqiang@163.com root 管理员 111222
    11 8888 15898569856 qiao_liqiang@163.com 8888 普通用户 111222
    12 超级管理员 15888585954 954318308@qq.com roo6 管理员 111222
    13 张三 18434391711 qiao_liqiang@163.com root4 管理员 111222

    1.7 Extracting the content of a txt document

        public static void test8() throws TikaException, SAXException, IOException {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/test.txt"));
            ParseContext pcontext = new ParseContext();
    
            TXTParser txtparser = new TXTParser();
            txtparser.parse(inputstream, handler, metadata,pcontext);
            System.out.println("Contents of the document:" + handler.toString());
            System.out.println("Metadata of the document:");
            String[] metadataNames = metadata.names();
    
            for(String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }
    
        }

    1.8 Extracting HTML

      What is returned is the parsed HTML, that is, its text; if you need the source markup, you can use IOUtils.

        public static void test8() throws TikaException, SAXException, IOException {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/index.html"));
            ParseContext pcontext = new ParseContext();
    
            HtmlParser htmlparser = new HtmlParser();
            htmlparser.parse(inputstream, handler, metadata,pcontext);
            System.out.println("Contents of the document:" + handler.toString());
            System.out.println("Metadata of the document:");
            String[] metadataNames = metadata.names();
    
            for(String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }
        }

    The HTML file is the nginx welcome page (its source is printed further below).

    Result:

    Contents of the document:
    Welcome to nginx!

    If you see this page, the nginx web server is successfully installed and
    working. Further configuration is required.


    For online documentation and support please refer to
    nginx.org.

    Commercial support is available at
    nginx.com.


    Thank you for using nginx.


    Metadata of the document:
    title: Welcome to nginx!
    Content-Encoding: ISO-8859-1
    Content-Type: text/html; charset=ISO-8859-1
    dc:title: Welcome to nginx!

    Supplement: reading the source with FileUtils

        public static void test8() throws TikaException, SAXException, IOException {
            String s = FileUtils.readFileToString(new File("G:/tikatest/index.html"));
            System.out.println(s);
        }

    Result:

    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
    <style>
        body {
            width: 35em;
            margin: 0 auto;
            font-family: Tahoma, Verdana, Arial, sans-serif;
        }
    </style>
    </head>
    <body>
    <h1>Welcome to nginx!</h1>
    <p>If you see this page, the nginx web server is successfully installed and
    working. Further configuration is required.</p>
    
    <p>For online documentation and support please refer to
    <a href="http://nginx.org/">nginx.org</a>.<br/>
    Commercial support is available at
    <a href="http://nginx.com/">nginx.com</a>.</p>
    
    <p><em>Thank you for using nginx.</em></p>
    </body>
    </html>

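    1.9 Parsing class files (can serve as a simple decompiler)

    Viewing the content of the class file with a decompiler:

    Extracting the class content with Tika (this yields a summary of the class's methods):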

        public static void test8() throws TikaException, SAXException, IOException {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/UUIDUtil.class"));
            ParseContext pcontext = new ParseContext();
    
            ClassParser parser = new ClassParser();
            parser.parse(inputstream, handler, metadata,pcontext);
            System.out.println("Contents of the document:" + handler.toString());
            System.out.println("Metadata of the document:");
            String[] metadataNames = metadata.names();
    
            for(String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }
        }

     Result:

    Contents of the document:package cn.xm.jwxt.utils;
    public synchronized class UUIDUtil {
    public void UUIDUtil();
    public static String getUUID();
    public static String getUUID2();
    }


    Metadata of the document:
    title: UUIDUtil
    resourceName: UUIDUtil.class
    dc:title: UUIDUtil

      

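    1.10 Parsing jar files

       Tika can extract summary information and metadata for the class files inside a jar. This can be used to list all the class information in a file, or to write a utility that checks whether a given class is in a given jar file.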

        public static void test8() throws TikaException, SAXException, IOException {
            BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
            Metadata metadata = new Metadata();
            FileInputStream inputstream = new FileInputStream(new File("G:/tikatest/t.jar"));
            ParseContext pcontext = new ParseContext();
    
            PackageParser parser = new PackageParser ();
            parser.parse(inputstream, handler, metadata,pcontext);
            System.out.println("Contents of the document:" + handler.toString());
            System.out.println("Metadata of the document:");
            String[] metadataNames = metadata.names();
    
            for(String name : metadataNames) {
                System.out.println(name + ": " + metadata.get(name));
            }
        }

    Result:

    ..........................

    org/apache/tika/utils/ServiceLoaderUtils.class
    package org.apache.tika.utils;
    public synchronized class ServiceLoaderUtils {
    public void ServiceLoaderUtils();
    public static void sortLoadedClasses(java.util.List);
    public static Object newInstance(String);
    public static Object newInstance(String, ClassLoader);
    }


    org/apache/tika/utils/XMLReaderUtils$1.class
    package org.apache.tika.utils;
    final synchronized class XMLReaderUtils$1 implements org.xml.sax.EntityResolver {
    void XMLReaderUtils$1();
    public org.xml.sax.InputSource resolveEntity(String, String) throws org.xml.sax.SAXException, java.io.IOException;
    }


    org/apache/tika/utils/XMLReaderUtils$2.class
    package org.apache.tika.utils;
    final synchronized class XMLReaderUtils$2 implements javax.xml.stream.XMLResolver {
    void XMLReaderUtils$2();
    public Object resolveEntity(String, String, String, String) throws javax.xml.stream.XMLStreamException;
    }


    org/apache/tika/utils/XMLReaderUtils.class
    package org.apache.tika.utils;
    public synchronized class XMLReaderUtils {
    private static final java.util.logging.Logger LOG;
    private static final org.xml.sax.EntityResolver IGNORING_SAX_ENTITY_RESOLVER;
    private static final javax.xml.stream.XMLResolver IGNORING_STAX_ENTITY_RESOLVER;
    public void XMLReaderUtils();
    public static org.xml.sax.XMLReader getXMLReader() throws org.apache.tika.exception.TikaException;
    public static javax.xml.parsers.SAXParser getSAXParser() throws org.apache.tika.exception.TikaException;
    public static javax.xml.parsers.SAXParserFactory getSAXParserFactory();
    public static javax.xml.parsers.DocumentBuilderFactory getDocumentBuilderFactory();
    public static javax.xml.parsers.DocumentBuilder getDocumentBuilder() throws org.apache.tika.exception.TikaException;
    public static javax.xml.stream.XMLInputFactory getXMLInputFactory();
    private static void trySetSAXFeature(javax.xml.parsers.DocumentBuilderFactory, String, boolean);
    private static void tryToSetStaxProperty(javax.xml.stream.XMLInputFactory, String, boolean);
    public static javax.xml.transform.Transformer getTransformer() throws org.apache.tika.exception.TikaException;
    static void <clinit>();
    }


    org/apache/tika/utils/package-info.class
    package org.apache.tika.utils;
    abstract interface package-info {
    }


    Metadata of the document:
    Content-Type: application/zip
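    If the goal is only the utility mentioned at the start of this section (checking whether a class is in a jar), the entry names are enough and the class files do not have to be parsed. The following JDK-only sketch shows one way to do it; `JarClassFinder` and its demo helper are hypothetical names for this illustration, and the demo archive contains empty placeholder entries rather than real class files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class JarClassFinder {
    // Lists the fully qualified names of all .class entries in the archive.
    static List<String> listClasses(Path jar) {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.endsWith(".class")) {
                    String withoutSuffix = name.substring(0, name.length() - ".class".length());
                    names.add(withoutSuffix.replace('/', '.'));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    static boolean containsClass(Path jar, String className) {
        return listClasses(jar).contains(className);
    }

    // Demo helper: writes a tiny archive with empty placeholder entries.
    static Path createDemoJar() {
        try {
            Path jar = Files.createTempFile("demo", ".jar");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(jar))) {
                out.putNextEntry(new ZipEntry("org/example/Foo.class"));
                out.closeEntry();
                out.putNextEntry(new ZipEntry("META-INF/MANIFEST.MF"));
                out.closeEntry();
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path jar = createDemoJar();
        System.out.println(listClasses(jar));
        System.out.println(containsClass(jar, "org.example.Foo"));
    }
}
```

    Parsing the jar with Tika remains the better choice when the method summaries shown above are needed, since entry names alone say nothing about a class's members.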

    Supplement: if no write limit is passed to BodyContentHandler, parsing fails with the following error:

    Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
        at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
        at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)

    The fix is to pass a larger write limit (a character count) to the handler; passing -1 disables the limit:

    BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
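    The error message above notes that the text up to the limit is still available. A write limit of this kind can be sketched with a plain SAX handler from the JDK, as below. `LimitedTextHandler` and its `offer` helper are hypothetical names for this illustration; the sketch shows the general mechanism only and is not Tika's actual implementation.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int limit; // maximum number of characters to keep; -1 means unlimited

    LimitedTextHandler(int limit) {
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (limit < 0 || text.length() + length <= limit) {
            text.append(ch, start, length);
            return;
        }
        // Keep the part that still fits, then signal that the limit was reached.
        text.append(ch, start, limit - text.length());
        throw new SAXException("write limit of " + limit + " characters reached");
    }

    // Feeds a string as one characters() event; returns false once the limit is hit.
    boolean offer(String s) {
        try {
            characters(s.toCharArray(), 0, s.length());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    public static void main(String[] args) {
        LimitedTextHandler handler = new LimitedTextHandler(5);
        handler.offer("abc");
        handler.offer("defg"); // exceeds the limit; only "de" is kept
        System.out.println(handler);
    }
}
```

    A generous limit protects against running out of memory on very large documents, so disabling it entirely is best reserved for inputs whose size is known.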

    1.11 Extracting image information:

        public static void test8() throws TikaException, SAXException, IOException {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            FileInputStream inputstream = new FileInputStream(new File("g:/tikatest/5.jpeg"));
            ParseContext pcontext = new ParseContext();
            
            JpegParser jpegParser = new JpegParser();
            jpegParser.parse(inputstream, handler, metadata,pcontext);
            System.out.println("Contents of the document:" + handler.toString());
            System.out.println("Metadata of the document:");
            String[] metadataNames = metadata.names();
            
            for(String name : metadataNames) {                 
               System.out.println(name + ": " + metadata.get(name));
            }
        }

    Result:

    Contents of the document:
    Metadata of the document:
    Number of Tables: 4 Huffman tables
    Number of Components: 3
    Image Height: 192 pixels
    Resolution Units: inch
    File Name: apache-tika-7234240523307196989.tmp
    Data Precision: 8 bits
    File Modified Date: 星期三 十月 17 21:43:39 +08:00 2018
    tiff:BitsPerSample: 8
    Compression Type: Baseline
    Component 1: Y component: Quantization table 0, Sampling factors 2 horiz/2 vert
    Component 2: Cb component: Quantization table 1, Sampling factors 1 horiz/1 vert
    tiff:ImageLength: 192
    Component 3: Cr component: Quantization table 1, Sampling factors 1 horiz/1 vert
    X Resolution: 96 dots
    File Size: 9216 bytes
    tiff:ImageWidth: 256
    Thumbnail Height Pixels: 0
    Thumbnail Width Pixels: 0
    Image Width: 256 pixels
    Y Resolution: 96 dots

    1.12 Extracting MP4 information

        public static void test8() throws TikaException, SAXException, IOException {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            FileInputStream inputstream = new FileInputStream(new File("g:/tikatest/test.mp4"));
            ParseContext pcontext = new ParseContext();
            
            MP4Parser mp4Parser = new MP4Parser();
            mp4Parser.parse(inputstream, handler, metadata,pcontext);
            System.out.println("Contents of the document:" + handler.toString());
            System.out.println("Metadata of the document:");
            String[] metadataNames = metadata.names();
            
            for(String name : metadataNames) {                 
               System.out.println(name + ": " + metadata.get(name));
            }
        }

    Result:

    Contents of the document:
    Metadata of the document:
    dcterms:modified: 2017-07-20T10:25:23Z
    xmpDM:duration: 39.5
    meta:creation-date: 2017-07-20T10:25:23Z
    meta:save-date: 2017-07-20T10:25:23Z
    Last-Modified: 2017-07-20T10:25:23Z
    dcterms:created: 2017-07-20T10:25:23Z
    xmpDM:audioSampleRate: 10000
    date: 2017-07-20T10:25:23Z
    tiff:ImageLength: 578
    modified: 2017-07-20T10:25:23Z
    Creation-Date: 2017-07-20T10:25:23Z
    tiff:ImageWidth: 442
    Content-Type: video/mp4
    Last-Save-Date: 2017-07-20T10:25:23Z

    Supplement: when the content being read is too large, the write limit must be set as follows (for reading large files):

    BodyContentHandler handler = new BodyContentHandler(10*1024*1024);

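     Summary:

       That covers the basic usage of Apache Tika. In the usage shown here, Tika does not extract the images embedded in Word, PDF, and similar files, but it can parse their text, and the content of most common file types can be extracted. This is useful in some scenarios: for example, a file server can extract the content and save it to a database or a file, then use Solr or database queries for fuzzy search.

      When Tika processes HTML, Office, and similar files, it extracts the text inside them; when the source is needed instead, FileUtils can be used, and the two are best combined.

      When time permits, one could use Swing to build an Apache Tika based tool for searching file contents and finding classes, similar to Everything but, done well, better in that it can also read what is inside the files. This is only an idea, to be implemented gradually when there is time.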
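    As a starting point for the search idea in the summary, extracted text can be kept per file and matched with a case-insensitive substring test, much like a SQL LIKE '%keyword%' query. The sketch below uses only the JDK; `ContentIndex` is a hypothetical class for this illustration, and the strings passed to `add` stand in for text that a parser has already extracted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ContentIndex {
    // Maps a file path to its extracted text, stored in lower case.
    private final Map<String, String> textByPath = new LinkedHashMap<>();

    // Stores already-extracted text under the path of the file it came from.
    void add(String path, String extractedText) {
        textByPath.put(path, extractedText.toLowerCase(Locale.ROOT));
    }

    // Returns the paths whose text contains the keyword, ignoring case.
    List<String> search(String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : textByPath.entrySet()) {
            if (entry.getValue().contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ContentIndex index = new ContentIndex();
        index.add("index.html", "Welcome to nginx!");
        index.add("test.docx", "http_load tutorial");
        System.out.println(index.search("NGINX"));
    }
}
```

    A linear scan like this is fine for a handful of files; for a file server, a search engine such as Solr, as suggested in the summary, is the more realistic choice.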

  • Original article: https://www.cnblogs.com/qlqwjy/p/9801364.html