zoukankan      html  css  js  c++  java
  • Apache Tika源码研究(六)

    上文还没有来得及分析Apache Tika是怎样检测文档的mime类型的,以及怎样根据mime类型找到相应的Parser解析类的,下面接着说

    在tika-parsers.jar路径文件META-INF/services/org.apache.tika.detect.Detector记录了tika提供的mime类型检测类,当然tika还有部分mime类型检测类该文件并没有记录,后面我通过分析源码可以获知。

    该文件包含的检测类我们先睹为快:

    #  Licensed to the Apache Software Foundation (ASF) under one or more
    #  contributor license agreements.  See the NOTICE file distributed with
    #  this work for additional information regarding copyright ownership.
    #  The ASF licenses this file to You under the Apache License, Version 2.0
    #  (the "License"); you may not use this file except in compliance with
    #  the License.  You may obtain a copy of the License at
    #
    #       http://www.apache.org/licenses/LICENSE-2.0
    #
    #  Unless required by applicable law or agreed to in writing, software
    #  distributed under the License is distributed on an "AS IS" BASIS,
    #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    #  See the License for the specific language governing permissions and
    #  limitations under the License.
    
    org.apache.tika.parser.microsoft.POIFSContainerDetector
    org.apache.tika.parser.pkg.ZipContainerDetector

    注意还有vorbis-java-tika-X.jar的同名路径下也存在该文件,tika都会加载进来,所以共包含了三个实现类

    org.apache.tika.parser.microsoft.POIFSContainerDetector

    org.apache.tika.parser.pkg.ZipContainerDetector

    org.gagravarr.tika.OggDetector

    这些tika文档mime类型检测类共同实现了Detector接口:

    最重要的文件的mime类型检测相关接口和类的UML图如下:

    Detector接口源码:

    /**
     * Content type detector. Implementations of this interface use various
     * heuristics to detect the content type of a document based on given
     * input metadata or the first few bytes of the document stream.
     *
     * @since Apache Tika 0.3
     */
    public interface Detector extends Serializable {
    
        /**
         * Detects the content type of the given input document. Returns
         * <code>application/octet-stream</code> if the type of the document
         * can not be detected.
         * <p>
         * If the document input stream is not available, then the first
         * argument may be <code>null</code>. Otherwise the detector may
         * read bytes from the start of the stream to help in type detection.
         * The given stream is guaranteed to support the
         * {@link InputStream#markSupported() mark feature} and the detector
         * is expected to {@link InputStream#mark(int) mark} the stream before
         * reading any bytes from it, and to {@link InputStream#reset() reset}
         * the stream before returning. The stream must not be closed by the
         * detector.
         * <p>
         * The given input metadata is only read, not modified, by the detector.
         *
         * @param input document input stream, or <code>null</code>
         * @param metadata input metadata for the document
         * @return detected media type, or <code>application/octet-stream</code>
         * @throws IOException if the document input stream could not be read
         */
        MediaType detect(InputStream input, Metadata metadata) throws IOException;
    
    }

    实现该接口的最重要的类是CompositeDetector,该类并不提供具体的mime类型检测,而是调用其他的实现类进行mime类型检测,供tika其他类调用

    /**
     * Content type detector that combines multiple different detection mechanisms.
     */
    public class CompositeDetector implements Detector {
    
        /**
         * Serial version UID
         */
        private static final long serialVersionUID = 5980683158436430252L;
    
        private final MediaTypeRegistry registry;
    
        private final List<Detector> detectors;
    
        public CompositeDetector(
                MediaTypeRegistry registry, List<Detector> detectors) {
            this.registry = registry;
            this.detectors = detectors;
        }
    
        public CompositeDetector(List<Detector> detectors) {
            this(new MediaTypeRegistry(), detectors);
        }
    
        public CompositeDetector(Detector... detectors) {
            this(Arrays.asList(detectors));
        }
    
        public MediaType detect(InputStream input, Metadata metadata)
                throws IOException { 
            MediaType type = MediaType.OCTET_STREAM;
            for (Detector detector : getDetectors()) {
                MediaType detected = detector.detect(input, metadata);
                if (registry.isSpecializationOf(detected, type)) {
                    type = detected;
                }
            }
            return type;
        }
    
        /**
         * Returns the component detectors.
         */
        public List<Detector> getDetectors() {
           return Collections.unmodifiableList(detectors);
        }
    }

    构造函数CompositeDetector(MediaTypeRegistry registry, List<Detector> detectors)用于初始化成员变量MediaTypeRegistry registry和List<Detector> detectors

    MediaTypeRegistry registry成员注册了系统提供的mime类型,List<Detector> detectors成员为系统的Detector实现类集合

    MediaType detect(InputStream input, Metadata metadata)方法遍历Detector集合检测InputStream input的mime类型

    CompositeDetector还有一个派生类DefaultDetector,用于初始化CompositeDetector的成员变量

    public class DefaultDetector extends CompositeDetector {
    
        /** Serial version UID */
        private static final long serialVersionUID = -8170114575326908027L;
    
        /**
         * Finds all statically loadable detectors and sort the list by name,
         * rather than discovery order. Detectors are used in the given order,
         * so put the Tika parsers last so that non-Tika (user supplied)
         * parsers can take precedence.
         *
         * @param loader service loader
         * @return ordered list of statically loadable detectors
         */
        private static List<Detector> getDefaultDetectors(
                MimeTypes types, ServiceLoader loader) {
            List<Detector> detectors =
                    loader.loadStaticServiceProviders(Detector.class);
            Collections.sort(detectors, new Comparator<Detector>() {
                public int compare(Detector d1, Detector d2) {
                    String n1 = d1.getClass().getName();
                    String n2 = d2.getClass().getName();
                    boolean t1 = n1.startsWith("org.apache.tika.");
                    boolean t2 = n2.startsWith("org.apache.tika.");
                    if (t1 == t2) {
                        return n1.compareTo(n2);
                    } else if (t1) {
                        return 1;
                    } else {
                        return -1;
                    }
                }
            });
            // Finally the Tika MimeTypes as a fallback
            detectors.add(types);
            return detectors;
        }
    
        private transient final ServiceLoader loader;
    
        public DefaultDetector(MimeTypes types, ServiceLoader loader) {
            super(types.getMediaTypeRegistry(), getDefaultDetectors(types, loader));
            this.loader = loader;
        }
    
        public DefaultDetector(MimeTypes types, ClassLoader loader) {
            this(types, new ServiceLoader(loader));
        }
    
        public DefaultDetector(ClassLoader loader) {
            this(MimeTypes.getDefaultMimeTypes(), loader);
        }
    
        public DefaultDetector(MimeTypes types) {
            this(types, new ServiceLoader());
        }
    
        public DefaultDetector() {
            this(MimeTypes.getDefaultMimeTypes());
        }
    
        @Override
        public List<Detector> getDetectors() {
            if (loader != null) {
                List<Detector> detectors =
                        loader.loadDynamicServiceProviders(Detector.class);
                detectors.addAll(super.getDetectors());
                return detectors;
            } else {
                return super.getDetectors();
            }
        }
    
    }
    List<Detector> getDefaultDetectors(MimeTypes types, ServiceLoader loader)方法加载静态的Detector实现类,而List<Detector> getDetectors()方法加载动态的Detector实现类并包含父类的Detector实现类集合
    我们这里注意到,前者额外调用了detectors.add(types),将MimeTypes types对象也添加到集合里面,因为MimeTypes类是实现了Detector接口的,前面文章我已经提到过。
    所以实际用到的解析类包括四个

    org.apache.tika.parser.microsoft.POIFSContainerDetector

    org.apache.tika.parser.pkg.ZipContainerDetector

    org.gagravarr.tika.OggDetector

    org.apache.tika.mime.MimeTypes


    现在我们该如何调用呢,
    public static void main(String[] args) throws IOException {
            // TODO Auto-generated method stub
            ServiceLoader loader = new ServiceLoader();        
            MimeTypes mimeTypes = MimeTypes.getDefaultMimeTypes();        
            Detector detector=new DefaultDetector(mimeTypes, loader);       
    
            File file=new File("[文件路径]");
            InputStream stream = null;
            try
            {
                stream=new BufferedInputStream(new FileInputStream(file));            
                MediaType type =detector.detect(stream, new Metadata());
                System.out.println("mime类型:"+type.toString());
            }
            finally
            {
                if (stream != null)    stream.close();
            }
        }

     现在还有tika怎样加载Parser实现类的,怎样根据文档的mime类型调用相应的Parser实现类的还没有进行分析,不过这些都相对容易分析了,下文再继续吧。

  • 相关阅读:
    cf B. Sereja and Suffixes
    cf E. Dima and Magic Guitar
    cf D. Dima and Trap Graph
    cf C. Dima and Salad
    最短路径问题(floyd)
    Drainage Ditches(网络流(EK算法))
    图结构练习—BFSDFS—判断可达性(BFS)
    Sorting It All Out(拓扑排序)
    Power Network(最大流(EK算法))
    Labeling Balls(拓扑)
  • 原文地址:https://www.cnblogs.com/chenying99/p/2951092.html
Copyright © 2011-2022 走看看