zoukankan      html  css  js  c++  java
  • Apache Tika源码研究(七)

    tika怎样加载Parser实现类的,怎样根据文档的mime类型调用相应的Parser实现类,本文接着分析

    先熟悉一下tika的解析类的相关接口和类的UML模型:

    Parser接口的源码如下:

    /**
     * Tika parser interface.
     */
    public interface Parser extends Serializable {
    
        /**
         * Returns the set of media types supported by this parser when used
         * with the given parse context.
         *
         * @since Apache Tika 0.7
         * @param context parse context
         * @return immutable set of media types
         */
        Set<MediaType> getSupportedTypes(ParseContext context);
    
        /**
         * Parses a document stream into a sequence of XHTML SAX events.
         * Fills in related document metadata in the given metadata object.
         * <p>
         * The given document stream is consumed but not closed by this method.
         * The responsibility to close the stream remains on the caller.
         * <p>
         * Information about the parsing context can be passed in the context
         * parameter. See the parser implementations for the kinds of context
         * information they expect.
         *
         * @since Apache Tika 0.5
         * @param stream the document stream (input)
         * @param handler handler for the XHTML SAX events (output)
         * @param metadata document metadata (input and output)
         * @param context parse context
         * @throws IOException if the document stream could not be read
         * @throws SAXException if the SAX events could not be processed
         * @throws TikaException if the document could not be parsed
         */
        void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException;
    
    }

    该接口只提供void parse(InputStream stream, ContentHandler handler,Metadata metadata, ParseContext context)方法供其他类调用

    抽象类AbstractParser实现了Parser接口,其源码如下:

    /**
     * Abstract base class for new parsers. This method implements the old
     * deprecated parse method so subclasses won't have to.
     *
     * @since Apache Tika 0.10
     */
    public abstract class AbstractParser implements Parser {
    
        /**
         * Serial version UID.
         */
        private static final long serialVersionUID = 7186985395903074255L;
    
        /**
         * Calls the
         * {@link Parser#parse(InputStream, ContentHandler, Metadata, ParseContext)}
         * method with an empty {@link ParseContext}. This method exists as a
         * leftover from Tika 0.x when the three-argument parse() method still
         * existed in the {@link Parser} interface. No new code should call this
         * method anymore, it's only here for backwards compatibility.
         *
         * @deprecated use the {@link Parser#parse(InputStream, ContentHandler, Metadata, ParseContext)} method instead
         */
        public void parse(
                InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException {
            parse(stream, handler, metadata, new ParseContext());
        }
    
    }

    新增void parse(InputStream stream, ContentHandler handler, Metadata metadata)方法,提供模板方法功能

    下面接着贴出CompositeParser的源码,它继承自抽象类AbstractParser

    /**
     * Composite parser that delegates parsing tasks to a component parser
     * based on the declared content type of the incoming document. A fallback
     * parser is defined for cases where a parser for the given content type is
     * not available.
     */
    public class CompositeParser extends AbstractParser {
    
        /** Serial version UID */
        private static final long serialVersionUID = 2192845797749627824L;
    
        /**
         * Media type registry.
         */
        private MediaTypeRegistry registry;
    
        /**
         * List of component parsers.
         */
        private List<Parser> parsers;
    
        /**
         * The fallback parser, used when no better parser is available.
         */
        private Parser fallback = new EmptyParser();
    
        public CompositeParser(MediaTypeRegistry registry, List<Parser> parsers) {
            this.parsers = parsers;
            this.registry = registry;
        }
    
        public CompositeParser(MediaTypeRegistry registry, Parser... parsers) {
            this(registry, Arrays.asList(parsers));
        }
    
        public CompositeParser() {
            this(new MediaTypeRegistry());
        }
    
        public Map<MediaType, Parser> getParsers(ParseContext context) {
            Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
            for (Parser parser : parsers) {
                for (MediaType type : parser.getSupportedTypes(context)) {
                    map.put(registry.normalize(type), parser);
                }
            }
            return map;
        }
    
        /**
         * Utility method that goes through all the component parsers and finds
         * all media types for which more than one parser declares support. This
         * is useful in tracking down conflicting parser definitions.
         *
         * @since Apache Tika 0.10
         * @see <a href="https://issues.apache.org/jira/browse/TIKA-660">TIKA-660</a>
         * @param context parsing context
         * @return media types that are supported by at least two component parsers
         */
        public Map<MediaType, List<Parser>> findDuplicateParsers(
                ParseContext context) {
            Map<MediaType, Parser> types = new HashMap<MediaType, Parser>();
            Map<MediaType, List<Parser>> duplicates =
                new HashMap<MediaType, List<Parser>>();
            for (Parser parser : parsers) {
                for (MediaType type : parser.getSupportedTypes(context)) {
                    MediaType canonicalType = registry.normalize(type);
                    if (types.containsKey(canonicalType)) {
                        List<Parser> list = duplicates.get(canonicalType);
                        if (list == null) {
                            list = new ArrayList<Parser>();
                            list.add(types.get(canonicalType));
                            duplicates.put(canonicalType, list);
                        }
                        list.add(parser);
                    } else {
                        types.put(canonicalType, parser);
                    }
                }
            }
            return duplicates;
        }
    
        /**
         * Returns the media type registry used to infer type relationships.
         *
         * @since Apache Tika 0.8
         * @return media type registry
         */
        public MediaTypeRegistry getMediaTypeRegistry() {
            return registry;
        }
    
        /**
         * Sets the media type registry used to infer type relationships.
         *
         * @since Apache Tika 0.8
         * @param registry media type registry
         */
        public void setMediaTypeRegistry(MediaTypeRegistry registry) {
            this.registry = registry;
        }
    
        /**
         * Returns the component parsers.
         *
         * @return component parsers, keyed by media type
         */
        public Map<MediaType, Parser> getParsers() {
            return getParsers(new ParseContext());
        }
    
        /**
         * Sets the component parsers.
         *
         * @param parsers component parsers, keyed by media type
         */
        public void setParsers(Map<MediaType, Parser> parsers) {
            this.parsers = new ArrayList<Parser>(parsers.size());
            for (Map.Entry<MediaType, Parser> entry : parsers.entrySet()) {
                this.parsers.add(ParserDecorator.withTypes(
                        entry.getValue(), Collections.singleton(entry.getKey())));
            }
        }
    
        /**
         * Returns the fallback parser.
         *
         * @return fallback parser
         */
        public Parser getFallback() {
            return fallback;
        }
    
        /**
         * Sets the fallback parser.
         *
         * @param fallback fallback parser
         */
        public void setFallback(Parser fallback) {
            this.fallback = fallback;
        }
    
        /**
         * Returns the parser that best matches the given metadata. By default
         * looks for a parser that matches the content type metadata property,
         * and uses the fallback parser if a better match is not found. The
         * type hierarchy information included in the configured media type
         * registry is used when looking for a matching parser instance.
         * <p>
         * Subclasses can override this method to provide more accurate
         * parser resolution.
         *
         * @param metadata document metadata
         * @return matching parser
         */
        protected Parser getParser(Metadata metadata) {
            return getParser(metadata, new ParseContext());
        }
    
        protected Parser getParser(Metadata metadata, ParseContext context) {
            Map<MediaType, Parser> map = getParsers(context);
            MediaType type = MediaType.parse(metadata.get(Metadata.CONTENT_TYPE));
            if (type != null) {
               // We always work on the normalised, canonical form
               type = registry.normalize(type);
            }
            
            while (type != null) {
                // Try finding a parser for the type
                Parser parser = map.get(type);
                if (parser != null) {
                    return parser;
                }
                
                // Failing that, try for the parent of the type
                type = registry.getSupertype(type);
            }
            return fallback;
        }
    
        public Set<MediaType> getSupportedTypes(ParseContext context) {
            return getParsers(context).keySet();
        }
    
        /**
         * Delegates the call to the matching component parser.
         * <p>
         * Potential {@link RuntimeException}s, {@link IOException}s and
         * {@link SAXException}s unrelated to the given input stream and content
         * handler are automatically wrapped into {@link TikaException}s to better
         * honor the {@link Parser} contract.
         */
        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            Parser parser = getParser(metadata);
            TemporaryResources tmp = new TemporaryResources();
            try {
                TikaInputStream taggedStream = TikaInputStream.get(stream, tmp);
                TaggedContentHandler taggedHandler = new TaggedContentHandler(handler);
                try {
                    parser.parse(taggedStream, taggedHandler, metadata, context);
                } catch (RuntimeException e) {
                    throw new TikaException(
                            "Unexpected RuntimeException from " + parser, e);
                } catch (IOException e) {
                    taggedStream.throwIfCauseOf(e);
                    throw new TikaException(
                            "TIKA-198: Illegal IOException from " + parser, e);
                } catch (SAXException e) {
                    taggedHandler.throwIfCauseOf(e);
                    throw new TikaException(
                            "TIKA-237: Illegal SAXException from " + parser, e);
                }
            } finally {
                tmp.dispose();
            }
        }
    
    }

     该类的注释很清楚,相当于将解析任务委托给了其他的解析组件,而自身提供的parser方法供其他类调用

    且分析CompositeParser类是怎样将解析任务委托给其他解析组件的,关键是parser方法的这行代码 Parser parser = getParser(metadata);

    它调用了下面的方法:

        protected Parser getParser(Metadata metadata) {
            return getParser(metadata, new ParseContext());
        }
    
        protected Parser getParser(Metadata metadata, ParseContext context) {
            Map<MediaType, Parser> map = getParsers(context);
            MediaType type = MediaType.parse(metadata.get(Metadata.CONTENT_TYPE));
            if (type != null) {
               // We always work on the normalised, canonical form
               type = registry.normalize(type);
            }
            
            while (type != null) {
                // Try finding a parser for the type
                Parser parser = map.get(type);
                if (parser != null) {
                    return parser;
                }
                
                // Failing that, try for the parent of the type
                type = registry.getSupertype(type);
            }
            return fallback;
        }

    执行流程是首先获取mime类型跟相应的Parser实现类的映射Map<MediaType, Parser> ,然后根据Metadata的Metadata.CONTENT_TYPE属性得到MediaType类型,最后从Map<MediaType, Parser>获取相应的Parser实现类

     上面的代码Map<MediaType, Parser> map = getParsers(context)是获取Map<MediaType, Parser>映射

    public Map<MediaType, Parser> getParsers(ParseContext context) {
            Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
            for (Parser parser : parsers) {
                for (MediaType type : parser.getSupportedTypes(context)) {
                    map.put(registry.normalize(type), parser);
                }
            }
            return map;
        }

    即根据构造方法初始化的List<Parser> parsers组件集合,这里注意的是如果该组件类集合中的成员之一为CompositeParser本身的类型,则该成员提供的可以支持的mime类型同时又来自于该成员的解析组件集合(这里也许是CompositeParser命名的原因,这里用到了Composite模式),我们可以看到它Set<MediaType> getSupportedTypes(ParseContext context)方法:

        public Set<MediaType> getSupportedTypes(ParseContext context) {
            return getParsers(context).keySet();
        }

     Composite模式的简要UML模型图如下:

    我们接下来分析DefaultParser的源码,该类继承自CompositeParser类,用于初始化CompositeParser类的相关成员变量

    /**
     * A composite parser based on all the {@link Parser} implementations
     * available through the
     * {@link javax.imageio.spi.ServiceRegistry service provider mechanism}.
     *
     * @since Apache Tika 0.8
     */
    public class DefaultParser extends CompositeParser {
    
        /** Serial version UID */
        private static final long serialVersionUID = 3612324825403757520L;
    
        /**
         * Finds all statically loadable parsers and sort the list by name,
         * rather than discovery order. CompositeParser takes the last
         * parser for any given media type, so put the Tika parsers first
         * so that non-Tika (user supplied) parsers can take precedence.
         *
         * @param loader service loader
         * @return ordered list of statically loadable parsers
         */
        private static List<Parser> getDefaultParsers(ServiceLoader loader) {
            List<Parser> parsers =
                    loader.loadStaticServiceProviders(Parser.class);
            Collections.sort(parsers, new Comparator<Parser>() {
                public int compare(Parser p1, Parser p2) {
                    String n1 = p1.getClass().getName();
                    String n2 = p2.getClass().getName();
                    boolean t1 = n1.startsWith("org.apache.tika.");
                    boolean t2 = n2.startsWith("org.apache.tika.");
                    if (t1 == t2) {
                        return n1.compareTo(n2);
                    } else if (t1) {
                        return -1;
                    } else {
                        return 1;
                    }
                }
            });
            return parsers;
        }
    
        private transient final ServiceLoader loader;
    
        public DefaultParser(MediaTypeRegistry registry, ServiceLoader loader) {
            super(registry, getDefaultParsers(loader));
            this.loader = loader;
        }
    
        public DefaultParser(MediaTypeRegistry registry, ClassLoader loader) {
            this(registry, new ServiceLoader(loader));
        }
    
        public DefaultParser(ClassLoader loader) {
            this(MediaTypeRegistry.getDefaultRegistry(), new ServiceLoader(loader));
        }
    
        public DefaultParser(MediaTypeRegistry registry) {
            this(registry, new ServiceLoader());
        }
    
        public DefaultParser() {
            this(MediaTypeRegistry.getDefaultRegistry());
        }
    
        @Override
        public Map<MediaType, Parser> getParsers(ParseContext context) {
            Map<MediaType, Parser> map = super.getParsers(context);
    
            if (loader != null) {
                // Add dynamic parser service (they always override static ones)
                MediaTypeRegistry registry = getMediaTypeRegistry();
                for (Parser parser
                        : loader.loadDynamicServiceProviders(Parser.class)) {
                    for (MediaType type : parser.getSupportedTypes(context)) {
                        map.put(registry.normalize(type), parser);
                    }
                }
            }
    
            return map;
        }
    
    }

    该类主要是为基类初始化MediaTypeRegistry registry成员与List<Parser> parsers成员,它本身并没有覆盖void parse(InputStream stream, ContentHandler handler,Metadata metadata, ParseContext context)方法,为的是执行基类的方法(根据mime类型执行具体parser类的parse方法)

    最后来分析AutoDetectParser类的源码,它也继承自CompositeParser类:

    public class AutoDetectParser extends CompositeParser {
    
        /** Serial version UID */
        private static final long serialVersionUID = 6110455808615143122L;
    
        /**
         * The type detector used by this parser to auto-detect the type
         * of a document.
         */
        private Detector detector; // always set in the constructor
    
        /**
         * Creates an auto-detecting parser instance using the default Tika
         * configuration.
         */
        public AutoDetectParser() {
            this(TikaConfig.getDefaultConfig());
        }
    
        public AutoDetectParser(Detector detector) {
            this(TikaConfig.getDefaultConfig());
            setDetector(detector);
        }
    
        /**
         * Creates an auto-detecting parser instance using the specified set of parser.
         * This allows one to create a Tika configuration where only a subset of the
         * available parsers have their 3rd party jars included, as otherwise the
         * use of the default TikaConfig will throw various "ClassNotFound" exceptions.
         * 
         * @param detector Detector to use
         * @param parsers
         */
        public AutoDetectParser(Parser...parsers) {
            this(new DefaultDetector(), parsers);
        }
    
        public AutoDetectParser(Detector detector, Parser...parsers) {
            super(MediaTypeRegistry.getDefaultRegistry(), parsers);
            setDetector(detector);
        }
    
        public AutoDetectParser(TikaConfig config) {
            super(config.getMediaTypeRegistry(), config.getParser());
            setDetector(config.getDetector());
        }
    
        /**
         * Returns the type detector used by this parser to auto-detect the type
         * of a document.
         *
         * @return type detector
         * @since Apache Tika 0.4
         */
        public Detector getDetector() {
            return detector;
        }
    
        /**
         * Sets the type detector used by this parser to auto-detect the type
         * of a document.
         *
         * @param detector type detector
         * @since Apache Tika 0.4
         */
        public void setDetector(Detector detector) {
            this.detector = detector;
        }
    
        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            TemporaryResources tmp = new TemporaryResources();
            try {
                TikaInputStream tis = TikaInputStream.get(stream, tmp);
    
                // Automatically detect the MIME type of the document
                MediaType type = detector.detect(tis, metadata);
                metadata.set(Metadata.CONTENT_TYPE, type.toString());
    
                // TIKA-216: Zip bomb prevention
                SecureContentHandler sch = new SecureContentHandler(handler, tis);
                try {
                    // Parse the document
                    super.parse(tis, sch, metadata, context);
                } catch (SAXException e) {
                    // Convert zip bomb exceptions to TikaExceptions
                    sch.throwIfCauseOf(e);
                    throw e;
                }
            } finally {
                tmp.dispose();
            }
        }
    
        public void parse(
                InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException {
            ParseContext context = new ParseContext();
            context.set(Parser.class, this);
            parse(stream, handler, metadata, context);
        }
    
    }

     该类也初始化基类的MediaTypeRegistry registry成员与List<Parser> parsers成员,不过这里的List<Parser> parsers成员有TikaConfig类提供,后者默认提供的Parser实现类为DefaultParser

    它的void parse(InputStream stream, ContentHandler handler,Metadata metadata, ParseContext context)方法首先检测文档的mime类型,然后将解析处理委托给CompositeParser基类执行,自身对外提供接口

    所以整个解析的流程是

    一、AutoDetectParser的parse方法:首先检测文件的mime类型,然后将解析任务交给基类CompositeParser的parse方法

    二、AutoDetectParser的基类CompositeParser的parse方法:根据参数里面的mime类型获取解析类DefaultParser(支持所有已经注册的mime类型,由List<Parser> parsers成员提供)

    三、调用DefaultParser的parse方法(DefaultParser默认执行父类的parse方法),即基类CompositeParser的parse方法,根据参数里面的mime类型获取具体解析类

    四、最后执行具体解析类的parse方法

    这里第一次AutoDetectParser初始化基类CompositeParser的parser组件集合是DefaultParser,基类CompositeParser的parse方法委托给DefaultParser

    第二次DefaultParser初始化基类的CompositeParser的parser组件集合是具体的parser实现类集合,基类CompositeParser的parse方法委托给具体的parser实现类

    这里体现的是Composite模式的运用。

  • 相关阅读:
    (原)在ubuntu 中安装 swi prolog 和 简单的使用
    (转) 新手入门:C/C++中的结构体
    (转) eclipse debug (调试) 学习心得
    flutter: 根视图、根元素与根渲染
    eggjs的参数校验模块egg-validate的使用和进一步定制化升级
    Vue源码中用到的工具函数
    21 项优化 React App 性能的技术
    react-redux 的使用
    编写 Vue.js 组件前需要知道的 10 件事
    Flutter实现抽屉动画效果
  • 原文地址:https://www.cnblogs.com/chenying99/p/2952294.html
Copyright © 2011-2022 走看看