  • Apache Tika Source Code Study (Part 4)

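    The previous post analyzed the source code of the concrete parser class HtmlParser, which parses web documents, and showed how Apache Tika handles encoding detection.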

    (When parsing web pages, HtmlParser does not actually use the SAXParser object held by the ParseContext class; it uses a separate component, TagSoup.)

    This post continues with the event-handling classes Tika uses for SAX parsing of XML-format files; the best parts are left for later.

    The SAX API, which is part of JAXP, defines four event-handling interfaces: EntityResolver, DTDHandler, ContentHandler and ErrorHandler.

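    It also provides a default handler class, DefaultHandler, which implements all four interfaces with empty methods. This makes it easy to write our own SAX event handlers: we only need to extend this class and override the callbacks we care about.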
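
    As a minimal sketch of this extension point, the handler below subclasses DefaultHandler and overrides only two callbacks. The class name ElementCounter and the sample XML are made up for illustration; only JDK classes are used, not Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts elements and collects character data.
// Every callback that is not overridden keeps DefaultHandler's no-op behaviour.
class ElementCounter extends DefaultHandler {
    int elements = 0;
    final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Parses the given XML string with the JDK's default SAX parser.
    static ElementCounter parse(String xml) throws Exception {
        ElementCounter handler = new ElementCounter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        ElementCounter h = parse("<a><b>hello</b><c/></a>");
        System.out.println(h.elements + " elements, text=" + h.text);
    }
}
```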

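    The event-handling classes provided by Apache Tika use the decorator pattern, with wrappers nested layer upon layer, which makes them dizzying to read. The analysis below covers only some of the classes; the other event-handling classes are similar and are not discussed further.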

    Let's first look at the UML model of the key classes. [Figure: UML class diagram of the key handler classes]

    ContentHandlerDecorator extends the default SAX handler class DefaultHandler. As the name suggests, it applies the decorator pattern. Here is its source code:

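    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.Locator;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;
    
    /**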
     * Decorator base class for the {@link ContentHandler} interface. This class
     * simply delegates all SAX events calls to an underlying decorated handler
     * instance. Subclasses can provide extra decoration by overriding one or more
     * of the SAX event methods.
     */
    public class ContentHandlerDecorator extends DefaultHandler {
    
        /**
         * Decorated SAX event handler.
         */
        private ContentHandler handler;
    
        /**
         * Creates a decorator for the given SAX event handler.
         *
         * @param handler SAX event handler to be decorated
         */
        public ContentHandlerDecorator(ContentHandler handler) {
            assert handler != null;
            this.handler = handler;
        }
    
        /**
         * Creates a decorator that by default forwards incoming SAX events to
         * a dummy content handler that simply ignores all the events. Subclasses
         * should use the {@link #setContentHandler(ContentHandler)} method to
         * switch to a more usable underlying content handler.
         */
        protected ContentHandlerDecorator() {
            this(new DefaultHandler());
        }
    
        /**
         * Sets the underlying content handler. All future SAX events will be
         * directed to this handler instead of the one that was previously used.
         *
         * @param handler content handler
         */
        protected void setContentHandler(ContentHandler handler) {
            assert handler != null;
            this.handler = handler;
        }
    
        @Override
        public void startPrefixMapping(String prefix, String uri)
                throws SAXException {
            try {
                handler.startPrefixMapping(prefix, uri);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void endPrefixMapping(String prefix) throws SAXException {
            try {
                handler.endPrefixMapping(prefix);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void processingInstruction(String target, String data)
                throws SAXException {
            try {
                handler.processingInstruction(target, data);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void setDocumentLocator(Locator locator) {
            handler.setDocumentLocator(locator);
        }
    
        @Override
        public void startDocument() throws SAXException {
            try {
                handler.startDocument();
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void endDocument() throws SAXException {
            try {
                handler.endDocument();
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void startElement(
                String uri, String localName, String name, Attributes atts)
                throws SAXException {
            try {
                handler.startElement(uri, localName, name, atts);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void endElement(String uri, String localName, String name)
                throws SAXException {
            try {
                handler.endElement(uri, localName, name);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            try {
                handler.characters(ch, start, length);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void ignorableWhitespace(char[] ch, int start, int length)
                throws SAXException {
            try {
                handler.ignorableWhitespace(ch, start, length);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void skippedEntity(String name) throws SAXException {
            try {
                handler.skippedEntity(name);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public String toString() {
            return handler.toString();
        }
    
        /**
         * Handle any exceptions thrown by methods in this class. This method
         * provides a single place to implement custom exception handling. The
         * default behaviour is simply to re-throw the given exception, but
         * subclasses can also provide alternative ways of handling the situation.
         *
         * @param exception the exception that was thrown
         * @throws SAXException the exception (if any) thrown to the client
         */
        protected void handleException(SAXException exception) throws SAXException {
            throw exception;
        }
    
    }

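    This decorator holds a reference to a ContentHandler object, and each of its event methods simply calls the method of the same name on that handler. A SAXException thrown by the wrapped handler passes through handleException, which rethrows it by default.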
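
    The delegation idea can be sketched without Tika on the classpath. The code below is a simplified stand-in, not Tika's implementation: ForwardingHandler forwards only character events (the real ContentHandlerDecorator forwards every SAX event), and UpperCaseHandler, TextCollector and DecoratorDemo are hypothetical names invented for this illustration.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class DecoratorDemo {
    // Builds a two-layer chain and feeds it the events of the given XML.
    static String run(String xml) throws Exception {
        DefaultHandler chain = new UpperCaseHandler(new TextCollector());
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), chain);
        return chain.toString(); // toString() is delegated down the chain
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<p>tika</p>"));
    }
}

// Simplified stand-in for the decorator base class: holds the wrapped
// handler and forwards character events and toString() to it.
class ForwardingHandler extends DefaultHandler {
    private final ContentHandler handler;

    ForwardingHandler(ContentHandler handler) {
        this.handler = handler;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        handler.characters(ch, start, length);
    }

    @Override
    public String toString() {
        return handler.toString();
    }
}

// Hypothetical decorator: upper-cases text, then lets the base class forward it.
class UpperCaseHandler extends ForwardingHandler {
    UpperCaseHandler(ContentHandler handler) {
        super(handler);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        char[] upper = new String(ch, start, length).toUpperCase(Locale.ROOT).toCharArray();
        super.characters(upper, 0, upper.length);
    }
}

// Innermost handler: collects the text it receives.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}
```

    A subclass overrides only the event it wants to change and leaves the rest to the base class, which is how the nested wrappers in Tika stay small.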

    Next, the source code of the concrete decorator class BodyContentHandler:

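    import java.io.OutputStream;
    import java.io.Writer;
    
    import org.apache.tika.sax.xpath.Matcher;
    import org.apache.tika.sax.xpath.MatchingContentHandler;
    import org.apache.tika.sax.xpath.XPathParser;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;
    
    /**
     * Content handler decorator that only passes everything inside
     * the XHTML &lt;body/&gt; tag to the underlying handler. Note that
     * the &lt;body/&gt; tag itself is <em>not</em> passed on.
     */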
    public class BodyContentHandler extends ContentHandlerDecorator {
    
        /**
         * XHTML XPath parser.
         */
        private static final XPathParser PARSER =
            new XPathParser("xhtml", XHTMLContentHandler.XHTML);
    
        /**
         * The XPath matcher used to select the XHTML body contents.
         */
        private static final Matcher MATCHER =
            PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");
    
        /**
         * Creates a content handler that passes all XHTML body events to the
         * given underlying content handler.
         *
         * @param handler content handler
         */
        public BodyContentHandler(ContentHandler handler) {
            super(new MatchingContentHandler(handler, MATCHER));
        }
    
        /**
         * Creates a content handler that writes XHTML body character events to
         * the given writer.
         *
         * @param writer writer
         */
        public BodyContentHandler(Writer writer) {
            this(new WriteOutContentHandler(writer));
        }
    
        /**
         * Creates a content handler that writes XHTML body character events to
         * the given output stream using the default encoding.
         *
         * @param stream output stream
         */
        public BodyContentHandler(OutputStream stream) {
            this(new WriteOutContentHandler(stream));
        }
    
        /**
         * Creates a content handler that writes XHTML body character events to
         * an internal string buffer. The contents of the buffer can be retrieved
         * using the {@link #toString()} method.
         * <p>
         * The internal string buffer is bounded at the given number of characters.
         * If this write limit is reached, then a {@link SAXException} is thrown.
         *
         * @since Apache Tika 0.7
         * @param writeLimit maximum number of characters to include in the string,
         *                   or -1 to disable the write limit
         */
        public BodyContentHandler(int writeLimit) {
            this(new WriteOutContentHandler(writeLimit));
        }
    
        /**
         * Creates a content handler that writes XHTML body character events to
         * an internal string buffer. The contents of the buffer can be retrieved
         * using the {@link #toString()} method.
         * <p>
         * The internal string buffer is bounded at 100k characters. If this write
         * limit is reached, then a {@link SAXException} is thrown.
         */
        public BodyContentHandler() {
            this(new WriteOutContentHandler());
        }
    
    }

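    In the end, every constructor initializes the decorated object by calling the superclass constructor: the overloads all chain to BodyContentHandler(ContentHandler), which wraps the given handler in a MatchingContentHandler and passes it to super.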
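
    To make the "body only" filtering concrete, here is a rough JDK-only sketch of the same behaviour. It is an assumption-laden simplification: instead of Tika's XPath matcher it tracks a nesting depth, it matches any element whose qualified name is "body" and ignores namespaces, and it forwards only element and character events. BodyOnlyHandler and BodyFilterDemo are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BodyFilterDemo {
    // Returns the text that reaches the innermost handler for the given XML.
    static String run(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        // Innermost handler: appends every character event it receives.
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new BodyOnlyHandler(sink));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(
                "<html><head><title>T</title></head><body><p>Hello</p> world</body></html>"));
    }
}

// Hypothetical filter: forwards events only while inside <body>; the
// <body> start and end tags themselves are not forwarded.
class BodyOnlyHandler extends DefaultHandler {
    private final ContentHandler handler;
    private int depth = 0; // 0 = outside <body>, 1 = directly inside it

    BodyOnlyHandler(ContentHandler handler) {
        this.handler = handler;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0) {
            depth++;
            handler.startElement(uri, localName, qName, atts);
        } else if ("body".equals(qName)) {
            depth = 1;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 1) {
            depth--;
            handler.endElement(uri, localName, qName);
        } else if (depth == 1) {
            depth = 0; // closing </body>: leave the body, forward nothing
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth > 0) {
            handler.characters(ch, start, length);
        }
    }
}
```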

  • Original article: https://www.cnblogs.com/chenying99/p/2949160.html