  • Apache Tika Source Code Study (Part 3)

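    In the previous post we saw that Tika parses XHTML documents through a SAXParser. Here I start from a concrete parser class, HtmlParser, and walk through how a web page is parsed.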

    First, the class hierarchy: HtmlParser extends the abstract class AbstractParser, which in turn implements the Parser interface.

    The Parser interface declares the following methods:

    /**
     * Tika parser interface.
     */
    public interface Parser extends Serializable {
    
        /**
         * Returns the set of media types supported by this parser when used
         * with the given parse context.
         *
         * @since Apache Tika 0.7
         * @param context parse context
         * @return immutable set of media types
         */
        Set<MediaType> getSupportedTypes(ParseContext context);
    
        /**
         * Parses a document stream into a sequence of XHTML SAX events.
         * Fills in related document metadata in the given metadata object.
         * <p>
         * The given document stream is consumed but not closed by this method.
         * The responsibility to close the stream remains on the caller.
         * <p>
         * Information about the parsing context can be passed in the context
         * parameter. See the parser implementations for the kinds of context
         * information they expect.
         *
         * @since Apache Tika 0.5
         * @param stream the document stream (input)
         * @param handler handler for the XHTML SAX events (output)
         * @param metadata document metadata (input and output)
         * @param context parse context
         * @throws IOException if the document stream could not be read
         * @throws SAXException if the SAX events could not be processed
         * @throws TikaException if the document could not be parsed
         */
        void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException;
    
    }

    The first method returns the set of supported media types.

    The second method is the actual parsing method.

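    The abstract class AbstractParser only adds a thin wrapper around the interface's parse() method: a deprecated three-argument overload that delegates to the four-argument version with an empty ParseContext, as a convenience for callers. Its code is as follows: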

    public abstract class AbstractParser implements Parser {
    
        /**
         * Serial version UID.
         */
        private static final long serialVersionUID = 7186985395903074255L;
    
        /**
         * Calls the
         * {@link Parser#parse(InputStream, ContentHandler, Metadata, ParseContext)}
         * method with an empty {@link ParseContext}. This method exists as a
         * leftover from Tika 0.x when the three-argument parse() method still
         * existed in the {@link Parser} interface. No new code should call this
         * method anymore, it's only here for backwards compatibility.
         *
         * @deprecated use the {@link Parser#parse(InputStream, ContentHandler, Metadata, ParseContext)} method instead
         */
        public void parse(
                InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException {
            parse(stream, handler, metadata, new ParseContext());
        }
    
    }

    Now let's look at the key parts of HtmlParser. Part of its source is shown below:

    /**
     * HTML parser. Uses TagSoup to turn the input document to HTML SAX events,
     * and post-processes the events to produce XHTML and metadata expected by
     * Tika clients.
     */
    public class HtmlParser extends AbstractParser {
    
        /** Serial version UID */
        private static final long serialVersionUID = 7895315240498733128L;
    
        private static final Set<MediaType> SUPPORTED_TYPES =
            Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
                    MediaType.text("html"),
                    MediaType.application("xhtml+xml"),
                    MediaType.application("vnd.wap.xhtml+xml"),
                    MediaType.application("x-asp"))));
    
        private static final ServiceLoader LOADER =
                new ServiceLoader(HtmlParser.class.getClassLoader());
    
        /**
         * HTML schema singleton used to amortise the heavy instantiation time.
         */
        private static final Schema HTML_SCHEMA = new HTMLSchema();
    
        public Set<MediaType> getSupportedTypes(ParseContext context) {
            return SUPPORTED_TYPES;
        }
    
        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            // Automatically detect the character encoding
            AutoDetectReader reader = new AutoDetectReader(
                    new CloseShieldInputStream(stream), metadata, LOADER);
            try {
                Charset charset = reader.getCharset();
                String previous = metadata.get(Metadata.CONTENT_TYPE);
                if (previous == null || previous.startsWith("text/html")) {
                    MediaType type = new MediaType(MediaType.TEXT_HTML, charset);
                    metadata.set(Metadata.CONTENT_TYPE, type.toString());
                }
                // deprecated, see TIKA-431
                metadata.set(Metadata.CONTENT_ENCODING, charset.name());
    
                // Get the HTML mapper from the parse context
                HtmlMapper mapper =
                        context.get(HtmlMapper.class, new HtmlParserMapper());
    
                // Parse the HTML document
                org.ccil.cowan.tagsoup.Parser parser =
                        new org.ccil.cowan.tagsoup.Parser();
    
                // TIKA-528: Reuse share schema to avoid heavy instantiation
                parser.setProperty(
                        org.ccil.cowan.tagsoup.Parser.schemaProperty, HTML_SCHEMA);
                // TIKA-599: Shared schema is thread-safe only if bogons are ignored
                parser.setFeature(
                        org.ccil.cowan.tagsoup.Parser.ignoreBogonsFeature, true);
    
                parser.setContentHandler(new XHTMLDowngradeHandler(
                        new HtmlHandler(mapper, handler, metadata)));
    
                parser.parse(reader.asInputSource());
            } finally {
                reader.close();
            }
        }
    
      // other methods omitted
    
    }

    As the class comment makes clear, HtmlParser relies on the TagSoup component to parse HTML and convert it into well-formed XHTML.

    The getSupportedTypes(ParseContext context) method returns the set of supported media types.
    The parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) method does the actual parsing of the HTML document.
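    One detail of parse() worth noting: the caller's stream is wrapped in a CloseShieldInputStream before it is handed to AutoDetectReader, so the reader.close() call in the finally block does not close the caller's stream. This matches the Parser contract quoted earlier ("consumed but not closed"). The sketch below illustrates the idea with JDK classes only. NonClosingStream and TrackingStream are hypothetical names for this example, and NonClosingStream is a simplified stand-in, not the real CloseShieldInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class CloseShieldDemo {

    // Hypothetical, simplified stand-in for CloseShieldInputStream:
    // reads pass through to the wrapped stream, close() is swallowed.
    static class NonClosingStream extends FilterInputStream {
        NonClosingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // intentionally empty: the caller keeps ownership of the stream
        }
    }

    // Helper stream that records whether close() reached it.
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Simulates a parser that reads everything and then closes its own reader.
    static boolean callerStreamClosed(boolean shield) throws IOException {
        TrackingStream caller =
                new TrackingStream("<html/>".getBytes(StandardCharsets.UTF_8));
        InputStream parserSide = shield ? new NonClosingStream(caller) : caller;
        try (InputStream in = parserSide) {
            while (in.read() != -1) {
                // consume the stream
            }
        }
        return caller.closed;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("with shield, caller stream closed: " + callerStreamClosed(true));
        System.out.println("without shield, caller stream closed: " + callerStreamClosed(false));
    }
}
```

    With the shield in place the caller remains responsible for closing the stream, which is the behaviour the Parser javadoc promises.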

    The encoding detection class: AutoDetectReader
    AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(stream), metadata, LOADER);
    This class extends BufferedReader and wraps the input stream. Its source is as follows:
    /**
     * An input stream reader that automatically detects the character encoding
     * to be used for converting bytes to characters.
     *
     * @since Apache Tika 1.2
     */
    public class AutoDetectReader extends BufferedReader {
    
        private static final ServiceLoader DEFAULT_LOADER =
                new ServiceLoader(AutoDetectReader.class.getClassLoader());
    
        private static Charset detect(
                InputStream input, Metadata metadata,
                List<EncodingDetector> detectors)
                throws IOException, TikaException {
            // Ask all given detectors for the character encoding
            for (EncodingDetector detector : detectors) {
                Charset charset = detector.detect(input, metadata);
                if (charset != null) {
                    return charset;
                }
            }
    
            // Try determining the encoding based on hints in document metadata
            MediaType type = MediaType.parse(metadata.get(Metadata.CONTENT_TYPE));
            if (type != null) {
                String charset = type.getParameters().get("charset");
                if (charset != null) {
                    try {
                        return CharsetUtils.forName(charset);
                    } catch (Exception e) {
                        // ignore
                    }
                }
            }
    
            throw new TikaException(
                    "Failed to detect the character encoding of a document");
        }
    
        private final Charset charset;
    
        private AutoDetectReader(InputStream stream, Charset charset)
                throws IOException {
            super(new InputStreamReader(stream, charset));
            this.charset = charset;
    
            // TIKA-240: Drop the BOM if present
            mark(1);
            if (read() != '\ufeff') { // zero-width no-break space
                reset();
            }
        }
    
        private AutoDetectReader(
                BufferedInputStream stream, Metadata metadata,
                List<EncodingDetector> detectors)
                throws IOException, TikaException {
            this(stream, detect(stream, metadata, detectors));
        }
    
        public AutoDetectReader(
                InputStream stream, Metadata metadata,
                ServiceLoader loader) throws IOException, TikaException {
            this(new BufferedInputStream(stream), metadata,
                    loader.loadServiceProviders(EncodingDetector.class));
        }
    
        public AutoDetectReader(InputStream stream, Metadata metadata)
                throws IOException, TikaException {
            this(new BufferedInputStream(stream), metadata, DEFAULT_LOADER);
        }
    
        public AutoDetectReader(InputStream stream)
                throws IOException, TikaException {
            this(stream, new Metadata());
        }
    
        public Charset getCharset() {
            return charset;
        }
    
        public InputSource asInputSource() {
            InputSource source = new InputSource(this);
            source.setEncoding(charset.name());
            return source;
        }
    
    }
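    The private constructor above also drops a leading byte order mark (the TIKA-240 comment) with a mark/read/reset sequence. Here is a small standalone sketch of that idiom using only JDK classes. The class name BomSkipDemo and the helper openWithoutBom are made up for illustration, and the example assumes UTF-8 input instead of a detected charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomSkipDemo {

    // Hypothetical helper: wraps UTF-8 bytes in a reader and drops a leading BOM,
    // using the same mark/read/reset idiom as the AutoDetectReader constructor.
    static BufferedReader openWithoutBom(byte[] bytes) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
        reader.mark(1);                   // remember the position before the first char
        if (reader.read() != '\ufeff') {  // not a BOM: rewind so no character is lost
            reader.reset();
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(openWithoutBom(withBom).readLine());    // prints "hi"
        System.out.println(openWithoutBom(withoutBom).readLine()); // prints "hi"
    }
}
```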

    The key method here is

    static Charset detect(InputStream input, Metadata metadata, List<EncodingDetector> detectors)

    which determines the character encoding of the document.
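    The control flow of detect() is a simple chain: ask each detector in order, take the first non-null answer, fall back to the charset declared in the content type, and fail if nothing works. The following sketch reproduces only that flow with JDK classes. SimpleDetector, NEVER and UTF8_BOM are hypothetical stand-ins for this example and are not part of Tika, and IllegalStateException stands in for TikaException.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class DetectorChainDemo {

    // Hypothetical stand-in for Tika's EncodingDetector: returns null when unsure.
    interface SimpleDetector {
        Charset detect(byte[] data);
    }

    // A detector that never recognises anything.
    static final SimpleDetector NEVER = data -> null;

    // A detector that recognises only a UTF-8 byte order mark.
    static final SimpleDetector UTF8_BOM = data ->
            data.length >= 3
                    && (data[0] & 0xFF) == 0xEF
                    && (data[1] & 0xFF) == 0xBB
                    && (data[2] & 0xFF) == 0xBF
                    ? StandardCharsets.UTF_8 : null;

    // The first detector that returns a non-null charset wins; otherwise fall back
    // to the declared charset name (for example from a Content-Type value).
    static Charset detect(byte[] data, String declaredCharset,
                          List<SimpleDetector> detectors) {
        for (SimpleDetector detector : detectors) {
            Charset charset = detector.detect(data);
            if (charset != null) {
                return charset;
            }
        }
        if (declaredCharset != null) {
            try {
                return Charset.forName(declaredCharset);
            } catch (IllegalArgumentException e) {
                // unknown or malformed charset name: fall through to the failure below
            }
        }
        throw new IllegalStateException(
                "Failed to detect the character encoding of a document");
    }

    public static void main(String[] args) {
        List<SimpleDetector> chain = Arrays.asList(NEVER, UTF8_BOM);
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] plain = {'h', 'i'};
        System.out.println(detect(withBom, "ISO-8859-1", chain)); // detector answer wins
        System.out.println(detect(plain, "ISO-8859-1", chain));   // falls back to the hint
    }
}
```

    The order of the detector list matters, because an earlier detector that returns a charset prevents later ones from being consulted.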

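    List<EncodingDetector> is the collection of encoding detectors. It comes from loader.loadServiceProviders(EncodingDetector.class), which loads the list of detector implementations.
    Next, let's look at the source of the ServiceLoader class: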
    /**
     * Internal utility class that Tika uses to look up service providers.
     *
     * @since Apache Tika 0.9
     */
    public class ServiceLoader {
    
        /**
         * The default context class loader to use for all threads, or
         * <code>null</code> to automatically select the context class loader.
         */
        private static volatile ClassLoader contextClassLoader = null;
    
        /**
         * The dynamic set of services available in an OSGi environment.
         * Managed by the {@link TikaActivator} class and used as an additional
         * source of service instances in the {@link #loadServiceProviders(Class)}
         * method.
         */
        private static final Map<Object, Object> services =
                new HashMap<Object, Object>();
    
        /**
         * Returns the context class loader of the current thread. If such
         * a class loader is not available, then the loader of this class or
         * finally the system class loader is returned.
         *
         * @see <a href="https://issues.apache.org/jira/browse/TIKA-441">TIKA-441</a>
         * @return context class loader, or <code>null</code> if no loader
         *         is available
         */
        static ClassLoader getContextClassLoader() {
            ClassLoader loader = contextClassLoader;
            if (loader == null) {
                loader = ServiceLoader.class.getClassLoader();
            }
            if (loader == null) {
                loader = ClassLoader.getSystemClassLoader();
            }
            return loader;
        }
    
        /**
         * Sets the context class loader to use for all threads that access
         * this class. Used for example in an OSGi environment to avoid problems
         * with the default context class loader.
         *
         * @param loader default context class loader,
         *               or <code>null</code> to automatically pick the loader
         */
        public static void setContextClassLoader(ClassLoader loader) {
            contextClassLoader = loader;
        }
    
        static void addService(Object reference, Object service) {
            synchronized (services) {
                services.put(reference, service);
            }
        }
    
        static Object removeService(Object reference) {
            synchronized (services) {
                return services.remove(reference);
            }
        }
    
        private final ClassLoader loader;
    
        private final LoadErrorHandler handler;
    
        private final boolean dynamic;
    
        public ServiceLoader(
                ClassLoader loader, LoadErrorHandler handler, boolean dynamic) {
            this.loader = loader;
            this.handler = handler;
            this.dynamic = dynamic;
        }
    
        public ServiceLoader(ClassLoader loader, LoadErrorHandler handler) {
            this(loader, handler, false);
        }
    
        public ServiceLoader(ClassLoader loader) {
            this(loader, LoadErrorHandler.IGNORE);
        }
    
        public ServiceLoader() {
            this(getContextClassLoader(), LoadErrorHandler.IGNORE, true);
        }
    
        /**
         * Returns an input stream for reading the specified resource from the
         * configured class loader.
         *
         * @param name resource name
         * @return input stream, or <code>null</code> if the resource was not found
         * @see ClassLoader#getResourceAsStream(String)
         * @since Apache Tika 1.1
         */
        public InputStream getResourceAsStream(String name) {
            if (loader != null) {
                return loader.getResourceAsStream(name);
            } else {
                return null;
            }
        }
    
        /**
         * Loads and returns the named service class that's expected to implement
         * the given interface.
         *
         * @param iface service interface
         * @param name service class name
         * @return service class
         * @throws ClassNotFoundException if the service class can not be found
         *                                or does not implement the given interface
         * @see Class#forName(String, boolean, ClassLoader)
         * @since Apache Tika 1.1
         */
        @SuppressWarnings("unchecked")
        public <T> Class<? extends T> getServiceClass(Class<T> iface, String name)
                throws ClassNotFoundException {
            if (loader == null) {
                throw new ClassNotFoundException(
                        "Service class " + name + " is not available");
            }
            Class<?> klass = Class.forName(name, true, loader);
            if (klass.isInterface()) {
                throw new ClassNotFoundException(
                        "Service class " + name + " is an interface");
            } else if (!iface.isAssignableFrom(klass)) {
                throw new ClassNotFoundException(
                        "Service class " + name
                        + " does not implement " + iface.getName());
            } else {
                return (Class<? extends T>) klass;
            }
        }
    
        /**
         * Returns all the available service resources matching the
         *  given pattern, such as all instances of tika-mimetypes.xml 
         *  on the classpath, or all org.apache.tika.parser.Parser 
         *  service files.
         */
        public Enumeration<URL> findServiceResources(String filePattern) {
           try {
              Enumeration<URL> resources = loader.getResources(filePattern);
              return resources;
           } catch (IOException ignore) {
              // We couldn't get the list of service resource files
              List<URL> empty = Collections.emptyList();
              return Collections.enumeration( empty );
          }
        }
    
        /**
         * Returns all the available service providers of the given type.
         *
         * @param iface service provider interface
         * @return available service providers
         */
        public <T> List<T> loadServiceProviders(Class<T> iface) {
            List<T> providers = new ArrayList<T>();
            providers.addAll(loadDynamicServiceProviders(iface));
            providers.addAll(loadStaticServiceProviders(iface));
            return providers;
        }
    
        /**
         * Returns the available dynamic service providers of the given type.
         * The returned list is newly allocated and may be freely modified
         * by the caller.
         *
         * @since Apache Tika 1.2
         * @param iface service provider interface
         * @return dynamic service providers
         */
        @SuppressWarnings("unchecked")
        public <T> List<T> loadDynamicServiceProviders(Class<T> iface) {
            List<T> providers = new ArrayList<T>();
    
            if (dynamic) {
                synchronized (services) {
                    for (Object service : services.values()) {
                        if (iface.isAssignableFrom(service.getClass())) {
                            providers.add((T) service);
                        }
                    }
                }
            }
    
            return providers;
        }
    
        /**
         * Returns the available static service providers of the given type.
         * The providers are loaded using the service provider mechanism using
         * the configured class loader (if any). The returned list is newly
         * allocated and may be freely modified by the caller.
         *
         * @since Apache Tika 1.2
         * @param iface service provider interface
         * @return static service providers
         */
        @SuppressWarnings("unchecked")
        public <T> List<T> loadStaticServiceProviders(Class<T> iface) {
            List<T> providers = new ArrayList<T>();
    
            if (loader != null) {
                List<String> names = new ArrayList<String>();
    
                String serviceName = iface.getName();
                Enumeration<URL> resources =
                        findServiceResources("META-INF/services/" + serviceName);
                for (URL resource : Collections.list(resources)) {
                    try {
                        collectServiceClassNames(resource, names);
                    } catch (IOException e) {
                        handler.handleLoadError(serviceName, e);
                    }
                }
    
                for (String name : names) {
                    try {
                        Class<?> klass = loader.loadClass(name);
                        if (iface.isAssignableFrom(klass)) {
                            providers.add((T) klass.newInstance());
                        }
                    } catch (Throwable t) {
                        handler.handleLoadError(name, t);
                    }
                }
            }
    
            return providers;
        }
    
        private static final Pattern COMMENT = Pattern.compile("#.*");
    
        private static final Pattern WHITESPACE = Pattern.compile("\\s+");
    
        private void collectServiceClassNames(URL resource, Collection<String> names)
                throws IOException {
            InputStream stream = resource.openStream();
            try {
                BufferedReader reader =
                    new BufferedReader(new InputStreamReader(stream, "UTF-8"));
                String line = reader.readLine();
                while (line != null) {
                    line = COMMENT.matcher(line).replaceFirst("");
                    line = WHITESPACE.matcher(line).replaceAll("");
                    if (line.length() > 0) {
                        names.add(line);
                    }
                    line = reader.readLine();
                }
            } finally {
                stream.close();
            }
        }
    
    }
    The main job of ServiceLoader is to load service classes, either dynamically or statically, via the loadDynamicServiceProviders(Class<T> iface) and loadStaticServiceProviders(Class<T> iface) methods respectively.

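    The private member of HtmlParser,
    static final ServiceLoader LOADER = new ServiceLoader(HtmlParser.class.getClassLoader()),
    effectively uses only the static loading path, loadStaticServiceProviders(Class<T> iface): this.dynamic is false, so loadDynamicServiceProviders returns an empty list.
    The static path reads the file at META-INF/services/org.apache.tika.detect.EncodingDetector inside the jar: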
    #  Licensed to the Apache Software Foundation (ASF) under one or more
    #  contributor license agreements.  See the NOTICE file distributed with
    #  this work for additional information regarding copyright ownership.
    #  The ASF licenses this file to You under the Apache License, Version 2.0
    #  (the "License"); you may not use this file except in compliance with
    #  the License.  You may obtain a copy of the License at
    #
    #       http://www.apache.org/licenses/LICENSE-2.0
    #
    #  Unless required by applicable law or agreed to in writing, software
    #  distributed under the License is distributed on an "AS IS" BASIS,
    #  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    #  See the License for the specific language governing permissions and
    #  limitations under the License.
    
    org.apache.tika.parser.html.HtmlEncodingDetector
    org.apache.tika.parser.txt.UniversalEncodingDetector
    org.apache.tika.parser.txt.Icu4jEncodingDetector

    Loading this file yields the list of encoding detector classes. Finally, AutoDetectReader calls static Charset detect(InputStream input, Metadata metadata, List<EncodingDetector> detectors) to determine the document's encoding.
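    To make the file format concrete, here is a small sketch of how such a provider file is reduced to a list of class names, following the same steps as collectServiceClassNames: strip everything after '#', remove whitespace, and keep non-empty lines. ServiceFileDemo and parseProviderNames are hypothetical names for this example. Unlike the real method, it reads from a String rather than a classpath URL and does not instantiate the classes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ServiceFileDemo {

    // Extracts provider class names from the text of a META-INF/services file.
    static List<String> parseProviderNames(String fileContent) throws IOException {
        List<String> names = new ArrayList<>();
        BufferedReader reader = new BufferedReader(new StringReader(fileContent));
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.replaceFirst("#.*", "");  // drop the comment part, if any
            line = line.replaceAll("\\s+", "");   // drop all whitespace
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        String content =
                "#  Licensed to the Apache Software Foundation (ASF)\n"
                + "\n"
                + "org.apache.tika.parser.html.HtmlEncodingDetector\n"
                + "  org.apache.tika.parser.txt.UniversalEncodingDetector  # trailing comment\n";
        for (String name : parseProviderNames(content)) {
            System.out.println(name);
        }
    }
}
```

    The JDK's own java.util.ServiceLoader uses the same META-INF/services naming convention, which is why the file is named after the fully qualified interface.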

    As for the TagSoup component, I am reposting a blog article here for reference:

    TagSoup Development Guide

    http://cactus-jing.iteye.com/blog/1070620

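    TagSoup may be unfamiliar to some readers. It is a small tool written in Java that uses a SAX engine to parse badly structured, maddeningly non-standard HTML documents. TagSoup can convert an HTML document into a well-structured XML document (close to XHTML), which makes the retrieved HTML easier for developers to process. It also ships with a command-line program that can be run to parse HTML documents.

    The drawback of TagSoup is that its official website (http://home.ccil.org/~cowan/XML/tagsoup/) links to no API documentation and offers no developer guide, only a 40-page slide deck (http://home.ccil.org/~cowan/XML/tagsoup/) from a talk at Extreme Markup Languages 2004. This makes integrating TagSoup into your own application quite a challenge.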

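    The development workflow with TagSoup:

    • Create a Parser instance;
    • Provide your own SAX2 content handler;
    • Provide an InputSource instance that points to the HTML to be parsed;
    • Call parse()!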

    With my limited ability, these few sentences left me completely confused, so I decided to study the library more closely.

    TagSoup contains 2 packages and 16 class files (few files, but very capable). The core classes include Parser, PYXScanner, and XMLWriter.

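    • org.ccil.cowan.tagsoup.Parser extends org.xml.sax.helpers.DefaultHandler, which tells us it is a SAX-style parser;
    • org.ccil.cowan.tagsoup.PYXScanner implements the Scanner interface and reads input in PYX format rather than HTML;
    • org.ccil.cowan.tagsoup.XMLWriter extends org.xml.sax.helpers.XMLFilterImpl and therefore acts as an org.xml.sax.ContentHandler (this is the important part). In other words, XMLWriter is the default implementation TagSoup provides for turning parsed HTML into an XML document.
    Having found these three core classes, let's parse following the workflow above. Here is a small example I wrote: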
     
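    import java.io.StringReader;
    import java.io.StringWriter;
    import org.ccil.cowan.tagsoup.Parser;
    import org.ccil.cowan.tagsoup.XMLWriter;
    import org.xml.sax.InputSource;

    // html is assumed to hold the raw HTML text to be cleaned up
    StringReader sr = new StringReader(html);
    InputSource src = new InputSource(sr);   // build the InputSource instance
    Parser parser = new Parser();            // instantiate the TagSoup parser
    StringWriter out = new StringWriter();   // collects the serialized output
    XMLWriter writer = new XMLWriter(out);   // SAX content handler that writes the events as XML
    parser.setContentHandler(writer);        // register the content handler
    parser.parse(src);                       // parse; the writer receives the SAX events
    // PYXScanner is not needed here: it scans PYX-format input, not the parser's output
    System.out.println(out.toString());      // print the well-formed document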
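    XMLWriter is only one possible content handler. The second step of the workflow ("provide your own SAX2 content handler") can be sketched with a handler that collects text and ignores markup. To keep the example runnable without extra jars, it uses the JDK's built-in SAX parser on well-formed input as a stand-in. TextCollectorDemo, TextCollector and extractText are hypothetical names for this example. With TagSoup on the classpath, the XMLReader could instead be a new org.ccil.cowan.tagsoup.Parser(), which implements the same XMLReader interface and also accepts malformed HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollectorDemo {

    // Content handler that gathers character data and ignores all markup.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Parses well-formed XHTML with the JDK SAX parser (a stand-in for TagSoup)
    // and returns the concatenated text content.
    static String extractText(String xhtml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        TextCollector collector = new TextCollector();
        reader.setContentHandler(collector);                     // step 2: own content handler
        reader.parse(new InputSource(new StringReader(xhtml)));  // steps 3 and 4
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello, <b>Tika</b>!</p></body></html>";
        System.out.println(extractText(xhtml)); // prints "Hello, Tika!"
    }
}
```

    This is also roughly how Tika itself consumes TagSoup: HtmlParser passes its own handler chain (XHTMLDowngradeHandler wrapping HtmlHandler) to parser.setContentHandler instead of an XMLWriter.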


  • Original article: https://www.cnblogs.com/chenying99/p/2948588.html