zoukankan      html  css  js  c++  java
  • Apache Tika源码研究(一)

    因为采用Apache Tika解析网页文件时产生乱码问题,所以后来仔细看了一下Apache Tika源码

    先浏览一下tika编码识别的相关接口和类的UML模型

    下面是编码识别接口,EncodingDetector.java

    public interface EncodingDetector {
    
        /**
         * Detects the character encoding of the given text document, or
         * <code>null</code> if the encoding of the document can not be detected.
         * <p>
         * If the document input stream is not available, then the first
         * argument may be <code>null</code>. Otherwise the detector may
         * read bytes from the start of the stream to help in encoding detection.
         * The given stream is guaranteed to support the
         * {@link InputStream#markSupported() mark feature} and the detector
         * is expected to {@link InputStream#mark(int) mark} the stream before
         * reading any bytes from it, and to {@link InputStream#reset() reset}
         * the stream before returning. The stream must not be closed by the
         * detector.
         * <p>
         * The given input metadata is only read, not modified, by the detector.
         *
         * @param input text document input stream, or <code>null</code>
         * @param metadata input metadata for the document
         * @return detected character encoding, or <code>null</code>
         * @throws IOException if the document input stream could not be read
         */
        Charset detect(InputStream input, Metadata metadata) throws IOException;
    
    }

    编码识别接口EncodingDetector的实现类有三,分别是HtmlEncodingDetector,UniversalEncodingDetector,和Icu4jEncodingDetector

    从三者的名称基本可以看出他们的功能或所用的组件,Tika默认的网页编码识别是存在问题的,当解析网页文件时,在网页的html元素里面编码有误的时候就会产生乱码。

    HtmlEncodingDetector类的源码如下:

    public class HtmlEncodingDetector implements EncodingDetector {
    
        // TIKA-357 - use bigger buffer for meta tag sniffing (was 4K)
        private static final int META_TAG_BUFFER_SIZE = 8192;
    
        private static final Pattern HTTP_EQUIV_PATTERN = Pattern.compile(
                "(?is)<meta\\s+http-equiv\\s*=\\s*['\\\"]\\s*"
                + "Content-Type['\\\"]\\s+content\\s*=\\s*['\\\"]"
                + "([^'\\\"]+)['\\\"]");
    
        private static final Pattern META_CHARSET_PATTERN = Pattern.compile(
                "(?is)<meta\\s+charset\\s*=\\s*['\\\"]([^'\\\"]+)['\\\"]");
    
        private static final Charset ASCII = Charset.forName("US-ASCII");
    
        public Charset detect(InputStream input, Metadata metadata)
                throws IOException {
            if (input == null) {
                return null;
            }
    
            // Read enough of the text stream to capture possible meta tags
            input.mark(META_TAG_BUFFER_SIZE);
            byte[] buffer = new byte[META_TAG_BUFFER_SIZE];
            int n = 0;
            int m = input.read(buffer);
            while (m != -1 && n < buffer.length) {
                n += m;
                m = input.read(buffer, n, buffer.length - n);
            }
            input.reset();
    
            // Interpret the head as ASCII and try to spot a meta tag with
            // a possible character encoding hint
            String charset = null;
            String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();
    
            Matcher equiv = HTTP_EQUIV_PATTERN.matcher(head);
            if (equiv.find()) {
                MediaType type = MediaType.parse(equiv.group(1));
                if (type != null) {
                    charset = type.getParameters().get("charset");
                }
            }
            if (charset == null) {
                // TIKA-892: HTML5 meta charset tag
                Matcher meta = META_CHARSET_PATTERN.matcher(head);
                if (meta.find()) {
                    charset = meta.group(1);
                }
            }
    
            if (charset != null) {
                try {
                    return CharsetUtils.forName(charset);
                } catch (Exception e) {
                    // ignore
                }
            }
    
            return null;
        }
    
    }

    如果需要正确的编码,需要改写

    public Charset detect(InputStream input, Metadata metadata)方法

    接下来分析另外一个重要的实现类UniversalEncodingDetector,从它的名称基本可以猜测到时采用的juniversalchardet组件,其代码如下:

    public class UniversalEncodingDetector implements EncodingDetector {
    
        private static final int BUFSIZE = 1024;
    
        private static final int LOOKAHEAD = 16 * BUFSIZE;
    
        public Charset detect(InputStream input, Metadata metadata)
                throws IOException {
            if (input == null) {
                return null;
            }
    
            input.mark(LOOKAHEAD);
            try {
                UniversalEncodingListener listener =
                        new UniversalEncodingListener(metadata);
    
                byte[] b = new byte[BUFSIZE];
                int n = 0;
                int m = input.read(b);
                while (m != -1 && n < LOOKAHEAD && !listener.isDone()) {
                    n += m;
                    listener.handleData(b, 0, m);
                    m = input.read(b, 0, Math.min(b.length, LOOKAHEAD - n));
                }
    
                return listener.dataEnd();
            } catch (IOException e) {
                throw e;
            } catch (Exception e) { // if juniversalchardet is not available
                return null;
            } finally {
                input.reset();
            }
        }
    
    }

    这里编码识别 是通过UniversalEncodingListener类,该类实现了CharsetListener接口,该接口是juniversalchardet组件的接口,代码如下:

    package org.mozilla.universalchardet;
    
    public interface CharsetListener
    {
        public void report(String charset);
    }

    该接口待实现的只有void report(String charset)方法

    UniversalEncodingListener类持有私有成员

     private final UniversalDetector detector = new UniversalDetector(this);

    这里的this即为其本身,我们可以猜测到detector对象是通过回调CharsetListener接口的void report(String charset)方法传回编码类型字符串的

    可以看到UniversalDetector类里面自带的测试代码:

    public static void main(String[] args) throws Exception
        {
            if (args.length != 1) {
                System.out.println("USAGE: java UniversalDetector filename");
                return;
            }
    
            UniversalDetector detector = new UniversalDetector(
                    new CharsetListener() {
                        public void report(String name)
                        {
                            System.out.println("charset = " + name);
                        }
                    }
                    );
            
            byte[] buf = new byte[4096];
            java.io.FileInputStream fis = new java.io.FileInputStream(args[0]);
            
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
            detector.dataEnd();
        }
    如果我们的程序采用UniversalEncodingDetector类来识别文件编码,代码怎么实现呢?下面是调用方法:
    public static void main(String[] args) throws IOException, TikaException {
            // TODO Auto-generated method stub
            File file=new File("[文件路径]");
            InputStream stream=null;        
            try
            {
                stream=new FileInputStream(file);
                EncodingDetector  detector=new UniversalEncodingDetector();
                Charset charset = detector.detect(new BufferedInputStream(stream), new Metadata());
                System.out.println("编码:"+charset.name());    
            }
            finally
            {
                if (stream != null)    stream.close();
            }
        }
     第三个类是Icu4jEncodingDetector,从名称可以看出是采用的IBM的Icu4j组件,代码如下:
    public class Icu4jEncodingDetector implements EncodingDetector {
    
        public Charset detect(InputStream input, Metadata metadata)
                throws IOException {
            if (input == null) {
                return null;
            }
    
            CharsetDetector detector = new CharsetDetector();
    
            String incomingCharset = metadata.get(Metadata.CONTENT_ENCODING);
            String incomingType = metadata.get(Metadata.CONTENT_TYPE);
            if (incomingCharset == null && incomingType != null) {
                // TIKA-341: Use charset in content-type
                MediaType mt = MediaType.parse(incomingType);
                if (mt != null) {
                    incomingCharset = mt.getParameters().get("charset");
                }
            }
    
            if (incomingCharset != null) {
                detector.setDeclaredEncoding(CharsetUtils.clean(incomingCharset));
            }
    
            // TIKA-341 without enabling input filtering (stripping of tags)
            // short HTML tests don't work well
            detector.enableInputFilter(true);
    
            detector.setText(input);
    
            for (CharsetMatch match : detector.detectAll()) {
                try {
                    return CharsetUtils.forName(match.getName());
                } catch (Exception e) {
                    // ignore
                }
            }
    
            return null;
        }
    
    }

    最关键的类是CharsetDetector,这里暂不进一步分析这个类了

    下面转帖网上的一篇博文  

    《使用ICU4J探测文档编码》

    http://blog.csdn.net/cnhome/article/details/6973343

    曾经使用过这个东东,还是不错的,中国人的一篇论文,最早的时候好像是在哪个开源浏览器里。 

    网页源码的编码探测一般有两种方式,一种是通过分析网页源码中Meta信息,比如contentType,来取得编码,但是某些网页不的contentType中不含任何编码信息,这时需要通过第二种方式进行探测,第二种是使用统计学和启发式方法对网页源码进行编码探测。ICU4J就是基于第二种方式的类库。由IBM提供。

    下面的例子演示了一个简单的探测过程。

    package com.huilan.dig.contoller;
    
    import java.io.IOException;
    import java.io.InputStream;
    
    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;
    
    /**
    * 本类使用ICU4J包进行文档编码获取
    *
    */
    public class EncodeDetector {
        /**
        * 获取编码
        * @throws IOException
        * @throws Exception
        */
        public static String getEncode(byte[] data,String url){
           CharsetDetector detector = new CharsetDetector();
           detector.setText(data);
           CharsetMatch match = detector.detect();
           String encoding = match.getName();
           System.out.println("The Content in " + match.getName());
           CharsetMatch[] matches = detector.detectAll();
           System.out.println("All possibilities");
           for (CharsetMatch m : matches) {
            System.out.println("CharsetName:" + m.getName() + " Confidence:"
              + m.getConfidence());
           }
           return encoding;
        }
        public static String getEncode(InputStream data,String url) throws IOException{
           CharsetDetector detector = new CharsetDetector();
           detector.setText(data);
           CharsetMatch match = detector.detect();
           String encoding = match.getName();
           System.out.println("The Content in " + match.getName());
           CharsetMatch[] matches = detector.detectAll();
           System.out.println("All possibilities");
           for (CharsetMatch m : matches) {
            System.out.println("CharsetName:" + m.getName() + " Confidence:"
              + m.getConfidence());
           }
           return encoding;
        }
    
    }
  • 相关阅读:
    Python 集合
    Python sorted()
    CodeForces 508C Anya and Ghosts
    CodeForces 496B Secret Combination
    CodeForces 483B Friends and Presents
    CodeForces 490C Hacking Cypher
    CodeForces 483C Diverse Permutation
    CodeForces 478C Table Decorations
    CodeForces 454C Little Pony and Expected Maximum
    CodeForces 313C Ilya and Matrix
  • 原文地址:https://www.cnblogs.com/chenying99/p/2947296.html
Copyright © 2011-2022 走看看