zoukankan html css js c++ java

Apache Tika源码研究（一）

因为采用Apache Tika解析网页文件时产生乱码问题，所以后来仔细看了一下Apache Tika源码

先浏览一下tika编码识别的相关接口和类的UML模型

下面是编码识别接口，EncodingDetector.java

public interface EncodingDetector {

    /**
     * Detects the character encoding of the given text document, or
     * <code>null</code> if the encoding of the document can not be detected.
     * <p>
     * If the document input stream is not available, then the first
     * argument may be <code>null</code>. Otherwise the detector may
     * read bytes from the start of the stream to help in encoding detection.
     * The given stream is guaranteed to support the
     * {@link InputStream#markSupported() mark feature} and the detector
     * is expected to {@link InputStream#mark(int) mark} the stream before
     * reading any bytes from it, and to {@link InputStream#reset() reset}
     * the stream before returning. The stream must not be closed by the
     * detector.
     * <p>
     * The given input metadata is only read, not modified, by the detector.
     *
     * @param input text document input stream, or <code>null</code>
     * @param metadata input metadata for the document
     * @return detected character encoding, or <code>null</code>
     * @throws IOException if the document input stream could not be read
     */
    Charset detect(InputStream input, Metadata metadata) throws IOException;

}

编码识别接口EncodingDetector的实现类有三，分别是HtmlEncodingDetector，UniversalEncodingDetector，和Icu4jEncodingDetector

从三者的名称基本可以看出他们的功能或所用的组件，Tika默认的网页编码识别是存在问题的，当解析网页文件时，在网页的html元素里面编码有误的时候就会产生乱码。

HtmlEncodingDetector类的源码如下：

public class HtmlEncodingDetector implements EncodingDetector {

    // TIKA-357 - use bigger buffer for meta tag sniffing (was 4K)
    private static final int META_TAG_BUFFER_SIZE = 8192;

    private static final Pattern HTTP_EQUIV_PATTERN = Pattern.compile(
            "(?is)<meta\\s+http-equiv\\s*=\\s*['\\\"]\\s*"
            + "Content-Type['\\\"]\\s+content\\s*=\\s*['\\\"]"
            + "([^'\\\"]+)['\\\"]");

    private static final Pattern META_CHARSET_PATTERN = Pattern.compile(
            "(?is)<meta\\s+charset\\s*=\\s*['\\\"]([^'\\\"]+)['\\\"]");

    private static final Charset ASCII = Charset.forName("US-ASCII");

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;
        }

        // Read enough of the text stream to capture possible meta tags
        input.mark(META_TAG_BUFFER_SIZE);
        byte[] buffer = new byte[META_TAG_BUFFER_SIZE];
        int n = 0;
        int m = input.read(buffer);
        while (m != -1 && n < buffer.length) {
            n += m;
            m = input.read(buffer, n, buffer.length - n);
        }
        input.reset();

        // Interpret the head as ASCII and try to spot a meta tag with
        // a possible character encoding hint
        String charset = null;
        String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString();

        Matcher equiv = HTTP_EQUIV_PATTERN.matcher(head);
        if (equiv.find()) {
            MediaType type = MediaType.parse(equiv.group(1));
            if (type != null) {
                charset = type.getParameters().get("charset");
            }
        }
        if (charset == null) {
            // TIKA-892: HTML5 meta charset tag
            Matcher meta = META_CHARSET_PATTERN.matcher(head);
            if (meta.find()) {
                charset = meta.group(1);
            }
        }

        if (charset != null) {
            try {
                return CharsetUtils.forName(charset);
            } catch (Exception e) {
                // ignore
            }
        }

        return null;
    }

}

如果需要正确的编码，需要改写

public Charset detect(InputStream input, Metadata metadata)方法

接下来分析另外一个重要的实现类UniversalEncodingDetector，从它的名称基本可以猜测到时采用的juniversalchardet组件，其代码如下：

public class UniversalEncodingDetector implements EncodingDetector {

    private static final int BUFSIZE = 1024;

    private static final int LOOKAHEAD = 16 * BUFSIZE;

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;
        }

        input.mark(LOOKAHEAD);
        try {
            UniversalEncodingListener listener =
                    new UniversalEncodingListener(metadata);

            byte[] b = new byte[BUFSIZE];
            int n = 0;
            int m = input.read(b);
            while (m != -1 && n < LOOKAHEAD && !listener.isDone()) {
                n += m;
                listener.handleData(b, 0, m);
                m = input.read(b, 0, Math.min(b.length, LOOKAHEAD - n));
            }

            return listener.dataEnd();
        } catch (IOException e) {
            throw e;
        } catch (Exception e) { // if juniversalchardet is not available
            return null;
        } finally {
            input.reset();
        }
    }

}

这里编码识别是通过UniversalEncodingListener类，该类实现了CharsetListener接口，该接口是juniversalchardet组件的接口，代码如下：

package org.mozilla.universalchardet;

public interface CharsetListener
{
    public void report(String charset);
}

该接口待实现的只有void report(String charset)方法

UniversalEncodingListener类持有私有成员

private final UniversalDetector detector = new UniversalDetector(this);

这里的this即为其本身，我们可以猜测到detector对象是通过回调CharsetListener接口的void report(String charset)方法传回编码类型字符串的

可以看到UniversalDetector类里面自带的测试代码：

public static void main(String[] args) throws Exception
    {
        if (args.length != 1) {
            System.out.println("USAGE: java UniversalDetector filename");
            return;
        }

        UniversalDetector detector = new UniversalDetector(
                new CharsetListener() {
                    public void report(String name)
                    {
                        System.out.println("charset = " + name);
                    }
                }
                );
        
        byte[] buf = new byte[4096];
        java.io.FileInputStream fis = new java.io.FileInputStream(args[0]);
        
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        detector.dataEnd();
    }

如果我们的程序采用UniversalEncodingDetector类来识别文件编码，代码怎么实现呢？下面是调用方法：

public static void main(String[] args) throws IOException, TikaException {
        // TODO Auto-generated method stub
        File file=new File("[文件路径]");
        InputStream stream=null;        
        try
        {
            stream=new FileInputStream(file);
            EncodingDetector  detector=new UniversalEncodingDetector();
            Charset charset = detector.detect(new BufferedInputStream(stream), new Metadata());
            System.out.println("编码："+charset.name());    
        }
        finally
        {
            if (stream != null)    stream.close();
        }
    }

 第三个类是Icu4jEncodingDetector，从名称可以看出是采用的IBM的Icu4j组件，代码如下：

public class Icu4jEncodingDetector implements EncodingDetector {

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;
        }

        CharsetDetector detector = new CharsetDetector();

        String incomingCharset = metadata.get(Metadata.CONTENT_ENCODING);
        String incomingType = metadata.get(Metadata.CONTENT_TYPE);
        if (incomingCharset == null && incomingType != null) {
            // TIKA-341: Use charset in content-type
            MediaType mt = MediaType.parse(incomingType);
            if (mt != null) {
                incomingCharset = mt.getParameters().get("charset");
            }
        }

        if (incomingCharset != null) {
            detector.setDeclaredEncoding(CharsetUtils.clean(incomingCharset));
        }

        // TIKA-341 without enabling input filtering (stripping of tags)
        // short HTML tests don't work well
        detector.enableInputFilter(true);

        detector.setText(input);

        for (CharsetMatch match : detector.detectAll()) {
            try {
                return CharsetUtils.forName(match.getName());
            } catch (Exception e) {
                // ignore
            }
        }

        return null;
    }

}

最关键的类是CharsetDetector，这里暂不进一步分析这个类了

下面转帖网上的一篇博文

《使用ICU4J探测文档编码》

http://blog.csdn.net/cnhome/article/details/6973343

曾经使用过这个东东，还是不错的，中国人的一篇论文，最早的时候好像是在哪个开源浏览器里。

网页源码的编码探测一般有两种方式，一种是通过分析网页源码中Meta信息，比如contentType，来取得编码，但是某些网页不的contentType中不含任何编码信息，这时需要通过第二种方式进行探测，第二种是使用统计学和启发式方法对网页源码进行编码探测。ICU4J就是基于第二种方式的类库。由IBM提供。

下面的例子演示了一个简单的探测过程。

package com.huilan.dig.contoller;

import java.io.IOException;
import java.io.InputStream;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

/**
* 本类使用ICU4J包进行文档编码获取
*
*/
public class EncodeDetector {
    /**
    * 获取编码
    * @throws IOException
    * @throws Exception
    */
    public static String getEncode(byte[] data,String url){
       CharsetDetector detector = new CharsetDetector();
       detector.setText(data);
       CharsetMatch match = detector.detect();
       String encoding = match.getName();
       System.out.println("The Content in " + match.getName());
       CharsetMatch[] matches = detector.detectAll();
       System.out.println("All possibilities");
       for (CharsetMatch m : matches) {
        System.out.println("CharsetName:" + m.getName() + " Confidence:"
          + m.getConfidence());
       }
       return encoding;
    }
    public static String getEncode(InputStream data,String url) throws IOException{
       CharsetDetector detector = new CharsetDetector();
       detector.setText(data);
       CharsetMatch match = detector.detect();
       String encoding = match.getName();
       System.out.println("The Content in " + match.getName());
       CharsetMatch[] matches = detector.detectAll();
       System.out.println("All possibilities");
       for (CharsetMatch m : matches) {
        System.out.println("CharsetName:" + m.getName() + " Confidence:"
          + m.getConfidence());
       }
       return encoding;
    }

}

查看全文

相关阅读:
win2008R2超过激活期，但是一个小时之后就会自动强制关机的问题
 SQL数据库中出现大量用户登录错误怎么解决
 VMware上安装centos 6.3 不能联网问题
 最简单的正则表达式
 双向认证SSL原理
 【虚拟机】在VMware中安装Server2008之后配置网络连接的几种方式
 win7 windows server 2008R2下 https SSL证书安装的搭配(搭配https ssl本地测试环境)
使用word优雅的添加代码并上传到博客园（附：批量取消图片超链接方法）
IIS服务器多域证书使用443端口解决方案
 服务器证书安装配置指南（IIS7）

原文地址：https://www.cnblogs.com/chenying99/p/2947296.html