
Web Crawler Crash Course (3): Encoding Detection

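The problem:
The method from the previous section occasionally downloads garbled HTML. That code only did simple encoding detection, for example from the HTTP response headers
or from the charset in a meta tag: <meta http-equiv="Content-type" content="text/html; charset=gb2312" />.
Even with those checks, garbled downloads still happen, because the charset reported by either source can be wrong.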
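
To make the "simple detection" concrete, here is a minimal sketch of reading the declared charset from a Content-Type header value or a meta tag. It is not the code from the previous section; the class name DeclaredCharset and the regular expression are assumptions made for illustration. It only reports what the page claims, which is exactly the value that may be wrong.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the charset a page *declares* about itself.
final class DeclaredCharset {

    // Matches charset=xxx in a header value or a <meta> tag; quotes are optional.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first declared charset name found in the text, if any.
    static Optional<String> find(String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String header = "text/html; charset=gb2312";
        String meta = "<meta http-equiv=\"Content-type\" content=\"text/html; charset=gb2312\" />";
        System.out.println(find(header).orElse("none"));
        System.out.println(find(meta).orElse("none"));
        System.out.println(find("text/html").orElse("none"));
    }
}
```

A declared value like this is best treated as a hint that is checked against the bytes, not as the final answer.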
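Solutions:
1 Specify the encoding manually
2 Detect the encoding automatically
If you only crawl a single site, specifying its encoding by hand is enough,
but for large-scale crawling you cannot specify the encoding one site at a time.
This section introduces two packages for automatic encoding detection.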
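
The two approaches can be combined. The sketch below assumes a hand-maintained table of per-host overrides, with automatic detection used only for hosts that are not in the table. CharsetResolver, its fields, and the detector function are hypothetical names; the detector is passed in as a plain function so that either of the packages shown in this section could be plugged in.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical resolver: manual override first, then detection, then a default.
final class CharsetResolver {

    private final Map<String, String> overrides;      // host -> charset name, set by hand
    private final Function<byte[], String> detector;  // may return null when unsure
    private final String fallback;                    // used when nothing else applies

    CharsetResolver(Map<String, String> overrides,
                    Function<byte[], String> detector,
                    String fallback) {
        this.overrides = overrides;
        this.detector = detector;
        this.fallback = fallback;
    }

    String resolve(String host, byte[] body) {
        String manual = overrides.get(host);
        if (manual != null) {
            return manual;                            // solution 1: manual
        }
        String detected = detector.apply(body);       // solution 2: automatic
        return detected != null ? detected : fallback;
    }

    public static void main(String[] args) {
        // The detector here is a stub that never finds anything.
        CharsetResolver r = new CharsetResolver(
                Map.of("example.com", "GB2312"), bytes -> null, "UTF-8");
        System.out.println(r.resolve("example.com", new byte[0]));
        System.out.println(r.resolve("example.org", new byte[0]));
    }
}
```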

Below are two Java encoding-detection packages with usage examples. .NET has similar packages, but I have forgotten their names.

Reference:
http://code.google.com/p/juniversalchardet/



package cn.tdt.crawl.encoding;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.mozilla.universalchardet.UniversalDetector;

public class DetectorDemo {

    public static void main(String[] args) throws IOException {

        String fileName = "F:/qq.txt";
        File f = new File(fileName);

        // Method 1 (ICU4J): read the whole file into memory and hand it to Icu4jDetector.
        // byte[] data = java.nio.file.Files.readAllBytes(f.toPath());
        // String encoding = Icu4jDetector.getEncode(data);
        // System.out.println(encoding);

        // Method 2 (juniversalchardet): feed the detector in chunks until it is done.
        byte[] buf = new byte[4096];
        UniversalDetector detector = new UniversalDetector(null);
        // try-with-resources closes the stream even if reading fails.
        try (InputStream fis = new FileInputStream(f)) {
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        // Signal end of input so the detector can make its final decision.
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }
        // Reset so the same detector instance can be reused for another input.
        detector.reset();

    }

}
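
Once a charset name has been detected, the downloaded bytes still have to be decoded with it. The following is a minimal sketch of that step, assuming UTF-8 as the fallback when the name is null (nothing detected) or not supported by the JVM. SafeDecode is a hypothetical helper and is not part of juniversalchardet.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes bytes with a detected charset name, falling back to UTF-8.
final class SafeDecode {

    static String decode(byte[] body, String detectedName) {
        Charset cs = StandardCharsets.UTF_8;
        try {
            if (detectedName != null && Charset.isSupported(detectedName)) {
                cs = Charset.forName(detectedName);
            }
        } catch (IllegalArgumentException e) {
            // Syntactically illegal charset name: keep the UTF-8 fallback.
        }
        return new String(body, cs);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9};                     // "é" in ISO-8859-1
        System.out.println(decode(latin1, "ISO-8859-1"));
        System.out.println(decode("abc".getBytes(StandardCharsets.UTF_8), null));
    }
}
```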

package cn.tdt.crawl.encoding;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class Icu4jDetector {

    // Detects the encoding of raw bytes; returns null if ICU4J finds no match.
    public static String getEncode(byte[] data) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        return report(detector);
    }

    // Detects the encoding of a stream. The url parameter is not used by the detection itself.
    public static String getEncode(InputStream data, String url) throws IOException {
        CharsetDetector detector = new CharsetDetector();
        // setText(InputStream) relies on mark/reset, so wrap streams that do not support it.
        InputStream in = data.markSupported() ? data : new BufferedInputStream(data);
        detector.setText(in);
        return report(detector);
    }

    // Prints the best match and every candidate, then returns the best match's name.
    private static String report(CharsetDetector detector) {
        CharsetMatch match = detector.detect();
        if (match == null) {
            // detect() returns null when no charset matches the input.
            System.out.println("No encoding detected.");
            return null;
        }
        System.out.println("The Content in " + match.getName());
        System.out.println("All possibilities");
        for (CharsetMatch m : detector.detectAll()) {
            System.out.println("CharsetName:" + m.getName() + " Confidence:" + m.getConfidence());
        }
        return match.getName();
    }

}
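
The confidence values printed above (integers from 0 to 100 in ICU4J) can also drive the decision instead of just being logged. The sketch below accepts the best candidate only if it reaches a minimum confidence and otherwise returns a fallback. Candidate is a hypothetical stand-in for a (name, confidence) pair so that the example runs without ICU4J on the classpath, and the threshold of 50 is an arbitrary assumption, not a recommended value.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical selection step: accept the best candidate only above a threshold.
final class ConfidencePick {

    // Stand-in for a detected charset name with its confidence (0-100).
    static final class Candidate {
        final String name;
        final int confidence;

        Candidate(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    static String pick(List<Candidate> candidates, int minConfidence, String fallback) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null || c.confidence > best.confidence) {
                best = c;
            }
        }
        return (best != null && best.confidence >= minConfidence) ? best.name : fallback;
    }

    public static void main(String[] args) {
        List<Candidate> found = Arrays.asList(
                new Candidate("GB18030", 82), new Candidate("Big5", 10));
        System.out.println(pick(found, 50, "UTF-8"));
        System.out.println(pick(found, 90, "UTF-8"));
    }
}
```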


Original article: https://www.cnblogs.com/i80386/p/3255075.html
