  • Fixing garbled text in crawlers with juniversalchardet

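            Crawlers often run into garbled text (mojibake). The simplest approach is to take the encoding from the HTTP response. But if the target site's response carries no encoding information, or the wrong one, the text the crawler fetches is very likely to come out garbled.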
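As a rough illustration of the header-based approach, the sketch below pulls the `charset` parameter out of a `Content-Type` header value. The class and method names (`ContentTypeCharset`, `parse`) are hypothetical, and the parsing is deliberately simplified (for example, it does not handle spaces around `=`). It returns null when the header carries no charset, which is exactly the case where content-based detection becomes necessary.

```java
import java.util.Locale;

// Hypothetical helper, not part of httpclient or juniversalchardet.
final class ContentTypeCharset {

    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String parse(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters are separated by ';', e.g. "text/html; charset=GBK".
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

Even when this returns a name, the server may have declared it incorrectly, so the result is best treated as a hint rather than the final answer.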

           A better solution is to determine the page's encoding automatically from its content, as Mozilla's Firefox does with its universalchardet encoding detector.

           juniversalchardet is the Java port of universalchardet. Its source is available at https://github.com/thkoch2001/juniversalchardet

           Automatic detection relies mainly on statistical methods. For the underlying principles, see http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

           The following uses httpclient, a common choice for Java crawlers, to show how it is used. Here is the key code:

     
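    // Assumed to be declared earlier: InputStream myStream, byte[] tmp, int l,
    // and buffer, a growable byte buffer such as org.apache.http.util.ByteArrayBuffer.
    UniversalDetector encDetector = new UniversalDetector(null);
    while ((l = myStream.read(tmp)) != -1) {
        buffer.append(tmp, 0, l);               // keep every byte for decoding later
        if (!encDetector.isDone()) {
            encDetector.handleData(tmp, 0, l);  // feed the detector until it is confident
        }
    }
    encDetector.dataEnd();                      // no more data: finalize the guess
    String encoding = encDetector.getDetectedCharset();
    encDetector.reset();                        // reset before returning so it always runs
    if (encoding != null) {
        return new String(buffer.toByteArray(), encoding);
    }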
    

      

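    myStream.read(tmp) reads the stream obtained from httpclient. The idea is to run juniversalchardet on the data while the stream is being read. If the data matches the characteristics of a known encoding, encDetector.getDetectedCharset() returns its name at the end; otherwise it returns null. Detection is then complete, and the String constructor builds the string using the detected encoding.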
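The snippet above only decodes when a name was detected, and `new String(bytes, name)` throws `UnsupportedEncodingException` if the JVM does not know that name. One possible way to handle both cases is sketched below: use the detected name if the JVM supports it, otherwise the name from the HTTP header, otherwise UTF-8. The class, method, and parameter names (`SafeDecode`, `chooseCharset`, `fromHeader`) are hypothetical, and the UTF-8 default is an assumption rather than anything the library prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; only java.nio.charset is used.
final class SafeDecode {

    /** Picks the first usable charset: detected name, then header name, then UTF-8. */
    static Charset chooseCharset(String detected, String fromHeader) {
        for (String name : new String[] {detected, fromHeader}) {
            if (name == null) {
                continue;
            }
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // The name is syntactically illegal; try the next candidate.
            }
        }
        return StandardCharsets.UTF_8; // assumed default
    }

    /** Decodes the raw page bytes with the chosen charset. */
    static String decode(byte[] body, String detected, String fromHeader) {
        return new String(body, chooseCharset(detected, fromHeader));
    }
}
```

With this in place, the return statement in the crawler code could become `return SafeDecode.decode(buffer.toByteArray(), encoding, headerCharset);`, where `headerCharset` is whatever the response declared (possibly null).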



    http://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet/1.0.3

    <!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
    <dependency>
        <groupId>com.googlecode.juniversalchardet</groupId>
        <artifactId>juniversalchardet</artifactId>
        <version>1.0.3</version>
    </dependency>
    

      

    https://code.google.com/archive/p/juniversalchardet/

    Java port of universalchardet

    1. What is it?

    juniversalchardet is a Java port of 'universalchardet', the encoding detector library of Mozilla.

    The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

    Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

    2. Encodings that can be detected

    • Chinese

      • ISO-2022-CN
      • BIG5
      • EUC-TW
      • GB18030
      • HZ-GB-2312 [1]
    • Cyrillic

      • ISO-8859-5
      • KOI8-R
      • WINDOWS-1251
      • MACCYRILLIC
      • IBM866
      • IBM855
    • Greek

      • ISO-8859-7
      • WINDOWS-1253
    • Hebrew

      • ISO-8859-8
      • WINDOWS-1255
    • Japanese

      • ISO-2022-JP
      • SHIFT_JIS
      • EUC-JP
    • Korean

      • ISO-2022-KR
      • EUC-KR
    • Unicode

      • UTF-8
      • UTF-16BE / UTF-16LE
      • UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 [1] / X-ISO-10646-UCS-4-2143 [1]
    • Others

      • WINDOWS-1252

    [1] Currently not supported by Java
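    For the common Unicode entries in this list, a byte-order mark (BOM) at the start of the data, when present, identifies the encoding without any statistics. The sketch below shows that check on its own; it is an illustration only, not the library's code, and the names `BomSniffer` and `fromBom` are hypothetical. The unusual UCS-4 byte orders from the list are left out.

```java
// Hypothetical helper illustrating BOM-based identification.
final class BomSniffer {

    /** Returns the encoding implied by a leading BOM, or null if there is none. */
    static String fromBom(byte[] data) {
        if (data == null) {
            return null;
        }
        // Check 4-byte marks first: the UTF-32LE mark begins with the UTF-16LE mark.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare bytes as unsigned values.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

    Most web pages carry no BOM, so a null result here is the normal case and is where content-based detection takes over.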

    3. How to use it

    1. Construct an instance of org.mozilla.universalchardet.UniversalDetector.
    2. Feed some data (typically several thousand bytes) to the detector by calling UniversalDetector.handleData().
    3. Notify the detector of the end of data by calling UniversalDetector.dataEnd().
    4. Get the detected encoding name by calling UniversalDetector.getDetectedCharset().
    5. Don't forget to call UniversalDetector.reset() before you reuse the detector instance.

    Sample Code

    ```
    import org.mozilla.universalchardet.UniversalDetector;

    public class TestDetector {
      public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        String fileName = args[0];
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1)
        UniversalDetector detector = new UniversalDetector(null);

        // (2)
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
          detector.handleData(buf, 0, nread);
        }
        // (3)
        detector.dataEnd();

        // (4)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
          System.out.println("Detected encoding = " + encoding);
        } else {
          System.out.println("No encoding detected.");
        }

        // (5)
        detector.reset();
      }
    }
    ```

    4. Related Works

    jchardet

    • http://jchardet.sourceforge.net/ jchardet is another Java port of Mozilla's encoding detection library. The main difference between jchardet and juniversalchardet is the module each is based on. jchardet is based on the long-standing 'chardet' module; juniversalchardet is based on the newer 'universalchardet' module, which generally gives more accurate detection results.

    5. License

    The library is subject to the Mozilla Public License Version 1.1. Alternatively, the library may be used under the terms of either the GNU General Public License Version 2 or later, or the GNU Lesser General Public License 2.1 or later.

  • Original article: https://www.cnblogs.com/lhp2012/p/6888318.html