  • Using juniversalchardet to Fix Garbled Text in Web Crawlers

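    Crawlers frequently run into garbled text (mojibake). The simplest approach is to take the encoding from the HTTP response information. But if the target site's response carries no encoding information, or the information is wrong, the text the crawler retrieves is likely to come out garbled.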
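
As a rough illustration of the "simplest approach", the sketch below pulls the charset parameter out of a Content-Type header value using only the JDK. The class name `ContentTypeCharset` and the method `fromContentType` are made up for this example, and the parsing is deliberately naive (it just splits on `;`), so treat it as a starting point rather than a complete header parser.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of HttpClient or juniversalchardet.
class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type header value,
     * or an empty Optional if none is declared or the JVM does not know it.
     */
    public static Optional<Charset> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8"));
        System.out.println(fromContentType("text/html"));
    }
}
```

An empty result is the case in which a crawler has to fall back on detecting the encoding from the content itself.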

    A better solution is to detect the page's encoding automatically from its content, as the universalchardet detection library used by Mozilla's Firefox does.

    juniversalchardet is the Java port of universalchardet. Its source code is available at https://github.com/thkoch2001/juniversalchardet

    Automatic encoding detection relies mainly on statistical methods. For the underlying principles, see http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
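
The statistical analysis itself is too involved for a short example. As a much simpler illustration of judging an encoding from the bytes alone, the sketch below only asks whether a byte sequence decodes without errors under a candidate charset. This is not how juniversalchardet works internally, and a clean decode does not prove the guess is right (many single-byte encodings accept any input); the names `ValidityCheck` and `decodesCleanly` are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a strict-decoding check, not a statistical detector.
class ValidityCheck {

    /** Returns true if every byte sequence in data is legal in the given charset. */
    public static boolean decodesCleanly(byte[] data, Charset candidate) {
        try {
            candidate.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = {(byte) 0xE9}; // a lone 0xE9 is an incomplete sequence in UTF-8
        System.out.println(decodesCleanly(utf8, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(latin1, StandardCharsets.UTF_8));
    }
}
```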

    The following uses HttpClient, a library commonly used in Java crawlers, to show how to apply it. The key code is:

     
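    // Fragment from inside a method. Assumes myStream is the InputStream of the
    // HttpClient response and buffer is a byte buffer such as
    // org.apache.http.util.ByteArrayBuffer.
    UniversalDetector encDetector = new UniversalDetector(null);
    byte[] tmp = new byte[4096];
    int l;
    while ((l = myStream.read(tmp)) != -1) {
        buffer.append(tmp, 0, l);
        // Keep feeding the detector until it has made up its mind.
        if (!encDetector.isDone()) {
            encDetector.handleData(tmp, 0, l);
        }
    }
    encDetector.dataEnd();
    String encoding = encDetector.getDetectedCharset();
    // Reset before returning; otherwise reset() is skipped whenever detection succeeds.
    encDetector.reset();
    if (encoding != null) {
        return new String(buffer.toByteArray(), encoding);
    }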
    

      

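    myStream.read(tmp) reads the stream obtained from HttpClient. The idea is to run juniversalchardet over the data while the stream is being read. If an encoding with matching characteristics turns up, encDetector.getDetectedCharset() returns its name at the end; otherwise it returns null. Detection is then complete, and the String constructor is used to build the string with the detected encoding.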
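
Because the detected name may be null, or may name a charset the running JVM does not provide, it can help to decode through a small guard with a fallback. The sketch below uses only the JDK; `DetectedDecoder`, `decode`, and the choice of fallback are assumptions made for this example and are not part of juniversalchardet.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for turning a detected charset name into a String safely.
class DetectedDecoder {

    /**
     * Decodes body with the detected charset name when the JVM supports it,
     * and with the fallback charset otherwise (including when detectedName is null).
     */
    public static String decode(byte[] body, String detectedName, Charset fallback) {
        Charset charset = fallback;
        if (detectedName != null) {
            try {
                if (Charset.isSupported(detectedName)) {
                    charset = Charset.forName(detectedName);
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: keep the fallback.
            }
        }
        return new String(body, charset);
    }

    public static void main(String[] args) {
        byte[] body = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // In a crawler, the second argument would be the detector's result.
        System.out.println(decode(body, "UTF-8", StandardCharsets.ISO_8859_1).length());
        System.out.println(decode(body, null, StandardCharsets.UTF_8).length());
    }
}
```

Which fallback to use is a judgment call; the charset declared in the HTTP response, if any, is one reasonable candidate.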



    http://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet/1.0.3

    <!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
    <dependency>
        <groupId>com.googlecode.juniversalchardet</groupId>
        <artifactId>juniversalchardet</artifactId>
        <version>1.0.3</version>
    </dependency>
    

      

    https://code.google.com/archive/p/juniversalchardet/

    Java port of universalchardet

    1. What is it?

    juniversalchardet is a Java port of 'universalchardet', that is the encoding detector library of Mozilla.

    The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

    Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

    2. Encodings that can be detected

    • Chinese

      • ISO-2022-CN
      • BIG5
      • EUC-TW
      • GB18030
      • HZ-GB-2312 [1]
    • Cyrillic

      • ISO-8859-5
      • KOI8-R
      • WINDOWS-1251
      • MACCYRILLIC
      • IBM866
      • IBM855
    • Greek

      • ISO-8859-7
      • WINDOWS-1253
    • Hebrew

      • ISO-8859-8
      • WINDOWS-1255
    • Japanese

      • ISO-2022-JP
      • SHIFT_JIS
      • EUC-JP
    • Korean

      • ISO-2022-KR
      • EUC-KR
    • Unicode

      • UTF-8
      • UTF-16BE / UTF-16LE
      • UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 [1] / X-ISO-10646-UCS-4-2143 [1]
    • Others

      • WINDOWS-1252

    [1] Currently not supported by Java

    3. How to use it

    1. Construct an instance of org.mozilla.universalchardet.UniversalDetector.
    2. Feed some data (typically several thousand bytes) to the detector by calling UniversalDetector.handleData().
    3. Notify the detector of the end of data by calling UniversalDetector.dataEnd().
    4. Get the detected encoding name by calling UniversalDetector.getDetectedCharset().
    5. Don't forget to call UniversalDetector.reset() before you reuse the detector instance.

    Sample Code

    ```
    import org.mozilla.universalchardet.UniversalDetector;

    public class TestDetector {
      public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        String fileName = args[0];
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1)
        UniversalDetector detector = new UniversalDetector(null);

        // (2)
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
          detector.handleData(buf, 0, nread);
        }
        // (3)
        detector.dataEnd();

        // (4)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
          System.out.println("Detected encoding = " + encoding);
        } else {
          System.out.println("No encoding detected.");
        }

        // (5)
        detector.reset();
      }
    }
    ```

    4. Related Works

    jchardet

    • http://jchardet.sourceforge.net/ jchardet is another Java port of Mozilla's encoding detection library. The main difference between jchardet and juniversalchardet is the module each is based on: jchardet is based on the 'chardet' module that has long existed, while juniversalchardet is based on the newer 'universalchardet' module, which generally provides better detection accuracy.

    5. License

    The library is subject to the Mozilla Public License Version 1.1. Alternatively, the library may be used under the terms of either the GNU General Public License Version 2 or later, or the GNU Lesser General Public License 2.1 or later.

  • Original article: https://www.cnblogs.com/lhp2012/p/6888318.html