Fixing garbled text in web crawlers with juniversalchardet

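Crawlers often run into garbled text (mojibake). The simplest fix is to take the encoding from the HTTP response. But if the site's response carries no encoding information, or carries the wrong one, the fetched content is likely to come out garbled.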
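As a rough illustration of the header-based approach, the sketch below pulls the charset parameter out of a Content-Type header value. The class and method names are made up for this example and only the JDK is used. It handles the common `charset=...` form, not every variation a server might send.

```java
import java.util.Locale;

final class ContentTypeCharset {
    // Returns the charset parameter of a Content-Type value, or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=GBK")); // prints GBK
        System.out.println(charsetOf("text/html"));              // prints null
    }
}
```

A null result here is exactly the case where content-based detection becomes necessary.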

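A better solution is to detect the page's encoding automatically from its content, as Mozilla Firefox does with its universalchardet encoding detector.

juniversalchardet is the Java port of universalchardet. Its source is available at https://github.com/thkoch2001/juniversalchardet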

Automatic detection relies mainly on statistical methods. For the details, see http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
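The statistical models themselves are too large for a short example, but a much cruder signal is easy to show: whether the bytes are even legal in a candidate encoding. The sketch below is not the library's algorithm. It is a JDK-only check with a hypothetical class name, and passing it does not prove the encoding, since some single-byte charsets accept almost any byte sequence. That ambiguity is why a real detector needs statistics.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecodeCheck {
    // True if the whole byte array decodes in the given charset without errors.
    static boolean isValidIn(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+4E2D U+6587 ("Chinese text") encoded as UTF-8.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // The same two characters encoded as GBK, written out as raw bytes.
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        System.out.println(isValidIn(utf8, StandardCharsets.UTF_8)); // prints true
        System.out.println(isValidIn(gbk, StandardCharsets.UTF_8));  // prints false
    }
}
```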

The following shows how to use it with httpclient, a library commonly used in Java crawlers. The key code is:
    UniversalDetector encDetector = new UniversalDetector(null);
    // Read the response stream, feeding each chunk to the detector until it is done.
    while ((l = myStream.read(tmp)) != -1) {
        buffer.append(tmp, 0, l);
        if (!encDetector.isDone()) {
            encDetector.handleData(tmp, 0, l);
        }
    }
    encDetector.dataEnd();
    String encoding = encDetector.getDetectedCharset();
    if (encoding != null) {
        return new String(buffer.toByteArray(), encoding);
    }
    // Nothing was detected: reset the detector before it is reused.
    encDetector.reset();

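myStream.read(tmp) reads from the stream obtained through httpclient. The idea is to run juniversalchardet over the bytes as they are read. If the data matches the signature of a known encoding, encDetector.getDetectedCharset() returns its name once dataEnd() has been called; otherwise it returns null. That completes detection, and the String constructor then builds the string using the detected encoding.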
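Because the detected name can be null, and the project page below marks a few detectable encodings as not supported by Java, it may help to fall back through several candidates before decoding. The sketch below is one possible arrangement, not part of juniversalchardet. The class, the method names, and the final UTF-8 default are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class CharsetFallback {
    // Returns the first candidate this JVM can decode, or UTF-8 if none qualifies.
    static Charset firstUsable(String... candidates) {
        for (String name : candidates) {
            if (name == null || name.isEmpty()) {
                continue; // e.g. the detector found nothing
            }
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalCharsetNameException e) {
                // Malformed name (for example from a broken header): try the next one.
            }
        }
        return StandardCharsets.UTF_8; // assumed default for this sketch
    }

    // Prefers the detected encoding, then the one declared by the server.
    static String decode(byte[] body, String detected, String declared) {
        return new String(body, firstUsable(detected, declared));
    }
}
```

Putting the detected name ahead of the declared one reflects the article's premise that response headers can be missing or wrong; a crawler that trusts its servers more could swap the order.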
Maven dependency: http://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet/1.0.3

    <dependency>
        <groupId>com.googlecode.juniversalchardet</groupId>
        <artifactId>juniversalchardet</artifactId>
        <version>1.0.3</version>
    </dependency>

https://code.google.com/archive/p/juniversalchardet/

Java port of universalchardet

1. What is it?

juniversalchardet is a Java port of 'universalchardet', Mozilla's encoding detector library.

The original code of universalchardet is available at http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

Techniques used by universalchardet are described at http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

2. Encodings that can be detected
- Chinese
  - ISO-2022-CN
  - BIG5
  - EUC-TW
  - GB18030
  - HZ-GB-2312¹
- Cyrillic
  - ISO-8859-5
  - KOI8-R
  - WINDOWS-1251
  - MACCYRILLIC
  - IBM866
  - IBM855
- Greek
  - ISO-8859-7
  - WINDOWS-1253
- Hebrew
  - ISO-8859-8
  - WINDOWS-1255
- Japanese
  - ISO-2022-JP
  - SHIFT_JIS
  - EUC-JP
- Korean
  - ISO-2022-KR
  - EUC-KR
- Unicode
  - UTF-8
  - UTF-16BE / UTF-16LE
  - UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412¹ / X-ISO-10646-UCS-4-2143¹
- Others
  - WINDOWS-1252

¹ Currently not supported by Java

3. How to use it
1. Construct an instance of org.mozilla.universalchardet.UniversalDetector.
2. Feed some data (typically several thousand bytes) to the detector by calling UniversalDetector.handleData().
3. Notify the detector of the end of data by calling UniversalDetector.dataEnd().
4. Get the detected encoding name by calling UniversalDetector.getDetectedCharset().
5. Don't forget to call UniversalDetector.reset() before you reuse the detector instance.
Sample Code

```
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
  public static void main(String[] args) throws java.io.IOException {
    byte[] buf = new byte[4096];
    String fileName = args[0];
    java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

    // (1)
    UniversalDetector detector = new UniversalDetector(null);

    // (2)
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
      detector.handleData(buf, 0, nread);
    }
    // (3)
    detector.dataEnd();

    // (4)
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
    }

    // (5)
    detector.reset();
  }
}
```

4. Related Works

jchardet
- http://jchardet.sourceforge.net/ jchardet is another Java port of Mozilla's encoding detection library. The main difference between jchardet and juniversalchardet is the modules they are based on. jchardet is based on the long-standing 'chardet' module; juniversalchardet is based on the newer 'universalchardet' module, which generally gives more accurate detection results.
5. License

The library is subject to the Mozilla Public License Version 1.1. Alternatively, the library may be used under the terms of either the GNU General Public License Version 2 or later, or the GNU Lesser General Public License 2.1 or later.

Original article: https://www.cnblogs.com/lhp2012/p/6888318.html
