Fixing Garbled Text When Scraping Chinese Pages with HtmlAgilityPack

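    HtmlAgilityPack is an open-source HTML parser written in C#. Some aspects of its design are less than ideal, however: fetching a Chinese web page in the usual way often yields garbled text. Take the Xinhua News home page (http://xinhua.org) as an example. Following the HtmlAgilityPack samples, the scraping code looks like this: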

            HtmlWeb hw = new HtmlWeb();
            string url = @"http://xinhua.org";
            HtmlDocument doc = hw.Load(url);
            doc.Save("output.html");

    Opening the saved page in IE shows garbled text.

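    After working through the maze of HtmlAgilityPack's source code, I finally traced the problem to the Get(Uri uri, string method, string path, HtmlDocument doc) method of the HtmlWeb class. That method contains the following code: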

            HttpWebResponse resp;

            try
            {
                resp = req.GetResponse() as HttpWebResponse;
            }
            ……
            if ((resp.ContentEncoding != null) && (resp.ContentEncoding.Length>0))
            {
                respenc = System.Text.Encoding.GetEncoding(resp.ContentEncoding);
            }
            else
            {
                respenc = null;
            }
            ……
            Stream s = resp.GetResponseStream();
            if (s != null)
            {
                if (UsingCache)
                {
                    // NOTE: LastModified does not contain milliseconds, so we remove them to the file
                    SaveStream(s, cachePath, RemoveMilliseconds(resp.LastModified), _streamBufferSize);

                    // save headers
                    SaveCacheHeaders(req.RequestUri, resp);

                    if (path != null)
                    {
                        // copy and touch the file
                        IOLibrary.CopyAlways(cachePath, path);
                        File.SetLastWriteTime(path, File.GetLastWriteTime(cachePath));
                    }
                }
                else
                {
                    // try to work in-memory
                    if ((doc != null) && (html))
                    {
                        if (respenc != null)
                        {
                            doc.Load(s, respenc);
                        }
                        else
                        {
                            doc.Load(s, true);
                        }
                    }
                }
                resp.Close();
            }

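Here resp is the response to the HTTP request. Setting a breakpoint shows that resp.ContentEncoding is empty, so the load ends up going through doc.Load(s, true). That Load overload may have problems of its own, and the end result is garbled text. (ContentEncoding corresponds to the Content-Encoding header, which describes compression such as gzip. The character set is reported separately by resp.CharacterSet, so this branch is unlikely to pick up a page's charset.)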
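The quoted code chooses its encoding from resp.ContentEncoding. In HTTP the charset normally travels as a parameter of the Content-Type header instead, so a small helper that reads it from there may make the distinction clearer. This is only a minimal sketch: the class and method names (CharsetSketch, FromContentType) are hypothetical and not part of HtmlAgilityPack, and it relies only on the standard System.Net.Mime.ContentType class.

```csharp
using System;
using System.Net.Mime;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class CharsetSketch
{
    // Returns the charset parameter of a Content-Type header value,
    // or null when the header is missing, malformed, or has no charset.
    public static string FromContentType(string contentTypeHeader)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
        {
            return null;
        }

        try
        {
            // ContentType parses values such as "text/html; charset=gb2312".
            return new ContentType(contentTypeHeader).CharSet;
        }
        catch (FormatException)
        {
            // The header value could not be parsed as a media type.
            return null;
        }
    }
}
```

For a response object, the header value could be taken from resp.ContentType and passed to CharsetSketch.FromContentType.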

Solution:

Do not use HtmlWeb; the class is immature. Write the HTTP request yourself, as follows:

            // Requires: using System; using System.IO; using System.Net; using HtmlAgilityPack;
            // Declare the URL once and reuse it for the request.
            string url = @"http://xinhua.org";
            HttpWebRequest req = WebRequest.Create(new Uri(url)) as HttpWebRequest;
            req.Method = "GET";
            try
            {
                // Dispose the response and its stream once the document is saved.
                using (WebResponse rs = req.GetResponse())
                using (Stream rss = rs.GetResponseStream())
                {
                    HtmlDocument doc = new HtmlDocument();
                    doc.Load(rss);
                    doc.Save("output.html");
                }
            }
            catch (Exception e)
            {
                Console.WriteLine(e.Message);
                Console.WriteLine(e.StackTrace);
            }

In the code above, doc.Load(…) uses System.Text.Encoding.Default, which is gb2312 on my machine.

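HtmlDocument can also load a stream with an explicitly specified encoding. There are two ways to obtain that encoding:
(1) Read the charset declared for the page from the HttpWebResponse object (its CharacterSet property).
(2) For HTML pages that do not provide a charset, HtmlDocument offers DetectEncoding(…), a method that detects the encoding automatically. I have not tested this method and do not know how reliable it is.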
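Option (1) could be sketched roughly as follows: turn the charset name reported by the response into an Encoding, fall back to a default when the name is missing or unknown, and decode the body with it. The names EncodingSketch, Resolve, and Decode are hypothetical helpers introduced here for illustration. On .NET Core and later, legacy code pages such as gb2312 are only available after registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package; without that, this sketch simply falls back.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helpers for illustration; not part of HtmlAgilityPack.
public static class EncodingSketch
{
    // Maps a charset name (for example the value of resp.CharacterSet)
    // to an Encoding, using the fallback when the name is empty or unknown.
    public static Encoding Resolve(string charsetName, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charsetName))
        {
            return fallback;
        }

        try
        {
            // Some servers quote the charset value, so strip quotes as well.
            return Encoding.GetEncoding(charsetName.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // The name is not a recognized encoding on this platform.
            return fallback;
        }
    }

    // Reads the whole stream as text using the resolved encoding.
    public static string Decode(Stream body, string charsetName, Encoding fallback)
    {
        Encoding encoding = Resolve(charsetName, fallback);
        using (var reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The Encoding returned by Resolve could be passed to doc.Load(stream, encoding), the same overload the HtmlWeb code calls, or the string returned by Decode could be handed to doc.LoadHtml(html).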

All rights reserved. Reposting is welcome.

Original article: https://www.cnblogs.com/xiaotie/p/794240.html