  • Handling 404 "page not found" errors from HttpWebResponse in C#


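         Requirements: I have recently been working on a project that starts from a news index page (the kind of page that is mostly a list of links), fetches the HTML source of each linked article page, and saves that source to a local database.
        Because the network is not always stable, however, the download keeps failing with 404 "page not found" errors, even though the pages definitely exist. Stranger still, the errors tend to appear only after the program has been running for a while. None of the fixes I found online worked. I have now solved the problem, so I am writing up my solution to share and discuss.
    Core of the solution: whenever this error occurs, the download function calls itself recursively to try again. The code is as follows: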

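     // Downloads url, trying again when the server answers 404.
     // Returns string.Empty if every attempt fails.
     public static string GetDataFromUrl(string url, int nRetryTimes)
            {
                // The recursion stops once the retry budget is used up.
                if (nRetryTimes <= 0)
                    return string.Empty;

                string result = string.Empty;
                try
                {
                    result = GetDataFromUrl(url);
                }
                catch (System.Exception exc)
                {
                    // Check the HTTP status code instead of searching the message text for "404".
                    WebException webExc = exc as WebException;
                    HttpWebResponse errorResponse =
                        webExc == null ? null : webExc.Response as HttpWebResponse;
                    if (errorResponse != null && errorResponse.StatusCode == HttpStatusCode.NotFound)
                    {
                        result = GetDataFromUrl(url, nRetryTimes - 1);
                    }
                    // Any other error is swallowed and reported as an empty result.
                }
                return result;
            }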
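
The same retry idea can also be written as a loop with a short pause between attempts, which gives a flaky connection a moment to recover. The following is only a minimal sketch under stated assumptions, not the code used in this post: the names RetryDownloader, DownloadWithRetry, IsNotFound, maxAttempts and delayMs are hypothetical, the download step is passed in as a delegate, and, unlike the recursive wrapper above, errors that should not be retried are rethrown rather than swallowed.

```csharp
using System;
using System.Net;
using System.Threading;

public static class RetryDownloader
{
    // True when the exception carries an HTTP 404 response.
    public static bool IsNotFound(Exception exc)
    {
        WebException webExc = exc as WebException;
        HttpWebResponse response =
            webExc == null ? null : webExc.Response as HttpWebResponse;
        return response != null && response.StatusCode == HttpStatusCode.NotFound;
    }

    // Calls download(url) up to maxAttempts times. An attempt is repeated only
    // when shouldRetry(exception) returns true; other exceptions are rethrown.
    // Returns string.Empty when every attempt has failed.
    public static string DownloadWithRetry(
        string url, int maxAttempts, int delayMs,
        Func<string, string> download, Func<Exception, bool> shouldRetry)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return download(url);
            }
            catch (Exception exc)
            {
                if (!shouldRetry(exc))
                    throw;                            // not a retryable error
                if (attempt < maxAttempts)
                    Thread.Sleep(delayMs * attempt);  // wait a little longer each time
            }
        }
        return string.Empty;
    }
}
```

With the one-argument download function from this post, a call might look like RetryDownloader.DownloadWithRetry(url, 3, 500, GetDataFromUrl, RetryDownloader.IsNotFound). The loop avoids growing the call stack, and the pause length is a guess that would need tuning for a real site.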
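    Here nRetryTimes is how many times the function may still call itself after this error occurs, so it also serves as the termination condition of the recursion. The code of GetDataFromUrl(string url) is as follows: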

     // Requires: using System.IO; using System.Net; using System.Text;
     //           using System.Text.RegularExpressions;
     public static string GetDataFromUrl(string url)
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
                // Set the HTTP request headers
                request.AllowAutoRedirect = true;
                request.AllowWriteStreamBuffering = true;
                request.Referer = "";
                request.Timeout = 10000000; // in milliseconds, i.e. roughly 2.8 hours
                request.UserAgent = "";
                request.KeepAlive = false; // to avoid time-out errors

                // The using blocks release the connection even if an exception is thrown
                using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
                using (MemoryStream mStream = new MemoryStream())
                {
                    // Choose the encoding from the HTTP response header
                    string characterSet = response.CharacterSet;
                    Encoding encode;
                    if (!string.IsNullOrEmpty(characterSet))
                    {
                        // ISO-8859-1 is typically reported when the server names no charset;
                        // the pages downloaded here are Chinese, so treat it as gb2312
                        if (string.Compare(characterSet, "ISO-8859-1", true) == 0)
                        {
                            characterSet = "gb2312";
                        }
                        encode = Encoding.GetEncoding(characterSet);
                    }
                    else
                    {
                        encode = Encoding.Default;
                    }

                    // Copy the response body into a memory stream so it can be decoded twice
                    using (Stream receiveStream = response.GetResponseStream())
                    {
                        byte[] bf = new byte[4096];
                        int count;
                        while ((count = receiveStream.Read(bf, 0, bf.Length)) > 0)
                        {
                            mStream.Write(bf, 0, count);
                        }
                    }

                    // Read the text back out of the memory stream.
                    // The reader is not closed here because that would also close mStream.
                    mStream.Seek(0, SeekOrigin.Begin);
                    StreamReader reader = new StreamReader(mStream, encode);
                    string str = reader.ReadToEnd();

                    // Look for the charset declared in the page's <meta> tag. If it differs
                    // from the encoding in the HTTP response, the page's declaration wins
                    // and the text is read from the memory stream again.
                    Regex reg = new Regex(@"<meta[^>]*?charset\s*=\s*[""']?\s*([\w\-]+)",
                                          RegexOptions.IgnoreCase);
                    Match m = reg.Match(str);
                    if (m.Success)
                    {
                        string tempCharSet = m.Groups[1].Value;
                        if (string.Compare(tempCharSet, characterSet, true) != 0)
                        {
                            encode = Encoding.GetEncoding(tempCharSet);
                            mStream.Seek(0, SeekOrigin.Begin);
                            reader = new StreamReader(mStream, encode);
                            str = reader.ReadToEnd();
                        }
                    }
                    return str;
                }
            }
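
The charset handling in the function above is the part most likely to go wrong, because pages declare their encoding in several forms and sometimes name an encoding the runtime does not know. As an illustration only, that step could be pulled out into a small helper that can be tried without any network access. The names CharsetSniffer, FindMetaCharset and ResolveEncoding are hypothetical, and the pattern is an assumption about the common forms of the <meta> tag, not an exhaustive parser.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Matches both <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the charset name declared in a <meta> tag, or null if there is none.
    public static string FindMetaCharset(string html)
    {
        Match m = MetaCharset.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Maps a charset name to an Encoding, using the fallback when the name
    // is missing or is not recognised by the runtime.
    public static Encoding ResolveEncoding(string name, Encoding fallback)
    {
        if (string.IsNullOrEmpty(name))
            return fallback;
        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            return fallback;
        }
    }
}
```

Falling back instead of throwing means a page with a misspelled charset still gets saved, possibly with garbled characters, rather than being skipped. Whether that trade-off is right depends on how the stored source is used later.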

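    It is worth noting that even with this approach, when you check the database you will still find some article pages whose source was not downloaded. Take my own table as an example. Its columns are: ArticlePageId, the primary key; ArticlePageTitle, the news title; ArticlePageUrl, the URL of the article page; and ArticlePageSource, the source of the article page, that is, the HTML downloaded from ArticlePageUrl. An empty ArticlePageSource field means the download failed. So I added a patch module as well. The code is as follows: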

    Patch module code
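
Since the patch code itself is not reproduced in this text, the following is only a guess at the shape such a module could take, based on the table described above: go over the rows whose ArticlePageSource is empty and try the download again. It is a sketch that works on an in-memory list so that it stays self-contained. The ArticlePage class and the PatchMissingSources method are hypothetical, and loading the rows from the database and writing them back is left out.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-in for one row of the table described above.
public class ArticlePage
{
    public int ArticlePageId;
    public string ArticlePageTitle;
    public string ArticlePageUrl;
    public string ArticlePageSource;
}

public static class PatchModule
{
    // Downloads again every page whose source is still empty.
    // Returns the number of rows that were filled in.
    public static int PatchMissingSources(
        IList<ArticlePage> pages, Func<string, string> download)
    {
        int patched = 0;
        foreach (ArticlePage page in pages)
        {
            if (!string.IsNullOrEmpty(page.ArticlePageSource))
                continue;                       // already downloaded

            string source = download(page.ArticlePageUrl);
            if (!string.IsNullOrEmpty(source))
            {
                page.ArticlePageSource = source;
                patched++;
            }
        }
        return patched;
    }
}
```

The download delegate could be url => GetDataFromUrl(url, 3), using the retry wrapper from earlier in this post. Running the pass repeatedly until it returns 0, or until a fixed number of passes is reached, would pick up pages that fail more than once.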


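    PS: I am a beginner, and this is my first time posting to the front page to share what I have learned and my own take on it. If anything here is wrong, I would be grateful to more experienced readers for pointing it out, so that I do not mislead anyone.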



  • Original article: https://www.cnblogs.com/finallyliuyu/p/1533863.html