Several Ways to Scrape Web Page Data in C# (WebClient, WebBrowser, and HttpWebRequest/HttpWebResponse)

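There are many ways to fetch web page data. This article covers three of them: WebClient, WebBrowser, and HttpWebRequest/HttpWebResponse.

Each of these returns the entire page. If you only need particular pieces of data, write your own function to pick them out. The usual approach is to study the structure of the page source and filter out the parts you need with regular expressions.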
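
As an illustration of the regular-expression filtering mentioned above, here is a minimal sketch that pulls the title and the link targets out of an HTML string. The class name `LinkExtractor`, its method names, and the sample markup are assumptions made for this example and do not come from the original article. Regular expressions are fragile against irregular markup, so treat this as a starting point rather than a general-purpose parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Returns the href value of every <a> tag found in the markup.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        Regex re = new Regex(
            @"<a\s[^>]*?href\s*=\s*[""'](?<url>[^""']+)[""']",
            RegexOptions.IgnoreCase);
        foreach (Match m in re.Matches(html))
        {
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }

    // Returns the trimmed text of the <title> element, or an empty string.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(
            html,
            @"<title>\s*(?<title>.*?)\s*</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups["title"].Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical sample page used only for this demonstration.
        string html = "<html><head><title> Demo </title></head><body>"
                    + "<a href=\"http://example.com/a\">A</a> "
                    + "<a class='x' href='/b'>B</a></body></html>";

        Console.WriteLine(ExtractTitle(html));
        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

Named groups such as `(?<url>...)` keep the extraction readable: each piece of data is retrieved by name instead of by position.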
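1. Fetching page content with WebClient

This is a very simple approach (though the others are simple too). One thing to note first: for the sake of efficiency in a real project, consider buffering the response in memory inside the function. Roughly like this: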

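// Requires: using System.IO; using System.Text; using System.Text.RegularExpressions;
// On .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
// once at startup, otherwise "gb2312" is not available.
// MemoryStream is a stream whose backing store is memory.
// reader is the stream to read from, for example a response stream.
static string ReadHtml(Stream reader)
{
    byte[] buffer = new byte[1024];
    using (MemoryStream memory = new MemoryStream())
    {
        int read, sum = 0;
        // Read in 1 KB chunks until the stream ends or about 100 KB has been read.
        while (sum < 100 * 1024 && (read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            memory.Write(buffer, 0, read);
            sum += read;
        }
        byte[] data = memory.ToArray();
        // Web pages are usually encoded as UTF-8 or GB2312; try GB2312 first.
        string html = Encoding.GetEncoding("gb2312").GetString(data);
        if (string.IsNullOrEmpty(html))
        {
            return html;
        }
        // Look for the charset declared in the page source.
        Regex re = new Regex(@"charset=[""']?(?<charset>[\w-]+)");
        Match m = re.Match(html.ToLower());
        string encoding = m.Groups["charset"].Value;
        if (string.IsNullOrEmpty(encoding) || string.Equals(encoding, "gb2312"))
        {
            return html;
        }
        // The page declares another charset: decode the same bytes again with it.
        return Encoding.GetEncoding(encoding).GetString(data);
    }
}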
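
The buffering and charset-detection idea above can be tried offline with a self-contained sketch such as the one below. It feeds an in-memory page through the same steps: read the bytes into a MemoryStream, sniff the declared charset, then decode. The class `CharsetSniffer`, its method names, the UTF-8 fallback, and the sample page are all assumptions made for this illustration. The probe uses ISO-8859-1 instead of GB2312 because that encoding is available on every .NET runtime without registering extra providers.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSniffer
{
    // Copies at most maxBytes from the stream into a byte array.
    public static byte[] ReadAll(Stream input, int maxBytes)
    {
        byte[] buffer = new byte[1024];
        using (MemoryStream memory = new MemoryStream())
        {
            int read;
            while (memory.Length < maxBytes)
            {
                int want = Math.Min(buffer.Length, maxBytes - (int)memory.Length);
                read = input.Read(buffer, 0, want);
                if (read <= 0) break; // end of stream
                memory.Write(buffer, 0, read);
            }
            return memory.ToArray();
        }
    }

    // Returns the charset name declared in the markup, or null when none is found.
    public static string DetectCharset(byte[] data)
    {
        // ISO-8859-1 maps each byte to one char, so the ASCII markup stays readable.
        string probe = Encoding.GetEncoding("iso-8859-1").GetString(data);
        Match m = Regex.Match(
            probe,
            @"charset\s*=\s*[""']?(?<charset>[\w-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["charset"].Value : null;
    }

    // Decodes with the declared charset, falling back to UTF-8 (an assumption).
    public static string Decode(byte[] data)
    {
        Encoding encoding = Encoding.UTF8;
        string name = DetectCharset(data);
        if (name != null)
        {
            try { encoding = Encoding.GetEncoding(name); }
            catch (ArgumentException) { /* unknown charset: keep the fallback */ }
        }
        return encoding.GetString(data);
    }

    public static void Main()
    {
        // Hypothetical page content standing in for a real response stream.
        byte[] page = Encoding.UTF8.GetBytes(
            "<html><head><meta charset=\"utf-8\"><title>测试</title></head></html>");
        using (MemoryStream stream = new MemoryStream(page))
        {
            byte[] data = ReadAll(stream, 100 * 1024);
            Console.WriteLine(DetectCharset(data));
            Console.WriteLine(Decode(data));
        }
    }
}
```

Keeping the raw bytes around is the point of the MemoryStream: the page can be decoded a second time once its real charset is known, without downloading it again.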

Now to the main topic. The code for fetching page data with WebClient is as follows:

// Requires: using System; using System.IO; using System.Net; using System.Text;
try
{
    using (WebClient webClient = new WebClient())
    {
        // Network credentials used to authenticate the request to the Internet resource.
        webClient.Credentials = CredentialCache.DefaultCredentials;
        byte[] pageData = webClient.DownloadData("http://www.360doc.com/content/11/0427/03/1947337_112596569.shtml");
        //string pageHtml = Encoding.GetEncoding("gb2312").GetString(pageData); // use this line if the page is encoded as GB2312
        string pageHtml = Encoding.UTF8.GetString(pageData); // use this line if the page is encoded as UTF-8
        using (StreamWriter sw = new StreamWriter(@"e:\ouput.txt")) // write the fetched content to a text file
        {
            sw.Write(pageHtml);
        }
    }
}
catch (WebException webEx)
{
    Console.WriteLine(webEx.Message);
}
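
When the page encoding is known in advance, the separate decoding step can be folded into the download. The sketch below sets the `WebClient.Encoding` property and calls `DownloadString`. The wrapper class `WebClientSketch` and its `Download` method are names invented for this example. The URL is the one used in the article, and the call needs network access to succeed.

```csharp
using System;
using System.Net;
using System.Text;

public static class WebClientSketch
{
    // Downloads the resource at the given address and decodes it as text.
    public static string Download(string url, Encoding encoding)
    {
        using (WebClient client = new WebClient())
        {
            // DownloadString uses this encoding to turn the bytes into a string.
            client.Encoding = encoding;
            return client.DownloadString(url);
        }
    }

    public static void Main()
    {
        try
        {
            string html = Download(
                "http://www.360doc.com/content/11/0427/03/1947337_112596569.shtml",
                Encoding.UTF8);
            Console.WriteLine(html.Length + " characters downloaded");
        }
        catch (WebException ex)
        {
            // Status distinguishes timeouts, DNS failures, protocol errors, and so on.
            Console.WriteLine(ex.Status + ": " + ex.Message);
        }
    }
}
```

Choosing the wrong encoding does not raise an error; it silently produces garbled characters, so it is worth checking the page's declared charset first.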
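2. Fetching page content with the WebBrowser control

Relatively speaking, this is the simplest approach. Drag a WebBrowser control onto the form and pair it with the code below: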

// Requires: using System.IO; using System.Windows.Forms;
WebBrowser web = new WebBrowser();
// Subscribe before navigating so that the completion event is not missed.
web.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(web_DocumentCompleted);
web.Navigate("http://www.163.com");

void web_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser web = (WebBrowser)sender;
    // Collect every <table> element in the loaded document.
    HtmlElementCollection elementCollection = web.Document.GetElementsByTagName("Table");
    foreach (HtmlElement item in elementCollection)
    {
        // Append the text of each table to a file.
        File.AppendAllText("Kaijiang_xj.txt", item.InnerText);
    }
}
3. Fetching page content with HttpWebRequest/HttpWebResponse

This is a more general-purpose approach.

// Requires: using System.IO; using System.Net; using System.Text;
public void GetHtml()
{
    var url = "http://www.360doc.com/content/11/0427/03/1947337_112596569.shtml";
    StringBuilder strBuff = new StringBuilder(); // holds the downloaded HTML
    int charsRead;

    HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(url);
    // On success the page content is returned as a System.IO.Stream. On failure an
    // exception such as WebException or ProtocolViolationException is thrown, so
    // production code should wrap what follows in a try block. It is kept simple here.
    using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
    using (Stream reader = webResponse.GetResponseStream())
    // The response is a Stream, so wrap it in a StreamReader to decode it as UTF-8 text.
    using (StreamReader respStreamReader = new StreamReader(reader, Encoding.UTF8))
    {
        // Read the page source in chunks; Read returns 0 at the end of the stream.
        char[] cbuffer = new char[1024];
        while ((charsRead = respStreamReader.Read(cbuffer, 0, cbuffer.Length)) != 0)
        {
            strBuff.Append(cbuffer, 0, charsRead);
        }
    }
    using (StreamWriter sw = new StreamWriter(@"e:\ouput.txt")) // write the fetched content to a text file
    {
        sw.Write(strBuff.ToString());
    }
}
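
The comment in the code above recommends a try block. One possible shape for that is sketched here, with a timeout and a User-Agent header added. The class `HttpFetchSketch`, its method names, the 10-second timeout, and the User-Agent string are illustrative assumptions, not part of the original article. The request itself needs network access.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class HttpFetchSketch
{
    // Decodes the whole stream as text with the given encoding.
    public static string ReadBody(Stream body, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns the page source, or null when the request fails.
    public static string TryGetHtml(string url, int timeoutMs)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "GET";
            request.Timeout = timeoutMs; // milliseconds
            request.UserAgent = "Mozilla/5.0 (compatible; sample)"; // placeholder value

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (Stream body = response.GetResponseStream())
            {
                return ReadBody(body, Encoding.UTF8);
            }
        }
        catch (WebException ex)
        {
            // For HTTP errors the response carries the status code (404, 500, ...).
            HttpWebResponse failed = ex.Response as HttpWebResponse;
            string status = failed != null
                ? ((int)failed.StatusCode).ToString()
                : ex.Status.ToString();
            Console.Error.WriteLine("Request failed (" + status + "): " + ex.Message);
            return null;
        }
    }

    public static void Main()
    {
        string html = TryGetHtml(
            "http://www.360doc.com/content/11/0427/03/1947337_112596569.shtml",
            10000);
        Console.WriteLine(html == null ? "no content" : html.Length + " characters");
    }
}
```

Returning null on failure keeps the caller simple. A real scraper might instead retry or log the failing URL, depending on how many pages it processes.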

Original article: https://www.cnblogs.com/sanler/p/7249312.html
