  • C# crawler: several ways to fetch a web page (WebRequest, HttpWebRequest, WebClient, HttpClient)

    
    

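     1. WebRequest is an abstract class in System.Net. Its concrete subclasses include HttpWebRequest, FileWebRequest, and FtpWebRequest; each has a matching response class (for example HttpWebResponse, which derives from WebResponse).

    System.Net.WebRequest (abstract)
    System.Net.HttpWebRequest         : WebRequest
    System.Net.HttpWebResponse        : WebResponse
    System.Net.FileWebRequest         : WebRequest
    System.Net.FtpWebRequest          : WebRequest

    All WebRequest subclasses are used to fetch resources from the web. HttpWebRequest talks to the server over HTTP, usually with GET to retrieve data and POST to submit it.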

    // Requires: using System; using System.IO; using System.Net;
    static void Main(string[] args)
    {
        // Create a WebRequest instance (GET by default)
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.baidu.com");
        // The request method can be set explicitly
        //request.Method = "POST";
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Console.WriteLine(response.StatusDescription);
        // Read the response body
        Stream dataStream = response.GetResponseStream();
        StreamReader reader = new StreamReader(dataStream);
        string responseFromServer = reader.ReadToEnd();
        Console.WriteLine(responseFromServer);
        // Close the reader, the stream, and the response
        reader.Close();
        dataStream.Close();
        response.Close();
    }

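    WebRequest is the abstract base class of the .NET request/response model; it lets code request data from the Internet in a protocol-agnostic way.
    Note: the Create method returns an instance of the WebRequest subclass, determined at run time, whose registered URI prefix is the closest match for requestUri. For example, passing a URI that begins with http:// makes Create return an HttpWebRequest, while a URI that begins with file:// yields a FileWebRequest. The .NET Framework includes built-in support for the http://, https://, ftp://, and file:// URI schemes.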
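
    The scheme-based dispatch described in the note can be observed without sending any request. The following is a minimal sketch; the class and method names (SchemeDemo, RequestTypeFor) are made up for illustration, and the file path is only a placeholder.

```csharp
using System;
using System.Net;

// WebRequest is marked obsolete on modern .NET (SYSLIB0014); it still works for this demo.
#pragma warning disable SYSLIB0014

public static class SchemeDemo
{
    // Returns the name of the concrete WebRequest subclass that Create picks for a URI.
    public static string RequestTypeFor(string uri)
    {
        WebRequest request = WebRequest.Create(uri); // only builds the object, nothing is sent
        return request.GetType().Name;
    }

    public static void Main()
    {
        Console.WriteLine(RequestTypeFor("http://www.baidu.com"));  // HTTP scheme
        Console.WriteLine(RequestTypeFor("file:///C:/log.txt"));    // file scheme (placeholder path)
    }
}
```

    On the runtimes where this API is available, the two lines printed should name HttpWebRequest and FileWebRequest respectively, which is why the earlier snippets can safely cast the result of Create to HttpWebRequest for http:// URLs.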

    GET

    var request = WebRequest.Create("http://www.baidu.com");
    request.Method = "GET";
    var response = request.GetResponse();
    using (var stream = new System.IO.StreamReader(response.GetResponseStream()))
    {
        var content = stream.ReadToEnd(); // the remote page as a string

        Console.WriteLine(content);
    }

    POST

    // Requires: using System.Net; using System.Text;
    var jsonToPost = "{\"name\":\"admin\",\"pwd\":\"123456\"}";
    var request = WebRequest.Create("http://www.sina.com");
    request.Method = "POST";
    request.ContentType = "application/json"; // added: tells the server the body is JSON

    using (var requestStream = request.GetRequestStream())
    {
        var bytes = Encoding.UTF8.GetBytes(jsonToPost);
        requestStream.Write(bytes, 0, bytes.Length);
    }

    var response = request.GetResponse();
    using (var stream = new System.IO.StreamReader(response.GetResponseStream()))
    {
        var content = stream.ReadToEnd(); // the body returned by the POST
    }

    System.Net.HttpWebRequest/HttpWebResponse

    // Requires: using System; using System.IO; using System.Net; using System.Text;
    // txtURL and txtHTML are text boxes on the form that hosts this code.
    HttpWebRequest httpReq;
    HttpWebResponse httpResp;

    string strBuff = "";
    char[] cbuffer = new char[256];
    int byteRead = 0; // number of characters returned by the last Read call

    string filename = @"c:\log.txt";

    /// Reads the page at the URL in txtURL and shows its source in txtHTML
    public void WriteStream()
    {
        Uri httpURL = new Uri(txtURL.Text);

        // HttpWebRequest derives from WebRequest and has no public constructor;
        // create it through WebRequest.Create and cast the result.
        httpReq = (HttpWebRequest)WebRequest.Create(httpURL);

        // GetResponse() returns the response, which is cast to HttpWebResponse.
        // It throws WebException on network or HTTP errors.
        httpResp = (HttpWebResponse)httpReq.GetResponse();

        // GetResponseStream() returns the body as a System.IO.Stream; if there is no
        // response stream it throws ProtocolViolationException. Properly, this code
        // belongs in a try block; it is kept simple here.
        Stream respStream = httpResp.GetResponseStream();

        // Wrap the stream in a StreamReader (decoding as UTF-8) and read the page
        // source in chunks of up to 256 characters until the end of the stream.
        StreamReader respStreamReader = new StreamReader(respStream, Encoding.UTF8);

        byteRead = respStreamReader.Read(cbuffer, 0, 256);

        while (byteRead != 0)
        {
            string strResp = new string(cbuffer, 0, byteRead);
            strBuff = strBuff + strResp;
            byteRead = respStreamReader.Read(cbuffer, 0, 256);
        }

        respStreamReader.Close();
        respStream.Close();
        httpResp.Close();
        txtHTML.Text = strBuff;
    }

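    2. System.Net.WebClient

    WebClient is a lightweight class for accessing Internet resources: given a URI, it can send and receive data. It provides the DownloadData, DownloadFile, UploadData, and UploadFile methods, together with asynchronous counterparts of each, which makes uploading and downloading files convenient.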

    // Requires: using System; using System.IO; using System.Net; using System.Text;
    static void Main(string[] args)
    {
        WebClient wc = new WebClient();
        wc.BaseAddress = "http://www.baidu.com/";   // base address for relative URLs
        wc.Encoding = Encoding.UTF8;                // encoding used to decode the response; without this line Chinese text may come back garbled
        string str = wc.DownloadString("/");        // returns the resource as a string
        Console.WriteLine(str);

        //---------------------- OpenRead(): read the response as a stream ----------------------
        // Headers holds the request header fields to send
        wc.Headers.Add("Accept", "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*");
        wc.Headers.Add("Accept-Language", "zh-cn");
        wc.Headers.Add("UA-CPU", "x86");
        //wc.Headers.Add("Accept-Encoding","gzip, deflate");    // this program does not decompress gzip, so a response requested this way might not be decodable; adding gzip handling is out of scope here
        wc.Headers.Add("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)");
        Stream objStream = wc.OpenRead("?tn=98050039_dg&ch=1");             // open the response stream
        StreamReader _read = new StreamReader(objStream, Encoding.UTF8);    // reader that decodes the stream as UTF-8
        Console.Write(_read.ReadToEnd());                                   // print the text that was read

        //------------------------ DownloadFile: save a remote file locally ------------------------
        wc.DownloadFile("http://www.baidu.com/img/shouye_b5486898c692066bd2cbaeda86d74448.jpg", @"D:\123.jpg");

        //------------------------ DownloadData: download into a byte array ------------------------
        byte[] bytes = wc.DownloadData("http://www.baidu.com/img/shouye_b5486898c692066bd2cbaeda86d74448.gif");
        using (FileStream fs = new FileStream(@"E:\123.gif", FileMode.Create))
        {
            fs.Write(bytes, 0, bytes.Length);
        }
        // Print the response headers
        WebHeaderCollection whc = wc.ResponseHeaders;
        foreach (string s in whc)
        {
            Console.WriteLine(s + ":" + whc.Get(s));
        }
        Console.ReadKey();
    }

    WebClient MyWebClient = new WebClient();
    MyWebClient.Credentials = CredentialCache.DefaultCredentials; // network credentials used to authenticate the request
    Byte[] pageData = MyWebClient.DownloadData("http://www.163.com"); // download the raw bytes of the page
    string pageHtml = Encoding.Default.GetString(pageData);  // use this if the page is GB2312 and GB2312 is the system default encoding
    //string pageHtml = Encoding.UTF8.GetString(pageData); // use this if the page is UTF-8
    Console.WriteLine(pageHtml); // print the downloaded content to the console
    using (StreamWriter sw = new StreamWriter(@"c:\test\ouput.html")) // write the content to a file
    {
        sw.Write(pageHtml);
    }

    System.Net.Http.HttpClient

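      HttpClient is an HTTP client library introduced in .NET 4.5, in the System.Net.Http namespace. Before .NET 4.5 the same job was typically done with WebClient or HttpWebRequest. HttpClient is built on the task-based asynchronous pattern (async/await), which makes asynchronous requests easy to handle.

     Below is an example of a console program that calls an API asynchronously: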

    // Requires: using System; using System.Net; using System.Net.Http; using System.Text; using System.Threading.Tasks;
    // UserInfo and JsonHelper are the original author's own types and are not shown here.
    class Program
    {
        static void Main(string[] args)
        {
            const string GetUrl = "http://xxxxxxx/api/UserInfo/GetUserInfos"; // endpoint that lists users (GET)
            const string PostUrl = "http://xxxxxxx/api/UserInfo/AddUserInfo"; // endpoint that adds a user (POST)

            // Start the GET request
            Task getTask = GetFunc(GetUrl);

            UserInfo user = new UserInfo { Name = "jack", Age = 23 };
            string userStr = JsonHelper.SerializeObject(user); // serialize to JSON
            // Start the POST request
            Task postTask = PostFunc(PostUrl, userStr);

            // Wait for both requests so that failures surface here
            Task.WaitAll(getTask, postTask);
            Console.ReadLine();
        }

        /// <summary>
        /// GET request
        /// </summary>
        /// <param name="path">URL to request</param>
        static async Task GetFunc(string path)
        {
            // Message handler with automatic gzip decompression
            HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.GZip };
            // Pass the handler in; otherwise the setting above has no effect
            HttpClient httpClient = new HttpClient(handler);
            // Asynchronous GET request
            HttpResponseMessage response = await httpClient.GetAsync(path);
            // EnsureSuccessStatusCode() throws if the status code is not a success code
            response.EnsureSuccessStatusCode();
            // Read the body asynchronously as a string
            string resultStr = await response.Content.ReadAsStringAsync();
            Console.WriteLine(resultStr);
        }

        /// <summary>
        /// POST request
        /// </summary>
        /// <param name="path">URL to request</param>
        /// <param name="data">JSON body to send</param>
        static async Task PostFunc(string path, string data)
        {
            HttpClientHandler handler = new HttpClientHandler() { AutomaticDecompression = DecompressionMethods.GZip };
            HttpClient httpClient = new HttpClient(handler);
            // HttpContent is the base class for the HTTP entity body and content headers
            HttpContent httpContent = new StringContent(data, Encoding.UTF8, "application/json");
            //httpClient.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("BasicAuth", Ticket); // set the authorization header
            //httpContent.Headers.Add(string name,string value) // add a custom header

            // Asynchronous POST request
            HttpResponseMessage response = await httpClient.PostAsync(path, httpContent);
            response.EnsureSuccessStatusCode();
            string resultStr = await response.Content.ReadAsStringAsync();
            Console.WriteLine(resultStr);
        }
    }

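    Note: HttpClient has a warm-up cost, so the first request through a new instance is slow. Do not create a new HttpClient each time you need one; obtain a shared instance through a singleton or a similar mechanism. The example above creates instances directly only to keep the demonstration short.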
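
    One possible way to follow this advice is a lazily created static instance. This is only a sketch: the SharedHttp class name and the 30-second timeout are illustrative assumptions, not part of the original article.

```csharp
using System;
using System.Net.Http;

public static class SharedHttp
{
    // Created once, on first use, and then reused by every caller (hypothetical helper).
    private static readonly Lazy<HttpClient> _client = new Lazy<HttpClient>(() =>
    {
        var client = new HttpClient();
        client.Timeout = TimeSpan.FromSeconds(30); // illustrative value
        return client;
    });

    public static HttpClient Instance => _client.Value;

    public static void Main()
    {
        // Both reads return the same object, so the warm-up cost is paid only once.
        Console.WriteLine(ReferenceEquals(SharedHttp.Instance, SharedHttp.Instance));
    }
}
```

    Callers would then write SharedHttp.Instance.GetAsync(url) instead of new HttpClient(). Lazy<T> is thread-safe by default, so concurrent first calls still end up with a single instance.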

    HttpClient has many other features, such as sending cookies and intercepting requests; see https://www.cnblogs.com/wywnet/p/httpclient.html
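
    Request interception is usually done with a DelegatingHandler placed in front of the real handler. The sketch below is an assumption about how that might look, not code from the linked article: LoggingHandler, StubHandler, and the X-Demo header are invented names, and StubHandler replaces the network so the example runs offline.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Intercepts every outgoing request: adds a header, forwards it, then logs the status code.
public class LoggingHandler : DelegatingHandler
{
    public LoggingHandler(HttpMessageHandler inner) : base(inner) { }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        request.Headers.TryAddWithoutValidation("X-Demo", "1"); // hypothetical header
        HttpResponseMessage response = await base.SendAsync(request, cancellationToken);
        Console.WriteLine($"{request.Method} {request.RequestUri} -> {(int)response.StatusCode}");
        return response;
    }
}

// Stand-in for the network; use HttpClientHandler here for real requests.
public class StubHandler : HttpMessageHandler
{
    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Report whether the interceptor's header reached this point.
        bool tagged = request.Headers.Contains("X-Demo");
        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(tagged ? "intercepted" : "plain")
        };
        return Task.FromResult(response);
    }
}

public static class InterceptDemo
{
    public static async Task<string> FetchAsync(string url)
    {
        using (var client = new HttpClient(new LoggingHandler(new StubHandler())))
        {
            return await client.GetStringAsync(url);
        }
    }

    public static void Main()
    {
        Console.WriteLine(FetchAsync("http://example.invalid/").GetAwaiter().GetResult());
    }
}
```

    Because handlers are chained, several interceptors (logging, retries, authentication) can be stacked in front of the handler that actually talks to the network.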

    GET

    using (var http = new HttpClient())
    {
        // Synchronous-style call: block until the response and its body are available
        var content = http.GetAsync("http://www.baidu.com").GetAwaiter().GetResult()
            .Content.ReadAsStringAsync().GetAwaiter().GetResult();
    }

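    POST

    // Requires: using System.Net.Http; using System.Text;
    using (var http = new HttpClient())
    {
        var jsonToPost = "{\"name\":\"admin\",\"pwd\":\"123456\"}";
        // Send the JSON body and block until the response body has been read
        var content = http.PostAsync("http://www.baidu.com", new StringContent(jsonToPost, Encoding.UTF8, "application/json")).GetAwaiter().GetResult()
            .Content.ReadAsStringAsync().GetAwaiter().GetResult();
    }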

     WebBrowser

    // Requires a Windows Forms application: using System.IO; using System.Windows.Forms;
    WebBrowser web = new WebBrowser();
    // Subscribe before navigating so the completion event cannot be missed
    web.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(web_DocumentCompleted);
    web.Navigate("http://www.xxx.com/ssc/");

    void web_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        WebBrowser web = (WebBrowser)sender;
        // Append the text of every <table> element on the page to a local file
        HtmlElementCollection ElementCollection = web.Document.GetElementsByTagName("Table");
        foreach (HtmlElement item in ElementCollection)
        {
            File.AppendAllText("Kaijiang_xj.txt", item.InnerText);
        }
    }

     

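    4. A brief comparison of the three approaches:

    1) WebRequest with HttpWebResponse is the simplest and most direct.

    2) WebClient wraps WebRequest and is convenient for uploading and downloading files. However, it does not directly expose every HttpWebRequest setting (there is no WebClient property for the timeout, for example). To set such properties, derive from WebClient and override GetWebRequest.

    3) WebBrowser is the most powerful but also the most resource-hungry. It embeds a JavaScript engine, depends on the operating system's Internet Explorer engine, and automatically runs the scripts in the returned page. However, it is generally usable only in WinForms programs. If you need WebBrowser in a console program, see: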
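
    The override mentioned in item 2 might look like the following sketch. TimeoutWebClient, TimeoutMilliseconds, and BuildRequest are hypothetical names introduced for illustration; BuildRequest exists only so the effect can be inspected without sending a request.

```csharp
using System;
using System.Net;

// WebClient and WebRequest are marked obsolete on modern .NET (SYSLIB0014).
#pragma warning disable SYSLIB0014

// Hypothetical subclass: a WebClient that applies a timeout to every request it creates.
public class TimeoutWebClient : WebClient
{
    public int TimeoutMilliseconds { get; set; } = 10000;

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request != null)
        {
            request.Timeout = TimeoutMilliseconds; // the setting WebClient does not expose
        }
        return request;
    }

    // Illustration only: returns the configured request without sending it.
    public WebRequest BuildRequest(string url)
    {
        return GetWebRequest(new Uri(url));
    }
}

public static class TimeoutDemo
{
    public static void Main()
    {
        using (var client = new TimeoutWebClient { TimeoutMilliseconds = 5000 })
        {
            Console.WriteLine(client.BuildRequest("http://www.baidu.com").Timeout);
        }
    }
}
```

    Every DownloadString, DownloadData, or DownloadFile call made through such a subclass goes through GetWebRequest, so the timeout applies to all of them.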

    5. web blogs

    WebBrowser is actually in the System.Windows.Forms namespace and is a visual control that you can add to a form. It is primarily a wrapper around the Internet Explorer browser (MSHTML). It allows you to easily display and interact programmatically with a web page. You call the Navigate method passing a web URL, wait for it to complete downloading and display, and then interact with the page using the object model it provides.

    HttpWebRequest is a concrete class that allows you to request in code any sort of file over HTTP. You usually receive it as a stream of bytes. What you do with it after that is up to your application.

    HttpWebResponse allows you to process the response from a web server that was previously requested using HttpWebRequest.

    WebRequest and WebResponse are the abstract base classes that HttpWebRequest and HttpWebResponse inherit from. You can't create these directly. Other classes that inherit from these include the Ftp and File classes.

    WebClient I have always seen as a nice helper class that provides simpler ways to, for example, download or upload a file from a web URL (e.g. the DownloadFile and DownloadString methods). I have heard that it actually uses HttpWebRequest / HttpWebResponse behind the scenes for certain methods.

    If you need more fine-grained control over web requests and responses, HttpWebRequest / HttpWebResponse are probably the way to go. Otherwise WebClient is generally simpler and will do the job.

    1). http://www.pin5i.com/showtopic-24684.html

    2). http://hi.baidu.com/javaecho/blog/item/079c6d2a0d4efd5d4fc226b1.html

     
     
     

    System.Net

     ============================================

    Html Agility Pack

    Install-Package HtmlAgilityPack

    Overloads that load from a Stream:

    (1) public void Load(Stream stream)    /// load HTML from the given Stream

    (2) public void Load(Stream stream, bool detectEncodingFromByteOrderMarks)    /// whether to detect the encoding from the byte order marks at the start of the stream

    (3) public void Load(Stream stream, Encoding encoding)    /// use the given encoding

    (4) public void Load(Stream stream, Encoding encoding, bool detectEncodingFromByteOrderMarks)

    (5) public void Load(Stream stream, Encoding encoding, bool detectEncodingFromByteOrderMarks, int buffersize)

    Overloads that load from a file path:

    (1) public void Load(string path)

    (2) public void Load(string path, bool detectEncodingFromByteOrderMarks)    /// whether to detect the encoding from the byte order marks at the start of the file

    (3) public void Load(string path, Encoding encoding)    /// use the given encoding

    (4) public void Load(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks)

    (5) public void Load(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks, int buffersize)
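
    As a rough illustration of the Load(Stream, Encoding) overload, the sketch below parses HTML from an in-memory stream. It assumes the HtmlAgilityPack NuGet package is installed; LoadDemo and FirstHeading are made-up names, and the MemoryStream stands in for a real response stream such as the one returned by GetResponseStream().

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LoadDemo
{
    // Parses the bytes with an explicit encoding and returns the text of the first <h1>.
    public static string FirstHeading(byte[] htmlBytes, Encoding encoding)
    {
        var doc = new HtmlDocument();
        using (var stream = new MemoryStream(htmlBytes))
        {
            doc.Load(stream, encoding); // overload (3) from the Stream list above
        }
        HtmlNode h1 = doc.DocumentNode.SelectSingleNode("//h1");
        return h1 == null ? null : h1.InnerText;
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("<html><body><h1>你好</h1></body></html>");
        Console.WriteLine(FirstHeading(bytes, Encoding.UTF8));
    }
}
```

    Passing the encoding explicitly avoids the garbled-text problem discussed in the WebClient section when a page is not in the default encoding.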

    // Requires: using System; using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();
    string html = "<div id=\"demo\"><span style=\"color:red;\"><h1>Hello World!</h1></span></div>";
    doc.LoadHtml(html);
    // Look the element up by the id that actually exists in the markup
    HtmlNode node = doc.GetElementbyId("demo");
    string idValue = node.Attributes["id"].Value;
    // List every attribute of the node as name=value
    foreach (HtmlAttribute attr in node.Attributes)
    {
        Console.WriteLine("{0}={1}", attr.Name, attr.Value);
    }

    // Second snippet (separate from the one above): OuterHtml versus InnerHtml
    HtmlDocument doc = new HtmlDocument();

    string html = "<div id=\"demo\"><span style=\"color:red;\"><h1>Hello World!</h1></span></div>";

    doc.LoadHtml(html);

    HtmlNode node = doc.GetElementbyId("demo");

    Console.WriteLine(node.OuterHtml); // <div id="demo"><span style="color:red;"><h1>Hello World!</h1></span></div>
    Console.WriteLine(node.InnerHtml); // <span style="color:red;"><h1>Hello World!</h1></span>


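    Methods for getting ancestor nodes:

    1) public IEnumerable<HtmlNode> Ancestors()

    Returns all ancestors of the current node (not including the node itself).

    2) public IEnumerable<HtmlNode> Ancestors(string name)

    Returns the ancestors that have the given name (not including the node itself).

    3) public IEnumerable<HtmlNode> AncestorsAndSelf()

    Returns the current node together with all of its ancestors.

    4) public IEnumerable<HtmlNode> AncestorsAndSelf(string name)

    Returns the ancestors that have the given name, including the node itself if its name matches.

    Methods for getting child and descendant nodes:

    1) public IEnumerable<HtmlNode> DescendantNodes()

    Returns all nodes below the current node, including children of children (not including the node itself).

    2) public IEnumerable<HtmlNode> DescendantNodesAndSelf()

    Returns all nodes below the current node, including children of children, plus the node itself.

    3) public IEnumerable<HtmlNode> Descendants()

    Returns all descendants of the current node at any depth (not including the node itself). For direct children only, use ChildNodes or Elements.

    4) public IEnumerable<HtmlNode> DescendantsAndSelf()

    Returns all descendants of the current node at any depth, plus the node itself.

    5) public IEnumerable<HtmlNode> Descendants(string name)

    Returns the descendants of the current node that have the given name.

    6) public IEnumerable<HtmlNode> DescendantsAndSelf(string name)

    Returns the descendants that have the given name, including the node itself if its name matches.

    7) public HtmlNode Element(string name)

    Returns the first direct child element with the given name.

    8) public IEnumerable<HtmlNode> Elements(string name)

    Returns all direct child elements with the given name.

    9) public HtmlNodeCollection SelectNodes(string xpath)

    Returns the nodes that match the given XPath expression, or null when nothing matches.

    10) public HtmlNode SelectSingleNode(string xpath)

    Returns the first node that matches the given XPath expression, or null when nothing matches.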
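
    The difference between direct-child, any-depth, and XPath lookups is easiest to see side by side. The following sketch assumes the HtmlAgilityPack NuGet package; TraversalDemo, its method names, and the sample markup are invented for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class TraversalDemo
{
    // One <span> directly under the root and one nested a level deeper.
    const string Html = "<div id=\"demo\"><span>one</span><div><span>two</span></div></div>";

    public static HtmlNode Root()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        return doc.GetElementbyId("demo");
    }

    // Elements(name) looks at direct children only.
    public static int DirectSpans() { return Root().Elements("span").Count(); }

    // Descendants(name) searches at any depth.
    public static int AllSpans() { return Root().Descendants("span").Count(); }

    // XPath relative to the root: the <span> inside the nested <div>.
    public static string NestedText() { return Root().SelectSingleNode("./div/span").InnerText; }

    // SelectNodes returns null, not an empty collection, when nothing matches.
    public static bool NoTables() { return Root().SelectNodes(".//table") == null; }

    public static void Main()
    {
        Console.WriteLine(DirectSpans());
        Console.WriteLine(AllSpans());
        Console.WriteLine(NestedText());
        Console.WriteLine(NoTables());
    }
}
```

    Because SelectNodes can return null, crawler code that iterates over its result should check for null first to avoid a NullReferenceException on pages that lack the expected markup.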

  • Original article: https://www.cnblogs.com/mingjing/p/13509438.html