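Most of the web crawler articles you can find online use Python, and a few use C# (psst: that means VB.NET can do crawlers too). This post covers the first step: fetching a web page.
Before using the code, import the following: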
Imports System.IO, System.IO.Compression, System.Text, System.Net
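Before writing the program, open a browser (I use Chrome), visit any page, press F12 and look at the request headers, then copy the User-Agent for later. You can also copy Accept, Cookie and so on (they are not needed in this post).
Build the request with HttpWebRequest and get the response with HttpWebResponse. Then deal with compression (this is what Imports System.IO.Compression is for).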
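As an alternative to handling compression by hand, HttpWebRequest can decompress the response body itself through its AutomaticDecompression property. The following is only a minimal sketch of that option, not the approach this post takes; the module name AutoDecompressSketch and the helper name CreateRequest are hypothetical.

```vbnet
Imports System.Net

Module AutoDecompressSketch
    ' Hypothetical helper (not part of the post's code): builds a GET request
    ' and asks the framework to decompress gzip/deflate bodies on its own.
    Public Function CreateRequest(url As String) As HttpWebRequest
        Dim req As HttpWebRequest = WebRequest.CreateHttp(url)
        req.Method = "GET"
        req.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
        Return req
    End Function
End Module
```

With this property set, the stream returned by GetResponseStream is already decompressed, so wrapping it in GZipStream or DeflateStream is not needed.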
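Finally you have the real response stream. Read it with a StreamReader and you have the page's HTML source. This works for pages encoded as UTF-8. For pages encoded as GB2312 or GBK one change is needed: pass Encoding.GetEncoding("gbk") as the second argument of the StreamReader.
Here is the code: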
Public Function GetHttpContent(url As String) As String
    Try
        Dim req As HttpWebRequest = WebRequest.CreateHttp(url)
        With req
            .UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
            .Accept = "*/*"
            .Method = "GET"
            .Timeout = 300000 ' milliseconds
            .Headers.Add("Accept-Encoding", "gzip, deflate")
        End With
        ' Using disposes the response (and its connection) when we are done.
        Using resp As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
            ' Wrap the raw stream in the decompressor that matches Content-Encoding.
            Dim body As Stream = resp.GetResponseStream()
            Select Case If(resp.ContentEncoding, "").ToLowerInvariant()
                Case "gzip"
                    body = New GZipStream(body, CompressionMode.Decompress)
                Case "deflate"
                    body = New DeflateStream(body, CompressionMode.Decompress)
            End Select
            ' Disposing the reader also disposes the wrapped streams.
            Using sr As New StreamReader(body, Encoding.UTF8)
                Return sr.ReadToEnd()
            End Using
        End Using
    Catch ex As Exception
        ' Any failure (bad URL, network error, timeout) yields an empty string.
        Return ""
    End Try
End Function
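For the GB2312/GBK case mentioned earlier, the only thing that changes is the encoding handed to the StreamReader. Here is a small sketch of that step on its own, assuming .NET Framework (on .NET Core and later, "gbk" is only available after registering the code pages encoding provider). The names EncodingSketch and ReadBody are hypothetical, and the MemoryStream just stands in for a real response stream.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module EncodingSketch
    ' Hypothetical helper: decodes a stream with the named encoding,
    ' for example "utf-8" or "gbk".
    Public Function ReadBody(body As Stream, encodingName As String) As String
        Using sr As New StreamReader(body, Encoding.GetEncoding(encodingName))
            Return sr.ReadToEnd()
        End Using
    End Function

    Sub Main()
        ' Encode a short Chinese string as GBK bytes, then decode it again,
        ' standing in for a GBK page body received from a server.
        Dim bytes As Byte() = Encoding.GetEncoding("gbk").GetBytes("你好")
        Using ms As New MemoryStream(bytes)
            Console.WriteLine(ReadBody(ms, "gbk"))
        End Using
    End Sub
End Module
```

Passing the encoding name as a parameter keeps the fetching code the same for UTF-8 and GBK pages; which name to pass for a given site is still something you have to find out yourself, for example from the page's charset declaration.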
(My skills are limited, so if anything in the code could be better, corrections are welcome.)