  • Data collection class

    A crawler, also called a spider, is a program that fetches resources from other websites. In C#/.NET, a page can be fetched as follows:

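    // Requires: using System.IO; using System.Net; using System.Text;
    protected string GetPageHtml(string url)
    {
        try
        {
            WebRequest myreq = WebRequest.Create(url);
            // Dispose the response and the reader so the connection is released.
            using (WebResponse myrep = myreq.GetResponse())
            using (StreamReader reader = new StreamReader(myrep.GetResponseStream(), Encoding.GetEncoding("gb2312")))
            {
                // The page is assumed to be GB2312-encoded.
                return reader.ReadToEnd();
            }
        }
        catch
        {
            // Any failure (bad URL, network error, unsupported encoding) yields an empty string.
            return "";
        }
    }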


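    With the method above, a program can fetch the page source of a given URL.
    Some sites block crawlers, however. In that case the request has to imitate a browser, as in the following code: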

    // Requires: using System.IO; using System.Net; using System.Text;
    protected string GetPageHtml(string url)
    {
        try
        {
            HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create(url);
            // Send the Accept and User-Agent headers that a browser (here IE 6) would send.
            myReq.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
            myReq.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)";
            using (HttpWebResponse myRep = (HttpWebResponse)myReq.GetResponse())
            using (Stream myStream = myRep.GetResponseStream())
            using (StreamReader sr = new StreamReader(myStream, Encoding.Default))
            {
                // Encoding.Default is the system default encoding, which may not match the page.
                return sr.ReadToEnd();
            }
        }
        catch
        {
            // Any failure (bad URL, network error) yields an empty string.
            return "";
        }
    }
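
    The two methods above hard-code the text encoding ("gb2312" in the first, Encoding.Default in the second), so a page served in another charset would come back garbled. One possible refinement is to choose the encoding from the charset the server reports. The sketch below is only an illustration: CharsetHelper and Resolve are hypothetical names that do not appear in the original post, and the fallback encoding is left to the caller.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical helper, not part of the original post.
    public static class CharsetHelper
    {
        // Returns the encoding named by charset, or fallback when the name
        // is missing or not recognized by this runtime.
        public static Encoding Resolve(string charset, Encoding fallback)
        {
            if (string.IsNullOrWhiteSpace(charset))
            {
                return fallback;
            }
            try
            {
                // Servers sometimes quote the value, e.g. charset="utf-8".
                return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name.
                return fallback;
            }
        }
    }
    ```

    In the second method, the reader could then be created as new StreamReader(myStream, CharsetHelper.Resolve(myRep.CharacterSet, Encoding.Default)), since HttpWebResponse exposes the reported charset through its CharacterSet property. On .NET Core and .NET 5 or later, "gb2312" is only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).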
  • Original article: https://www.cnblogs.com/yujinchao88/p/3855051.html