  • C# & Fizzler: crawling web pages within a given website domain

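    This post uses Fizzler (an HtmlAgilityPack extension) with C# to extract data from web pages. Fizzler extends HtmlAgilityPack with support for CSS selectors in the style of jQuery.

    Extraction usually means building URLs that follow a pattern, then requesting each one and parsing the response:

    1. Suppose every xxx.sample.com/contactus.html page under a website contains an email field (the data to extract).

      a) When the site has subdomains, for example a.sample.com, aadr.sample.com, 135dj.sample.com, the names are fairly random.

       Solution: search Bing for site:b2b.sample.com. Every subdomain can be extracted from the result pages and assembled into xxx.sample.com/contactus.html; then send a request to that URL and parse the response.

       NOTE: the search URL for site:b2b.sample.com is assembled as follows,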

           http://www.bing.com/search?q=site%3A{b2b.sample.com}&go=Submit&qs=n&form=QBRE&pq=site%3A{b2b.sample.com}&sc=1-19&sp=-1&sk=&cvid=6165a189f5354b1982fb8cd6933abb6f&first={pageIndex}&FORM=PERE

    2. Pages such as www.sample.com/1456.html can be built directly by counting: 1456.html, 1457.html, 1458.html, etc. They are not listed here.
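    The two URL patterns above can be sketched as small helpers. This is a minimal sketch, not the author's code: the class and method names are hypothetical, and the Bing helper keeps only the q and first parameters on the assumption that the remaining query parameters in the NOTE are optional.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlUrls
{
    // Builds a Bing "site:" search URL starting at the given result offset.
    // Assumption: only q and first are required; other parameters are dropped.
    public static string BingSiteSearch(string domain, int firstResult)
    {
        if (string.IsNullOrWhiteSpace(domain))
            throw new ArgumentException("domain is required", nameof(domain));
        // EscapeDataString percent-encodes the colon in "site:".
        var query = Uri.EscapeDataString("site:" + domain);
        return "http://www.bing.com/search?q=" + query + "&first=" + firstResult;
    }

    // Case 1: the contact page of one discovered subdomain.
    public static string ContactPage(string subdomain)
    {
        return "http://" + subdomain + "/contactus.html";
    }

    // Case 2: pages whose names are consecutive numbers.
    public static IEnumerable<string> SequentialPages(string host, int firstId, int count)
    {
        for (var i = 0; i < count; i++)
            yield return "http://" + host + "/" + (firstId + i) + ".html";
    }
}
```

    For example, CrawlUrls.SequentialPages("www.sample.com", 1456, 3) yields the URLs for 1456.html, 1457.html and 1458.html.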

    How to use Fizzler:

    1. Install Fizzler from NuGet.

    2. For usage, refer to code.google.com.

    3. Use Bing to extract all subdomains under a website:

            // Namespaces used: System.Collections.Generic, System.Linq, System.Net, System.Text,
            // HtmlAgilityPack, Fizzler.Systems.HtmlAgilityPack
            private static List<string> GetSubdomains(string websiteDomain, int startPageIndex = 1, int pageCount = 999, int pageSize = 14)
            {
                var list = new List<string>();
                // Use Bing to search for the subdomains of a given website.
                var bingSearchUrlFormat = "http://www.bing.com/search?q=site%3a{0}&go=Submit&qs=n&pq=site%3a{0}&sc=1-100&sp=-1&sk=&cvid=a9b36439006f4b05b09f9202c5b784bd&first={1}&FORM=PQRE";

                WebClient client = new WebClient();
                client.Encoding = Encoding.UTF8;
                var doc = new HtmlDocument();

                // Integer division: every start page index below 10 begins at result 1.
                var first = (startPageIndex / 10) * 140 + 1;
                var stopIndex = first + pageCount * pageSize;
                // Each iteration requests one result page, advancing by pageSize results.
                for (var startItemSequenceNumber = first; startItemSequenceNumber < stopIndex; startItemSequenceNumber += pageSize)
                {
                    var response = client.DownloadString(string.Format(bingSearchUrlFormat, websiteDomain, startItemSequenceNumber));
                    HtmlDocumentExtensions.LoadHtml2(doc, response);
                    var docNode = doc.DocumentNode;
                    // The displayed URL of each search result sits in a <cite> element.
                    var subDomains = docNode.QuerySelectorAll(".sb_meta cite");
                    foreach (var subDomain in subDomains)
                    {
                        list.Add(subDomain.InnerText);
                    }
                }
                return list;
            }
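    GetSubdomains returns the raw text of every cite element, so the same host may appear on several result pages, and an entry may carry a scheme or a path that would break the "http://{0}/contactus.html" format used in step 5. The sketch below is one way to reduce such strings to distinct host names. The class and method names are hypothetical, and the exact shape of Bing's cite text is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class SubdomainCleaner
{
    // Reduces displayed URLs such as "https://a.sample.com/page" to "a.sample.com"
    // and drops duplicates (case-insensitive), keeping first-seen order.
    public static List<string> DistinctHosts(IEnumerable<string> citeTexts)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var hosts = new List<string>();
        foreach (var text in citeTexts)
        {
            if (string.IsNullOrWhiteSpace(text)) continue;
            var candidate = text.Trim();

            // Strip a leading scheme such as "http://" if present.
            var schemeEnd = candidate.IndexOf("://", StringComparison.Ordinal);
            if (schemeEnd >= 0) candidate = candidate.Substring(schemeEnd + 3);

            // Cut at the first path, query, fragment, or space character.
            var pathStart = candidate.IndexOfAny(new[] { '/', '?', '#', ' ' });
            if (pathStart >= 0) candidate = candidate.Substring(0, pathStart);

            if (candidate.Length > 0 && seen.Add(candidate)) hosts.Add(candidate);
        }
        return hosts;
    }
}
```

    A call such as SubdomainCleaner.DistinctHosts(GetSubdomains("b2b.sample.com")) would then feed clean host names into the URL format.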

    4. Get the nodes of a web page:

            private static List<HtmlNode> GetWebPageNodes(string url, string elementSelector, string attributeNameContained, string attributeNameContainedValueLike)
            {
                var client = new WebClient();
                client.Encoding = Encoding.UTF8;
                var response = client.DownloadString(url);
                var doc = new HtmlDocument();
                HtmlDocumentExtensions.LoadHtml2(doc, response);
                var docNode = doc.DocumentNode;
                // Keep the elements whose attribute value contains the given substring.
                // GetAttributeValue falls back to string.Empty when the attribute is missing, so no null check is needed.
                var nodes = (from node in docNode.QuerySelectorAll(elementSelector)
                             where node.HasAttributes && node.GetAttributeValue(attributeNameContained, string.Empty).Contains(attributeNameContainedValueLike)
                             select node).ToList();
    
                return nodes;
            }

    5. Getting the email address from a page:

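    // stopPageIndex is supplied by the caller; it binds to the startPageIndex parameter.
    var subdomains = GetSubdomains("b2b.sample.com", stopPageIndex, 10);
    var urlFormat = "http://{0}/contactus.html";
    foreach (var item in subdomains)
        GetWebPageNodes(string.Format(urlFormat, item), "body table a", "href", "mailto").FirstOrDefault();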
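    GetWebPageNodes returns the matching anchor node, not the address itself. A minimal sketch of pulling the address out of a mailto href follows. The helper name is hypothetical, and it assumes the href has the usual mailto:address?query form.

```csharp
using System;

public static class MailtoParser
{
    // Returns the address part of a "mailto:" href, or null when the
    // value is not a mailto link or carries no address.
    public static string ExtractAddress(string href)
    {
        const string prefix = "mailto:";
        if (href == null) return null;

        var value = href.Trim();
        if (!value.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)) return null;
        value = value.Substring(prefix.Length);

        // Drop optional parameters such as ?subject=...
        var queryStart = value.IndexOf('?');
        if (queryStart >= 0) value = value.Substring(0, queryStart);

        // Decode percent-escapes such as %40 for '@'.
        value = Uri.UnescapeDataString(value).Trim();
        return value.Length == 0 ? null : value;
    }
}
```

    Given a node returned by GetWebPageNodes, the call would look like MailtoParser.ExtractAddress(node.GetAttributeValue("href", string.Empty)).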

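    One remaining problem: Bing limits these subdomain searches. After 100~150 requests, the response is no longer the result page I want but an HTML page asking for a CAPTCHA to guard against attacks. I have not solved this yet; any pointers are welcome!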
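    The sketch below does not solve the CAPTCHA problem. It only shows one cautious pattern: pause between requests and stop as soon as a page yields no results, which is what a CAPTCHA page would look like to the selector in step 3. Whether a delay avoids the limit at all is an assumption; the class name and the fetchPage delegate are hypothetical, and fetchPage stands in for the download-and-parse step.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class PoliteCrawler
{
    // Calls fetchPage for pages 0..maxPages-1, sleeping between calls, and
    // stops early when a page returns nothing (end of results or a block page).
    public static List<string> Collect(Func<int, IList<string>> fetchPage, int maxPages, TimeSpan delay)
    {
        var results = new List<string>();
        for (var page = 0; page < maxPages; page++)
        {
            var items = fetchPage(page);
            if (items == null || items.Count == 0) break;
            results.AddRange(items);

            // No need to wait after the final page.
            if (page < maxPages - 1) Thread.Sleep(delay);
        }
        return results;
    }
}
```

    Stopping on the first empty page avoids sending further requests that would only return the same CAPTCHA HTML.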

  • Original article: https://www.cnblogs.com/paul-cheung/p/csharp-fizzler-to-crawl-web-page-in-a-certain-website-domain.html