使用fizzler [HtmlAgilityPackExtension]和c#进行网页数据提取;fizzler是HtmlAgilityPack的一个扩展,支持jQuery Selector;
提取数据一般都是有规律url拼凑,然后挨个儿发request得到response进行解析:
1.假如一个website下的所有xxx.sample.com/contactus.html里边存在邮箱字段(准备提取的数据)
a)当有子域名的时候,比如:a.sample.com, aadr.sample.com, 135dj.sample.com,随机性比较强;
解决方法:bing search engine中使用 site:b2b.sample.com搜索得到的result页面可以提取所有子域名,然后拼凑成xxx.sample.com/contactus.html,继而发送请求到这个url,得 到response进行解析;
NOTE:关于site:b2b.sample.com的搜索url拼凑如下,
http://www.bing.com/search?q=site%3A{b2b.sample.com}&go=Submit&qs=n&form=QBRE&pq=site%3A{b2b.sample.com}&sc=1-19&sp=-1&sk=&cvid=6165a189f5354b1982fb8cd6933abb6f&first={pageIndex}&FORM=PERE
2.像www.sample.com/1456.html的页面可以直接平凑1456.html/1457.html/1458.html etc.此处不列举;
Fizzler使用方法:
1.从nuget上安装Fizzler;
2.使用方法参考code.google.com;
3.使用bing提取website下的所有子域:
private static List<string> GetSubdomains(string websiteDomain, int startPageIndex = 1, int pageCount = 999, int pageSize = 14) { var list = new List<string>(); //using bind to search subdomains in a certain website var bingSearchUrlFormat = "http://www.bing.com/search?q=site%3a{0}&go=Submit&qs=n&pq=site%3a{0}&sc=1-100&sp=-1&sk=&cvid=a9b36439006f4b05b09f9202c5b784bd&first={1}&FORM=PQRE"; WebClient client = new WebClient(); client.Encoding = Encoding.UTF8; var doc = new HtmlDocument(); var first = (startPageIndex / 10) * 140 + 1; var stopIndex = first + pageCount*pageSize; var currentPageIndex = startPageIndex; for (var startItemSquenceNumber = first; startItemSquenceNumber < stopIndex; startItemSquenceNumber = startItemSquenceNumber + pageSize) { var response = client.DownloadString(string.Format(bingSearchUrlFormat, websiteDomain, startItemSquenceNumber)); HtmlDocumentExtensions.LoadHtml2(doc, response); var docNode = doc.DocumentNode; var subDomains = docNode.QuerySelectorAll(".sb_meta cite");foreach (var subDomain in subDomains) { list.Add(subDomain.InnerText); } }return list; }
4.获取网页节点:
private static List<HtmlNode> GetWebPageNodes(string url, string elementSelector, string attributeNameContained, string attributeNameContainedValueLike) { var client = new WebClient(); client.Encoding = Encoding.UTF8; var response = client.DownloadString(url); var doc = new HtmlDocument(); HtmlDocumentExtensions.LoadHtml2(doc, response); var docNode = doc.DocumentNode; var emailNode = docNode.QuerySelectorAll(elementSelector).Where(node => node.Attributes.Where(attr => attr.Name == attributeNameContained).FirstOrDefault().Value.Contains(attributeNameContainedValueLike)).FirstOrDefault(); var nodes = (from node in docNode.QuerySelectorAll(elementSelector) where node.HasAttributes && node.GetAttributeValue(attributeNameContained, string.Empty).Contains(attributeNameContainedValueLike) select node).ToList(); return nodes; }
5.获取某个网页中邮箱的方法:
var subdomains = GetSubdomains("b2b.sample.com", stopPageIndex, 10); var urlFormat = "http://{0}/contactus.html"; GetWebPageNodes(string.Format(urlFormat, item), "body table a", "href", "mailto").FirstOrDefault();
最后的问题:当通过bing搜索子域时会有限制,发送100~150个请求后获取到的response就不是我想要的页面,而是要求输入验证码防止攻击的html;此问题暂时未解决,望大神指点!