  • Scraping data by parsing HTML with the HtmlAgilityPack library and XPath

    The idea behind the scraper is simple: fetch the HTML for a given URL, parse it, and extract the data we need.

    First, add a reference to the HtmlAgilityPack DLL (download HtmlAgilityPack.dll).

    The HtmlDocument class loads the fetched HTML string and turns it into a navigable HtmlDocument object:

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    Once you have the HtmlDocument object, use XPath to filter out the nodes you want.

    //find the div nodes with class=zg_itemImmersion

    string xpathDiv = "//div[@class='zg_itemImmersion']";
    HtmlNodeCollection allDivs = doc.DocumentNode.SelectNodes(xpathDiv);

    XPath syntax is easy to look up online; it is simple, practical, and easy to understand.
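As a quick reference, the XPath patterns used throughout this post map onto HtmlAgilityPack calls like this (a minimal self-contained sketch; the HTML fragment is made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><span class='zg_rankNumber'>1.</span>" +
                     "<span class='zg_rankNumber'>2.</span></div>");

        // //span[@class='zg_rankNumber'] selects every matching span in the document
        var all = doc.DocumentNode.SelectNodes("//span[@class='zg_rankNumber']");
        Console.WriteLine(all.Count);            // 2

        // SelectSingleNode returns only the first match
        var first = doc.DocumentNode.SelectSingleNode("//span[@class='zg_rankNumber']");
        Console.WriteLine(first.InnerText);      // 1.
    }
}
```

Note that an XPath starting with `//` always searches from the document root, which matters inside the loop below.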

    The complete code:

    public static void GetData(string url, ref DataTable dt)
    {
        try
        {
            //WebClient returns Amazon's bot-check page instead of the real HTML
            //WebClient wc = new WebClient();
            //string html = wc.DownloadString(url);

            //HtmlWeb also works, but after several requests it stops returning HTML
            //HtmlWeb web = new HtmlWeb();
            //HtmlAgilityPack.HtmlDocument doc = web.Load(url);

            //fetch the HTML via HttpWebRequest
            string html = WebRequestPost(url);
            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            string xpathDiv = "//div[@class='zg_itemImmersion']";//each ranked product sits in a div with class=zg_itemImmersion
            HtmlNodeCollection allDivs = doc.DocumentNode.SelectNodes(xpathDiv);
            for (int i = 0; i < allDivs.Count; i++)
            {
                if (i > 2) break;

                //re-parse each div into a standalone HtmlNode; querying allDivs[i] directly with
                //an absolute XPath (//...) searches the whole document and always returns the first match
                HtmlNode node = HtmlNode.CreateNode(allDivs[i].InnerHtml);
                DataRow dr = dt.NewRow();

                //product rank
                string xpath = "//span[@class='zg_rankNumber']";//span with class=zg_rankNumber
                string indexText = node.SelectSingleNode(xpath).InnerText.Replace(".", "").Replace("\n", "").Trim();
                int rank = int.Parse(indexText);
                dr["排名"] = rank;

                //product name
                xpath = "//div[@class='p13n-sc-truncate p13n-sc-truncated-hyphen p13n-sc-line-clamp-2']";//div with the matching class list
                string name = node.SelectSingleNode(xpath).InnerText.Replace("\n", "").Trim();
                dr["商品名称"] = name;

                //product price
                xpath = "//span[@class='p13n-sc-price']";//span with class=p13n-sc-price
                string price = node.SelectSingleNode(xpath).InnerText.Replace("\n", "");
                dr["售价"] = price;

                //product detail link; position() is 1-based
                xpath = "//a[@class='a-link-normal' and position()=1]";//first a node with class=a-link-normal
                string href = node.SelectSingleNode(xpath).Attributes["href"].Value;
                href = "https://www.amazon.com" + href;
                string htmlDetail = WebRequestPost(href);
                HtmlAgilityPack.HtmlDocument docDetail = new HtmlAgilityPack.HtmlDocument();
                docDetail.LoadHtml(htmlDetail);
                xpath = "//div[@id='detailBulletsWrapper_feature_div']";//div with id=detailBulletsWrapper_feature_div
                HtmlNode nodeDetail = docDetail.DocumentNode.SelectSingleNode(xpath);
                if (nodeDetail != null)
                {
                    //first-available date
                    //xpath = "//li[position()=5]//span[position()=2]";//can't use a fixed li position: some products have 5 li items, others 6
                    //instead, find the span whose text contains 'Date first available at Amazon.com'
                    //and take its first following span sibling
                    xpath = "//span[contains(text(), 'Date first available at Amazon.com')]/following-sibling::span[1]";
                    string dateFirst = nodeDetail.SelectSingleNode(xpath).InnerText;
                    dr["首次上架日期"] = dateFirst;

                    //category rank information
                    xpath = "//li[@id='SalesRank']/b/following::text()[1]";//first text node after the b inside the li with id=SalesRank
                    string categoryRank = nodeDetail.SelectSingleNode(xpath).InnerText.Replace("(", "");//main category rank
                    xpath = "//li[@id='SalesRank']/ul[@class='zg_hrsr']";//ul with class=zg_hrsr inside the li with id=SalesRank
                    string detailRank = nodeDetail.SelectSingleNode(xpath).InnerText.Replace("&nbsp;", " ").Replace("&gt;", ">");//sub-category ranks
                    dr["排名信息"] = categoryRank + detailRank;
                }
                dt.Rows.Add(dr);
            }
        }
        catch (Exception ex)
        {
            MessageBox.Show("Scraping failed: " + ex.Message);
        }
    }
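The `WebRequestPost` helper called above is not included in the post. Judging by its call sites (it takes only a URL and returns the page HTML), it likely performs a plain GET; a minimal sketch under that assumption, with a hypothetical browser-like User-Agent, might look like:

```csharp
using System.IO;
using System.Net;

static string WebRequestPost(string url)
{
    // despite the name, the call sites only pass a URL, so a GET is assumed here
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";
    // a browser-like User-Agent makes Amazon's bot check less likely to trigger (assumption)
    request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
    using (var response = (HttpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}
```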

    Finally, once the content has been parsed out of the HTML, it is added to a DataTable and then exported to Excel.
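The post does not show the export step itself. One simple approach is to write the DataTable out as CSV, which Excel opens directly (a sketch, not the author's actual export code; `ExportToCsv` is a name chosen here for illustration):

```csharp
using System.Data;
using System.IO;
using System.Linq;

static void ExportToCsv(DataTable dt, string path)
{
    using (var writer = new StreamWriter(path))
    {
        // header row built from the column names
        writer.WriteLine(string.Join(",",
            dt.Columns.Cast<DataColumn>().Select(c => c.ColumnName)));
        foreach (DataRow row in dt.Rows)
        {
            // quote each field so commas in product names don't break the layout
            writer.WriteLine(string.Join(",",
                row.ItemArray.Select(v =>
                    "\"" + v.ToString().Replace("\"", "\"\"") + "\"")));
        }
    }
}
```

For a real .xlsx file, a spreadsheet library (e.g. EPPlus or ClosedXML) would be the usual choice instead.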

  • Original article: https://www.cnblogs.com/zfylzl/p/6949954.html