  • A .NET-based crawler application: DotnetSpider

            Recently, at a friend's request, I helped build a simple crawler. The requirements were modest: mainly, make it easy to extend crawling to different websites and pull out the desired data. Analysis of the collected data is a later topic that I'll cover step by step in future posts.

            There are quite a few open-source crawler frameworks. I had previously studied Java's Nutch, which also offers Lucene-based full-text search, as well as various Python crawlers. Why did I choose DotnetSpider? I once built a distributed framework in .NET whose internals resemble DotnetSpider's, so it felt familiar right away, and I've grown quite fond of it.

            First, an overview of the solution's layers:

    InternetSpider: a console program that can later be deployed as a Windows service

    ISee.Shaun.Spiders.Business: the crawler's central scheduling layer, responsible for configuring, starting, and running spiders

    ISee.Shaun.Spiders.Common: shared utilities, including the reflection helper, the Dianping area dictionary, and the callback delegate definition

    ISee.Shaun.Spiders.Pipeline: the BasePipeline implementation layer, mainly responsible for persisting data

    ISee.Shaun.Spiders.Processor: the BasePageProcessor implementation layer, mainly responsible for extracting data via XPath

    ISee.Shaun.Spiders.SpiderModel: the data model layer, responsible for entity definitions and EF data access

    Taking Dianping's (大众点评) Hunan-cuisine data as the example, the program executes as follows:

    InternetSpider reads the configuration file to get the URLs to crawl. Dianping paginates search results to at most 50 pages, so to reach more data we have to narrow the search criteria. Observation showed that crawling by key district works reasonably well; the address pattern is http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}, where the first placeholder is the district code and the second is the page number.
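
    For illustration, here is how the template expands for one district and page; the area code "r16" (西城区) comes from the dictionary shown further down:

    // Illustration only: expand the URL template for district "r16", page 3
    string template = "http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}";
    string url = string.Format(template, "r16", 3);
    // -> http://www.dianping.com/search/keyword/2/10_湖南菜/r16p3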

    Figure 1: Hunan-cuisine search address

    Figure 2: search address refined by district, 11 pages in total

    So where do the district codes come from? Open the page in Chrome and look at the source: the codes are all right there.

    Here is the dictionary:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    
    namespace ISee.Shaun.Spiders.Common
    {
        public static class DazhongdianpingArea
        {
            private static Dictionary<string, string> areaDic = null;
            public static Dictionary<string, string> GetAreaDic()
            {
                if (areaDic == null)
                {
                    areaDic = new Dictionary<string, string>();
                    areaDic.Add("r16", "西城区");
                    areaDic.Add("r15", "东城区");
                    areaDic.Add("r17", "海淀区");
                    areaDic.Add("r328", "石景山区");
                    areaDic.Add("r14", "朝阳区");
                    areaDic.Add("r20", "丰台区");
                    areaDic.Add("r9158", "顺义区");
                    areaDic.Add("r5950", "昌平区");
                    areaDic.Add("r5952", "大兴区");
                    areaDic.Add("r9157", "房山区");
                    areaDic.Add("r5951", "通州区");
                    areaDic.Add("c4453", "怀柔区");
                    areaDic.Add("c435", "延庆区");
                    areaDic.Add("c434", "密云区");
                    areaDic.Add("c4454", "门头沟区");
                    areaDic.Add("c4455", "平谷区");
                }
                return areaDic;
            }
        }
    }

    OK, now let's look at the configuration file, which holds the addresses we need:

    <?xml version="1.0" encoding="utf-8"?>
    <configuration>
      <configSections>
        <!-- For more information on Entity Framework configuration, visit http://go.microsoft.com/fwlink/?LinkID=237468 -->
        <section name="entityFramework" type="System.Data.Entity.Internal.ConfigFile.EntityFrameworkSection, EntityFramework, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" requirePermission="false" />
      </configSections>
      <appSettings>
        <!-- Broad-category crawl address, 50 pages in total -->
        <add key="WebUrls" value="http://www.dianping.com/search/keyword/2/10_湖南菜/p{0}" />
        <!-- Refined address with the district code added -->
        <add key="WebAreaUrls" value="http://www.dianping.com/search/keyword/2/10_湖南菜/{0}p{1}" />
      </appSettings>
      <startup>
        <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.6.1" />
      </startup>
      <connectionStrings>
        <!-- Database connection string -->
        <add name="ConnectionStr" connectionString="data source=.;initial catalog=Membership_Spider;integrated security=True;user id=sa;password=123asd!@#;multipleactiveresultsets=True;" providerName="System.Data.SqlClient" />
      </connectionStrings>
      <entityFramework>
        <defaultConnectionFactory type="System.Data.Entity.Infrastructure.LocalDbConnectionFactory, EntityFramework">
          <parameters>
            <parameter value="mssqllocaldb" />
          </parameters>
        </defaultConnectionFactory>
        <providers>
          <provider invariantName="System.Data.SqlClient" type="System.Data.Entity.SqlServer.SqlProviderServices, EntityFramework.SqlServer" />
        </providers>
      </entityFramework>
    </configuration>

    Once we have the page URLs, we need to initialize the crawler. I defined a RunSpider class; at construction time it takes the Processor and Pipeline implementation class names as strings, the encoding, and so on. Call its Run method and the crawl starts.

    using ISee.Shaun.Spiders.Business;
    using ISee.Shaun.Spiders.Common;
    using System;
    using System.Collections.Generic;
    using System.Configuration;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    
    namespace InternetSpider
    {
        class Program
        {
            private static string urlInfo = ConfigurationManager.AppSettings["WebUrls"];
            private static string urlAreaInfo = ConfigurationManager.AppSettings["WebAreaUrls"];
            static void Main(string[] args)
            {
                Run();
            }
    
            /// <summary>
            /// Begin spider
            /// </summary>
            private static void Run()
            {
                // Build the start URLs: every district code times 50 pages
                Dictionary<string, string> areaDic = DazhongdianpingArea.GetAreaDic();
                List<string> urls = new List<string>();
                foreach (var key in areaDic.Keys)
                {
                    for (int i = 1; i <= 50; i++)
                    {
                        urls.Add(string.Format(urlAreaInfo, key, i));
                    }
                }
                RunSpider runSpiders = new RunSpider("DazhongdianpingProcessor", "DazhongdianpingPipeline", "UTF-8", true);
                runSpiders.Run(urls);
    
                // Alternative: crawl the broad-category address by page count
                //RunSpider runSpider = new RunSpider("DazhongdianpingProcessor", "DazhongdianpingPipeline", "UTF-8", true);
                //runSpider.Run(urlInfo, 50);
            }
        }
    }

     I won't belabor RunSpider; the code comments explain it. Its main purpose is to make starting new tasks convenient, whether for sites under different domains or for the child-page crawls launched through the delegate used here. The reflection exists so that, in later extensions, batch task configuration files can be created and tasks executed automatically:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using DotnetSpider.Core;
    using DotnetSpider.Core.Downloader;
    using DotnetSpider.Core.Pipeline;
    using DotnetSpider.Core.Processor;
    using DotnetSpider.Core.Scheduler;
    using ISee.Shaun.Spiders.Common;
    using ISee.Shaun.Spiders.Pipeline;
    using ISee.Shaun.Spiders.Processor;
    
    namespace ISee.Shaun.Spiders.Business
    {
        public class RunSpider
        {
            private const string ASSEMBLY_PROCESSOR_NAME = "ISee.Shaun.Spiders.Processor";
            private const string ASSEMBLY_PIPELINE_NAME = "ISee.Shaun.Spiders.Pipeline";
            private BaseProcessor processor = null;
            private BasePipeline pipeline = null;
            private Site site = null;
            private string encoding = string.Empty;
            private bool removeOutBound = false;
    
            private int spiderThreadNums = 1;
            public int SpiderThreadNums { get => spiderThreadNums; set => spiderThreadNums = value; }
    
            /// <summary>
            /// Constructor
            /// </summary>
            /// <param name="processorName"></param>
            /// <param name="pipeLineName"></param>
            public RunSpider(string processorName, string pipeLineName, string encoding, bool removeOutBound)
            {
            //Resolve the processor implementation class via reflection
            processor = ReflectionInvoke.GetInstance(ASSEMBLY_PROCESSOR_NAME, processorName, null) as BaseProcessor;
            //If results need to flow back, hook this delegate; here it continues with the child-page crawl
            processor.InvokeFoodUrls = this.InvokeNext;
                pipeline = ReflectionInvoke.GetInstance(ASSEMBLY_PIPELINE_NAME, pipeLineName, null) as BasePipeline;
                this.encoding = encoding;
                this.removeOutBound = removeOutBound;
            }
    
            /// <summary>
        /// Run by page number
            /// </summary>
            /// <param name="urlInfo"></param>
            /// <param name="times"></param>
            public void Run(string urlInfo, int times)
            {
                SetSite(encoding, removeOutBound, urlInfo, times);
                Run();
            }
    
            /// <summary>
        /// Run with a list of URLs
            /// </summary>
            /// <param name="urlList"></param>
            public void Run(List<string> urlList)
            {
                SetSite(encoding, removeOutBound, urlList);
                Run();
            }
    
            /// <summary>
            /// Begin spider
            /// </summary>
            private void Run()
            {
                Spider spider = Spider.Create(site, new QueueDuplicateRemovedScheduler(), processor);
                spider.AddPipeline(pipeline);
                spider.Downloader = new HttpClientDownloader();
                spider.ThreadNum = this.spiderThreadNums;
                spider.EmptySleepTime = 3000;
                spider.Deep = 3;
                spider.Run();
            }
    
            private void InvokeNext(string processorName, string pipeLineName, List<string> foodUrls)
            {
                RunSpider runSpider = new RunSpider(processorName, pipeLineName, this.encoding, true);
                runSpider.Run(foodUrls);
            }
    
            /// <summary>
        /// Set the site start URLs from a URL template and a page count
            /// </summary>
            /// <param name="encoding"></param>
            /// <param name="removeOutBound"></param>
            /// <param name="urlInfo"></param>
            /// <param name="times"></param>
            private void SetSite(string encoding, bool removeOutBound, string urlInfo, int times)
            {
            this.site = new Site { EncodingName = encoding, RemoveOutboundLinks = removeOutBound };
                if (times == 0)
                {
                    this.site.AddStartUrl(urlInfo);
                }
                else
                {
                    List<string> urls = new List<string>();
                    for (int i = 1; i <= times; ++i)
                    {
                        urls.Add(string.Format(urlInfo, i));
                    }
                    this.site.AddStartUrls(urls);
                }
            }
    
            /// <summary>
        /// Set the site start URLs from a URL list
            /// </summary>
            /// <param name="encoding"></param>
            /// <param name="removeOutBound"></param>
            /// <param name="urlList"></param>
            private void SetSite(string encoding, bool removeOutBound, List<string> urlList)
            {
            this.site = new Site { EncodingName = encoding, RemoveOutboundLinks = removeOutBound };
                this.site.AddStartUrls(urlList);
            }
        }
    }
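
    The ReflectionInvoke helper used in the constructor above is not listed in this post. A minimal sketch of what its GetInstance method might look like, assuming it simply loads the assembly by name and activates the named type (the shape is inferred from the calls above, not taken from the real source):

    using System.Reflection;
    
    namespace ISee.Shaun.Spiders.Common
    {
        public static class ReflectionInvoke
        {
            // Hypothetical sketch: load assemblyName and create an instance of
            // "{assemblyName}.{className}", passing args to the constructor.
            public static object GetInstance(string assemblyName, string className, object[] args)
            {
                Assembly assembly = Assembly.Load(assemblyName);
                return assembly.CreateInstance(assemblyName + "." + className, false,
                    BindingFlags.Public | BindingFlags.Instance, null, args, null, null);
            }
        }
    }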

    As for the Processor: since I will later add implementation classes for different websites, the shared properties and behavior need to be pulled up into a base class. The code:

    using DotnetSpider.Core;
    using DotnetSpider.Core.Processor;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using static ISee.Shaun.Spiders.Common.DelegeteDefine;
    
    namespace ISee.Shaun.Spiders.Processor
    {
        public class BaseProcessor : BasePageProcessor
        {
            protected List<string> foodUrls = null;
            public CallbackEventHandler InvokeFoodUrls { get; set; }
    
            protected string SourceWebsite { get; set; }
    
            public BaseProcessor() { foodUrls = new List<string>(); }
    
            protected override void Handle(Page page)
            {
                throw new NotImplementedException();
            }
    
            protected virtual void InvokeCallback(string processorName, string pipeLineName)
            {
                if (InvokeFoodUrls != null && this.foodUrls.Count > 0)
                {
                    InvokeFoodUrls(processorName, pipeLineName, this.foodUrls);
                }
            }
        }
    }
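
    The CallbackEventHandler type comes from DelegeteDefine in the Common project, which the post never lists. Judging from how it is invoked above and how RunSpider.InvokeNext is assigned to it, the definition is presumably along these lines (a sketch, not the original file):

    using System.Collections.Generic;
    
    namespace ISee.Shaun.Spiders.Common
    {
        public static class DelegeteDefine
        {
            // Carries the processor/pipeline class names for the child crawl,
            // plus the shop-detail URLs collected by the parent processor.
            public delegate void CallbackEventHandler(string processorName, string pipeLineName, List<string> foodUrls);
        }
    }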

    Now the concrete implementation class. (I won't elaborate on XPath; there is plenty of material online. If the page structure is unclear, use Chrome's developer tools, or grab the HTML while debugging and analyze it yourself; I won't add screenshots for that here.)

    using DotnetSpider.Core;
    using DotnetSpider.Core.Processor;
    using DotnetSpider.Core.Selector;
    using ISee.Shaun.Spiders.Common;
    using ISee.Shaun.Spiders.SpiderModel.Model;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using static ISee.Shaun.Spiders.Common.DelegeteDefine;
    
    namespace ISee.Shaun.Spiders.Processor
    {
        public class DazhongdianpingProcessor : BaseProcessor
        {
            public DazhongdianpingProcessor() : base()
            {
            //Tag the data source
                SourceWebsite = "大众点评";
            }
    
            /// <summary>
        /// Override the base Handle method and extract the data
            /// </summary>
            /// <param name="page"></param>
            protected override void Handle(Page page)
            {
            // Use Selectable to query the page and build the data objects we want
                var totalVideoElements = page.Selectable.SelectList(Selectors.XPath(".//div[@class='shop-list J_shop-list shop-all-list']/ul/li")).Nodes();
                if (totalVideoElements == null)
                {
                    return;
                }
            //The collection of restaurants extracted from this page
                List<Restaurant> restaurantList = new List<Restaurant>();
                foreach (var restElement in totalVideoElements)
                {
                    var restaurant = new Restaurant() { SourceWebsite = SourceWebsite };
                //Extract the restaurant fields via XPath
                restaurant.Name = restElement.Select(Selectors.XPath(".//h4")).GetValue();
                var price = restElement.Select(Selectors.XPath(".//div[@class='txt']/div/a[@class='mean-price']/b")).GetValue();
                //Strip the leading currency symbol; the character was lost in the original post, so ￥ is assumed here
                restaurant.AveragePrice = string.IsNullOrEmpty(price) ? "0" : price.Replace("￥", "");
                    restaurant.Type = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/a/span[@class='tag']")).GetValue();
                    restaurant.Star = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='comment']/span/@title")).GetValue();
                    restaurant.ImageUrl = restElement.Select(Selectors.XPath(".//div[@class='pic']/a/img/@src")).GetValue();
                    var areaCode = page.Url.Substring(page.Url.LastIndexOf('/')+1);
                    if (!string.IsNullOrEmpty(areaCode) && (areaCode.Contains("r")|| areaCode.Contains("c")))
                    {
                        Dictionary<string, string> areaDic = DazhongdianpingArea.GetAreaDic();
                        string result= areaCode.Substring(0, areaCode.IndexOf('p'));
                        if (areaDic.ContainsKey(result))
                        {
                            restaurant.Area = areaDic[result];
                        }
                    }
    
                    List<ISelectable> infoList = restElement.SelectList(Selectors.XPath("./div[@class='txt']/span[@class='comment-list']/span/b")).Nodes() as List<ISelectable>;
                    if (infoList != null && infoList.Count > 0)
                    {
                        var result = infoList[0].GetValue();
                        restaurant.Taste = string.IsNullOrEmpty(result) ? string.Empty : result;
                        result = infoList[1].GetValue();
                        restaurant.Environment = string.IsNullOrEmpty(result) ? string.Empty : result;
                        result = infoList[2].GetValue();
                        restaurant.ServiceScore = string.IsNullOrEmpty(result) ? string.Empty : result;
                    }
    
                    var recommetList = restElement.SelectList(Selectors.XPath(".//div[@class='txt']/div[@class='recommend']/a")).Nodes();
                    restaurant.Recommendation = string.Join(",", recommetList.Select(o => o.GetValue()));
                    restaurant.Address = restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/span")).GetValue();
                    restaurant.Position= restElement.Select(Selectors.XPath(".//div[@class='txt']/div[@class='tag-addr']/a[@data-click-name='shop_tag_region_click']/span[@class='tag']")).GetValue();
    
                    var shopUrl = restElement.Select(Selectors.XPath(".//div[@class='txt']/div/a/@href")).GetValue();
                    restaurant.Code = shopUrl.Substring(shopUrl.LastIndexOf('/') + 1);
                    restaurantList.Add(restaurant);
    
                    //add next links
                    if (!string.IsNullOrEmpty(shopUrl))
                    {
                        this.foodUrls.Add(shopUrl);
                    }
                }
            // For a second-level crawl, uncomment this and implement the two corresponding classes
            //InvokeCallback("DazhongdianpingFoodProcessor", "DazhongdianpingFoodPipeline");
            // Save the data under a custom key in the page for the Pipeline to consume
            page.AddResultItem("RestaurantList", restaurantList);
            }
        }
    }

    The data entity definition:

    using System;
    using System.Collections.Generic;
    using System.ComponentModel.DataAnnotations;
    using System.ComponentModel.DataAnnotations.Schema;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    
    namespace ISee.Shaun.Spiders.SpiderModel.Model
    {
        public class FoodInfo
        {
            [Key]
            public int Id { get; set; }
            public int RestaurantId { get; set; }
            public string Code { get; set; }
            public string RestaurantCode { get; set; }
            public string Name { get; set; }
            public string Price { get; set; }
            public string FoodImageUrl { get; set; }
            [ForeignKey("RestaurantId")]
            public Restaurant restaurant { get; set; }
        }
    }
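
    The Restaurant entity itself is not listed in the post. Reconstructed from the properties assigned in DazhongdianpingProcessor and queried in the pipeline below, it presumably looks roughly like this (the key attribute is an assumption):

    using System.ComponentModel.DataAnnotations;
    
    namespace ISee.Shaun.Spiders.SpiderModel.Model
    {
        // Reconstructed from usage elsewhere in the post; not the original file.
        public class Restaurant
        {
            [Key]
            public int Id { get; set; }   // assumed primary key
            public string Code { get; set; }
            public string Name { get; set; }
            public string AveragePrice { get; set; }
            public string Type { get; set; }
            public string Star { get; set; }
            public string ImageUrl { get; set; }
            public string Area { get; set; }
            public string Taste { get; set; }
            public string Environment { get; set; }
            public string ServiceScore { get; set; }
            public string Recommendation { get; set; }
            public string Address { get; set; }
            public string Position { get; set; }
            public string SourceWebsite { get; set; }
        }
    }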

    Once the data has been extracted, the crawler automatically hands the collected results to the pipeline. Straight to the code:

    using DotnetSpider.Core;
    using DotnetSpider.Core.Pipeline;
    using ISee.Shaun.Spiders.SpiderModel.Model;
    using ISee.Shaun.Spiders.SpiderModel;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    
    namespace ISee.Shaun.Spiders.Pipeline
    {
        public class DazhongdianpingPipeline : BasePipeline
        {
            /// <summary>
        /// Persist the restaurant data
            /// </summary>
            /// <param name="resultItems"></param>
            /// <param name="spider"></param>
            public override void Process(IEnumerable<ResultItems> resultItems, ISpider spider)
            {
            //Iterate over the result items
                foreach (ResultItems entry in resultItems)
                {
                //Create the EF context
                    using (var rEntity = new FoodInfoEntity())
                    {
                        List<Restaurant> resList = new List<Restaurant>();
                        foreach (Restaurant result in entry.Results["RestaurantList"])
                        {
                        //Use restaurant name plus address as the de-duplication key
                            var resultList = rEntity.RestaurantInfo.Where(o => o.Name == result.Name && o.Address == result.Address).ToList();
                            if (resultList.Count == 0)
                            {
                                resList.Add(result);
                            }
                        }
                        if (resList.Count > 0)
                        {
                            rEntity.RestaurantInfo.AddRange(resList);
                            rEntity.SaveChanges();
                        }
                    }
                }
    
            }
        }
    }
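
    The FoodInfoEntity context used above is not shown either. Given the EF Code First approach mentioned in note 3 below, it is presumably a small DbContext along these lines (a sketch under that assumption; everything except RestaurantInfo and the ConnectionStr name is a guess):

    using System.Data.Entity;
    using ISee.Shaun.Spiders.SpiderModel.Model;
    
    namespace ISee.Shaun.Spiders.SpiderModel
    {
        // Sketch of the EF Code First context; the real class may differ.
        public class FoodInfoEntity : DbContext
        {
            // "ConnectionStr" matches the connection string name in App.config
            public FoodInfoEntity() : base("name=ConnectionStr") { }
    
            public DbSet<Restaurant> RestaurantInfo { get; set; }
            public DbSet<FoodInfo> FoodInfo { get; set; }   // assumed
        }
    }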

    And that's the whole thing, simple as it is. A few points worth stressing:

    1. To crawl data from a large number of pages, you can add extra XML configuration files that define the crawl rules or tasks. (I won't go into detail here; leave a comment if you have questions.)

    2. To extend to other sites such as Meituan, just implement the corresponding classes in the Processor and Pipeline layers.

    3. For the data entities I used EF's Code First approach; feel free to swap in whatever approach or database you prefer. See the many EF articles online.

    That's all for today; it was mostly code, so make of it what you will. Also, starting next week, my 1024伐木累 series, which has been on hiatus for more than two years, will resume updates; I just want to see this through properly. All the best!

     Addendum: the GitHub repo is at https://github.com/sall84993356/Spiders.git
