[Open-source .NET cross-platform crawler / data-collection framework: DotnetSpider] [Part 3] Configuration-based crawlers

    [DotnetSpider series index]

    The basic usage introduced in the previous post offers a lot of freedom, but it also means writing more code. In my line of work, most crawlers are focused crawlers: they only need to collect designated pages and structure the data. To improve development efficiency, I implemented a way to build crawlers from entity configuration.

    Create a Console project

    Add the package via NuGet

    DotnetSpider2.Extension
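For example, from the Visual Studio Package Manager Console, or with the .NET CLI:

```shell
# Visual Studio Package Manager Console
Install-Package DotnetSpider2.Extension

# .NET CLI equivalent
dotnet add package DotnetSpider2.Extension
```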

    Define the configuration-style data entity

    • The data entity must implement ISpiderEntity
    • Schema defines the database name, the table name, and the table-name suffix
    • Indexes defines the table's primary key, unique indexes, and ordinary indexes
    • EntitySelector defines the rule for extracting entity objects from the page data

    Define a bare data entity class

    public class Product : ISpiderEntity
    {
    }
    Open the JD product listing page in Chrome: http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main
    1. Press F12 to open the developer tools
    2. Select a product and inspect its HTML structure

              (screenshot: a product node inspected in Chrome DevTools)

    We can see that every product sits inside a DIV whose class is gl-i-wrap j-sku-item, so we add an EntitySelector attribute to the Product class. (There is more than one valid way to write the XPath; if you are unfamiliar with XPath, W3Schools is a good place to learn it. The framework also supports CSS selectors and even regular expressions for picking out the right HTML fragment.)

        [EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
        public class Product : ISpiderEntity

    1. Add the database and index information

      [Schema("test", "sku", TableSuffix.Today)]
      [EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
      [Indexes(Index = new[] { "category" }, Unique = new[] { "category,sku", "sku" })]
      public class Product : ISpiderEntity
    2. Suppose you need to collect the SKU. Inspect the HTML structure and work out the relative XPath. Why a relative XPath? Because EntitySelector has already cut the HTML into fragments: every element query inside the entity is evaluated relative to the element that EntitySelector matched. Finally, add the database column information

      [Schema("test", "sku", TableSuffix.Today)]
      [EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
      [Indexes(Index = new[] { "category" }, Unique = new[] { "category,sku", "sku" })]
      public class Product : ISpiderEntity
      {
           [StoredAs("sku", DataType.String, 25)]
           [PropertySelector(Expression = "./@data-sku")]
           public string Sku { get; set; }
       }
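To make the relative-query idea concrete, here is a minimal stand-alone sketch. It uses HtmlAgilityPack purely for illustration (an assumption on my part; DotnetSpider has its own selector layer internally, but the semantics are the same):

```csharp
using System;
using HtmlAgilityPack; // illustration only; DotnetSpider uses its own selectors

class RelativeXPathDemo
{
    static void Main()
    {
        // A tiny stand-in for the JD listing markup
        var html = "<ul><li class='gl-item'>" +
                   "<div class='gl-i-wrap j-sku-item' data-sku='123'></div>" +
                   "</li></ul>";
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // EntitySelector: cut the page into one fragment per product
        var items = doc.DocumentNode.SelectNodes(
            "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]");

        foreach (var item in items)
        {
            // PropertySelector "./@data-sku" is evaluated against each
            // fragment, not against the whole document
            Console.WriteLine(item.GetAttributeValue("data-sku", ""));
        }
    }
}
```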
    3. Internally the crawler stores each link as a Request object. Extra property values can be attached when a Request is constructed, and a data entity is then allowed to read data from those extras

      [StoredAs("category", DataType.String, 20)]
      [PropertySelector(Expression = "name", Type = SelectorType.Enviroment)]
      public string CategoryName { get; set; }
    Configure the spider (inherit from EntitySpiderBuilder)
        protected override EntitySpider GetEntitySpider()
        {
            EntitySpider context = new EntitySpider(new Site
            {
                //HttpProxyPool = new HttpProxyPool(new KuaidailiProxySupplier("快代理API"))
            })
            {
                UserId = "DotnetSpider",
                TaskGroup = "JdSkuSampleSpider"
            };
            context.SetThreadNum(1);
            context.SetIdentity("JD_sku_store_test_" + DateTime.Now.ToString("yyyy_MM_dd_hhmmss"));
            context.AddEntityPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));
            context.AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main", new Dictionary<string, object> { { "name", "手机" }, { "cat3", "655" } });
            context.AddEntityType(typeof(Product), new TargetUrlExtractor
            {
                Region = new BaseSelector { Type = SelectorType.XPath, Expression = "//span[@class='p-num']" },
                Patterns = new List<string> { @"&page=[0-9]+&" }
            });
            return context;
        }
    1. The Dictionary<string, object> passed as the second argument of AddStartUrl supplies the data for SelectorType.Enviroment queries

    2. Configure the Scheduler: by default URLs are scheduled through an in-memory queue; to run a distributed crawl across several machines, configure a RedisScheduler instead

      context.SetScheduler(new RedisScheduler
       {
           Host = "",
           Password = "",
           Port = 6379
       });
    3. When registering an entity type, you can attach link-validation rules. This is useful when one site yields several kinds of links that map to different data entities. The same rules also extract matching URLs from the current page and add them to the Scheduler for further crawling.

      context.AddEntityType(typeof(Product), new TargetUrlExtractor
      {
           Region = new BaseSelector { Type = SelectorType.XPath, Expression = "//span[@class='p-num']" },
          Patterns = new List<string> { @"&page=[0-9]+&" }
      });
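The Patterns entries are ordinary regular expressions matched against the URLs found inside the Region. As a quick sanity check with plain .NET, independent of the framework:

```csharp
using System;
using System.Text.RegularExpressions;

class PatternCheck
{
    static void Main()
    {
        // The same pattern used in TargetUrlExtractor above
        var pattern = new Regex(@"&page=[0-9]+&");

        // A paging link from the JD listing matches...
        Console.WriteLine(pattern.IsMatch(
            "http://list.jd.com/list.html?cat=9987,653,655&page=3&JL=6_0_0")); // True

        // ...while a product detail link does not
        Console.WriteLine(pattern.IsMatch("http://item.jd.com/3133811.html")); // False
    }
}
```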


    1. Add a MySql entity pipeline; only the connection string needs to be configured

      context.AddEntityPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));
    Full code
    
    
    public class JdSkuSampleSpider : EntitySpiderBuilder
        {
            protected override EntitySpider GetEntitySpider()
            {
                EntitySpider context = new EntitySpider(new Site
                {
                    //HttpProxyPool = new HttpProxyPool(new KuaidailiProxySupplier("快代理API"))
                })
                {
                    UserId = "DotnetSpider",
                    TaskGroup = "JdSkuSampleSpider"
                };
                context.SetThreadNum(1);
                context.SetIdentity("JD_sku_store_test_" + DateTime.Now.ToString("yyyy_MM_dd_hhmmss"));
                context.AddEntityPipeline(new MySqlEntityPipeline("Database='test';Data Source=localhost;User ID=root;Password=1qazZAQ!;Port=3306"));
                context.AddStartUrl("http://list.jd.com/list.html?cat=9987,653,655&page=2&JL=6_0_0&ms=5#J_main", new Dictionary<string, object> { { "name", "手机" }, { "cat3", "655" } });
                context.AddEntityType(typeof(Product), new TargetUrlExtractor
                {
                    Region = new BaseSelector { Type = SelectorType.XPath, Expression = "//span[@class='p-num']" },
                    Patterns = new List<string> { @"&page=[0-9]+&" }
                });
                return context;
            }
    
            [Schema("test", "sku", TableSuffix.Today)]
            [EntitySelector(Expression = "//li[@class='gl-item']/div[contains(@class,'j-sku-item')]")]
            [Indexes(Index = new[] { "category" }, Unique = new[] { "category,sku", "sku" })]
            public class Product : ISpiderEntity
            {
                [StoredAs("sku", DataType.String, 25)]
                [PropertySelector(Expression = "./@data-sku")]
                public string Sku { get; set; }
    
                [StoredAs("category", DataType.String, 20)]
                [PropertySelector(Expression = "name", Type = SelectorType.Enviroment)]
                public string CategoryName { get; set; }
    
                [StoredAs("cat3", DataType.String, 20)]
                [PropertySelector(Expression = "cat3", Type = SelectorType.Enviroment)]
                public int CategoryId { get; set; }
    
                [StoredAs("url", DataType.Text)]
                [PropertySelector(Expression = "./div[1]/a/@href")]
                public string Url { get; set; }
    
                [StoredAs("commentscount", DataType.String, 32)]
                [PropertySelector(Expression = "./div[5]/strong/a")]
                public long CommentsCount { get; set; }
    
                [StoredAs("shopname", DataType.String, 100)]
                [PropertySelector(Expression = ".//div[@class='p-shop']/@data-shop_name")]
                public string ShopName { get; set; }
    
                [StoredAs("name", DataType.String, 50)]
                [PropertySelector(Expression = ".//div[@class='p-name']/a/em")]
                public string Name { get; set; }
    
                [StoredAs("venderid", DataType.String, 25)]
                [PropertySelector(Expression = "./@venderid")]
                public string VenderId { get; set; }
    
                [StoredAs("jdzy_shop_id", DataType.String, 25)]
                [PropertySelector(Expression = "./@jdzy_shop_id")]
                public string JdzyShopId { get; set; }
    
                [StoredAs("run_id", DataType.Date)]
                [PropertySelector(Expression = "Monday", Type = SelectorType.Enviroment)]
                public DateTime RunId { get; set; }
    
                [PropertySelector(Expression = "Now", Type = SelectorType.Enviroment)]
                [StoredAs("cdate", DataType.Time)]
                public DateTime CDate { get; set; }
            }
        }
    
    
    Run the spider
    public class Program
    {
        public static void Main(string[] args)
        {
            JdSkuSampleSpider spiderBuilder = new JdSkuSampleSpider();
            spiderBuilder.Run("rerun");
        }
    }


    A complete crawler in fewer than 100 lines of code. Isn't that remarkably simple?

  • Original article: https://www.cnblogs.com/jjg0519/p/6707540.html