  • [Open-source .NET cross-platform crawler and data collection framework: DotnetSpider] [4] Parsing JSON data

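    [DotnetSpider series index]

    Scenario

    Continuing from the previous post: suppose we failed to store the shop information that belongs to each JD SKU. Do we now have to re-crawl all of the SKU data from scratch? A full re-crawl would make the historical data unusable. So instead, we look at the JD (jd.com) pages to see whether they expose an interface that returns the missing information.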

    Finding the API request

    1. Install Fiddler and open it.

    2. In Google Chrome, visit: http://list.jd.com/list.html?cat=1315,1343,9719

    3. Go through the requests recorded in Fiddler one by one and find the interface we need.


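    Writing the spider

    1. After examining the returned data, we can start by writing the definition of the data object. Note that the Expression values are now JsonPath query expressions, so Type must be set to SelectorType.JsonPath. Also note that this is an update-type spider: the collected data is written back into the original table, so the primary key must be declared by adding a primary key definition to the data class.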

      // Maps one element of the returned JSON array to a row of the jd.sku_v2 table.
      [Schema("jd", "sku_v2", TableSuffix.Monday)]
      [EntitySelector(Expression = "$.[*]", Type = SelectorType.JsonPath)]
      // Primary key, required because this spider updates existing rows.
      [Indexes(Primary = "sku")]
      public class ProductUpdater : ISpiderEntity
      {
          [StoredAs("sku", DataType.String, 25)]
          [PropertySelector(Expression = "$.pid", Type = SelectorType.JsonPath)]
          public string Sku { get; set; }

          [StoredAs("shopname", DataType.String, 100)]
          [PropertySelector(Expression = "$.seller", Type = SelectorType.JsonPath)]
          public string ShopName { get; set; }

          [StoredAs("shopid", DataType.String, 25)]
          [PropertySelector(Expression = "$.shopId", Type = SelectorType.JsonPath)]
          public string ShopId { get; set; }
      }
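To make the selectors concrete, here is a minimal sketch of what "$.[*]" and "$.pid" / "$.seller" / "$.shopId" pick out of a response. It uses plain System.Text.Json rather than DotnetSpider, and the sample payload is an assumption: only the field names come from the selectors above, the values are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class JsonPathSketch
{
    // Hypothetical payload: field names follow the selectors above,
    // the values are invented for illustration.
    public const string SampleJson =
        "[{\"pid\":1001,\"seller\":\"Shop A\",\"shopId\":501}," +
        "{\"pid\":1002,\"seller\":\"Shop B\",\"shopId\":502}]";

    // "$.[*]" corresponds to iterating the array elements; "$.pid" and the
    // other property selectors correspond to reading one property of each element.
    public static List<(string Sku, string ShopName, string ShopId)> Extract(string json)
    {
        var rows = new List<(string Sku, string ShopName, string ShopId)>();
        using JsonDocument doc = JsonDocument.Parse(json);
        foreach (JsonElement item in doc.RootElement.EnumerateArray())
        {
            // ToString() yields the string value, or the raw text for numbers.
            rows.Add((
                item.GetProperty("pid").ToString(),
                item.GetProperty("seller").ToString(),
                item.GetProperty("shopId").ToString()));
        }
        return rows;
    }

    public static void Main()
    {
        foreach (var row in Extract(SampleJson))
        {
            Console.WriteLine($"{row.Sku} | {row.ShopName} | {row.ShopId}");
        }
    }
}
```

Each tuple here plays the role of one ProductUpdater instance: Sku, ShopName and ShopId are filled from pid, seller and shopId respectively.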
    2. Set the pipeline mode to Update

      context.AddEntityPipeline(new MySqlEntityPipeline
      {
          ConnectString = "Database='taobao';Data Source= ;User ID=root;Password=1qazZAQ!;Port=4306",
          // Write the scraped fields back into existing rows.
          Mode = PipelineMode.Update
      });
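Conceptually, an update-mode pipeline needs a statement that sets the scraped columns and finds the row by its primary key, which is why the key has to be declared on the entity. The sketch below builds such a statement by hand. It is an illustration under assumptions, not the SQL that MySqlEntityPipeline actually generates: BuildUpdate is a hypothetical helper, and the table name omits the weekly suffix.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UpdateSqlSketch
{
    // Builds a parameterized UPDATE that assigns every non-key column and
    // filters on the primary key. Hypothetical helper for illustration only.
    public static string BuildUpdate(string table, string primaryKey, IEnumerable<string> columns)
    {
        string assignments = string.Join(", ",
            columns.Where(c => c != primaryKey).Select(c => $"`{c}` = @{c}"));
        return $"UPDATE {table} SET {assignments} WHERE `{primaryKey}` = @{primaryKey};";
    }

    public static void Main()
    {
        // Column names taken from the StoredAs attributes of ProductUpdater.
        Console.WriteLine(BuildUpdate("jd.sku_v2", "sku", new[] { "sku", "shopname", "shopid" }));
    }
}
```

Without a primary key there would be no WHERE clause to narrow the update to a single SKU row.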
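    3. The returned data is wrapped in json(...) padding (a JSONP callback), so the payload has to be cut out first. The framework provides a PageHandler interface (the code below plugs handlers in through IDownloadCompleteHandler), and we have implemented many commonly used handlers for pre-processing content before it is parsed. The complete code is as follows: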

          public class JdShopDetailSpider : EntitySpiderBuilder
          {
              protected override EntitySpider GetEntitySpider()
              {
                  var context = new EntitySpider(new Site())
                  {
                      TaskGroup = "JD SKU Weekly",
                      Identity = "JD Shop details " + DateTimeUtils.MondayRunId,
                      CachedSize = 1,
                      ThreadNum = 8,
                      Downloader = new HttpClientDownloader
                      {
                          DownloadCompleteHandlers = new IDownloadCompleteHandler[]
                          {
                              // Cut the JSON payload out of the json(...); wrapper before parsing.
                              new SubContentHandler
                              {
                                  Start = "json(",
                                  End = ");",
                                  StartOffset = 5,
                                  EndOffset = 0
                              }
                          }
                      },
                      PrepareStartUrls = new PrepareStartUrls[]
                      {
                          new BaseDbPrepareStartUrls()
                          {
                              Source = DataSource.MySql,
                              ConnectString = "Database='test';Data Source= localhost;User ID=root;Password=1qazZAQ!;Port=3306",
                              QueryString = $"SELECT * FROM jd.sku_v2_{DateTimeUtils.MondayRunId} WHERE shopname is null or shopid is null order by sku",
                              Columns = new [] {new DataColumn { Name = "sku"} },
                              FormateStrings = new List<string> { "http://chat1.jd.com/api/checkChat?my=list&pidList={0}&callback=json" }
                          }
                      }
                  };
                  context.AddEntityPipeline(new MySqlEntityPipeline
                  {
                      ConnectString = "Database='taobao';Data Source=localhost ;User ID=root;Password=1qazZAQ!;Port=4306",
                      Mode = PipelineMode.Update
                  });
                  context.AddEntityType(typeof(ProductUpdater), new TargetUrlExtractor
                  {
                      Region = new Selector { Type = SelectorType.XPath, Expression = "//*[@id=\"J_bottomPage\"]" },
                      Patterns = new List<string> { @"&page=[0-9]+&" }
                  });
                  return context;
              }
      
              [Schema("jd", "sku_v2", TableSuffix.Monday)]
              [EntitySelector(Expression = "$.[*]", Type = SelectorType.JsonPath)]
              [Indexes(Primary = "sku")]
              public class ProductUpdater : ISpiderEntity
              {
                  [StoredAs("sku", DataType.String, 25)]
                  [PropertySelector(Expression = "$.pid", Type = SelectorType.JsonPath)]
                  public string Sku { get; set; }
      
                  [StoredAs("shopname", DataType.String, 100)]
                  [PropertySelector(Expression = "$.seller", Type = SelectorType.JsonPath)]
                  public string ShopName { get; set; }
      
                  [StoredAs("shopid", DataType.String, 25)]
                  [PropertySelector(Expression = "$.shopId", Type = SelectorType.JsonPath)]
                  public string ShopId { get; set; }
              }
          }
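The trimming step in the handler configuration above can be pictured with a few lines of plain C#. This is a minimal sketch of the idea, assuming the response has the shape json(...); and it is not the framework's SubContentHandler implementation. Unwrap and BuildUrl are hypothetical helpers, and the sample body is invented.

```csharp
using System;

public static class JsonpSketch
{
    // Returns the text between the first "json(" and the last ");".
    // Throws when the wrapper is missing so that bad responses are noticed early.
    public static string Unwrap(string body, string start = "json(", string end = ");")
    {
        int from = body.IndexOf(start, StringComparison.Ordinal);
        int to = body.LastIndexOf(end, StringComparison.Ordinal);
        if (from < 0 || to < from + start.Length)
        {
            throw new FormatException("Response is not wrapped in the expected callback.");
        }
        from += start.Length; // skip past "json(" itself
        return body.Substring(from, to - from);
    }

    // Fills a SKU into the "{0}" placeholder of the URL template used above.
    public static string BuildUrl(string sku) => string.Format(
        "http://chat1.jd.com/api/checkChat?my=list&pidList={0}&callback=json", sku);

    public static void Main()
    {
        // Invented sample body; only the json( ... ); wrapper shape comes from the article.
        string wrapped = "json([{\"pid\":1001,\"seller\":\"Shop A\",\"shopId\":501}]);";
        Console.WriteLine(Unwrap(wrapped));
        Console.WriteLine(BuildUrl("1001"));
    }
}
```

The callback=json query parameter in the URL template is what makes the server wrap the payload in json(...), which is why the Start marker is "json(" and the start offset equals its length of 5.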
  • Original article: https://www.cnblogs.com/jjg0519/p/6707547.html