  • Incremental crawlers based on the Scrapy framework

    Incremental crawling

    • Concept: detect when a website's data has been updated and crawl only the newly published data.

    • Core idea: deduplication.

    • Record table: must be stored persistently; a set in Redis is a natural fit.

      • It records what has already been crawled:
        • URLs of movie detail pages that were crawled: used for deep crawling (following links)
        • Data fingerprints: used for non-deep crawling (detecting updated data within a single page); see the sketch after this list
          • A data fingerprint is a unique identifier for a piece of data
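
      The spider below deduplicates on detail-page URLs; for the fingerprint approach, here is a minimal sketch of the idea. The Redis key name 'fingerprints' and the helper functions are illustrative assumptions, not part of the original post:

      import hashlib
      from redis import Redis

      conn = Redis(host='127.0.0.1', port=6379)

      def make_fingerprint(data: str) -> str:
          # hash the serialized record to get a fixed-length unique identifier
          return hashlib.md5(data.encode('utf-8')).hexdigest()

      def is_new(data: str) -> bool:
          # sadd returns 1 if the fingerprint was not in the set yet (new data),
          # 0 if it was already recorded (duplicate)
          return conn.sadd('fingerprints', make_fingerprint(data)) == 1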
    • Code implementation:

      # -*- coding: utf-8 -*-
      import scrapy
      from scrapy.linkextractors import LinkExtractor
      from scrapy.spiders import CrawlSpider, Rule
      from redis import Redis
      from moviePro.items import MovieproItem
      class MovieSpider(CrawlSpider):
          name = 'movie'
          # allowed_domains = ['www.xxx.com']
          start_urls = ['https://www.4567tv.tv/frim/index1.html']
          
          conn = Redis(host='127.0.0.1',port=6379)
          
          link = LinkExtractor(allow=r'frim/index1-\d+\.html')  # extract pagination links
          rules = (
              Rule(link, callback='parse_item', follow=False),
          )
      
          def parse_item(self, response):
              # movie title + detail-page url
              li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
              for li in li_list:
                  name = li.xpath('./div/a/@title').extract_first()
                  item = MovieproItem()
                  item['name'] = name
                  detail_url = 'https://www.4567tv.tv'+li.xpath('./div/a/@href').extract_first()
                  ex = self.conn.sadd('movie_record',detail_url)
                  
                  if ex == 1:  # this movie was not in the record table before
                      print('New data found!')
                      yield scrapy.Request(url=detail_url,callback=self.parse_detail,meta={'item':item})
                  else:
                      print('No new data to crawl yet...')
                      
          def parse_detail(self,response):
              item = response.meta['item']
              desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
              item['desc'] = desc
              yield item
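
      The spider imports MovieproItem from moviePro.items, which the original post does not show. A minimal items.py consistent with the fields used above ('name' and 'desc') might look like this; it is a sketch, not the author's file:

      # items.py
      import scrapy

      class MovieproItem(scrapy.Item):
          # fields populated by the spider: movie title and description
          name = scrapy.Field()
          desc = scrapy.Field()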
      

      Code in pipelines.py:

      import json

      class MovieproPipeline(object):
          def process_item(self, item, spider):
              conn = spider.conn
              # redis cannot store an Item object directly, so serialize it to JSON first
              conn.lpush('movieData', json.dumps(dict(item)))
              return item
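
      For this pipeline to run it must be enabled in settings.py. A minimal sketch, assuming the project is named moviePro (matching the import in the spider) and using Scrapy's conventional priority value of 300:

      # settings.py (relevant part only)
      ITEM_PIPELINES = {
          'moviePro.pipelines.MovieproPipeline': 300,
      }

      The stored items can later be read back from the Redis list, for example with conn.lrange('movieData', 0, -1) and json.loads on each entry.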
      
  • Original post: https://www.cnblogs.com/zhufanyu/p/12020539.html