  • Scrapy crawler series, part 7: using scrapy_redis

    Key points: deduplicating requests through Redis, persisting the crawl queue, and running a spider distributed across machines with scrapy_redis

    Brief introduction:

    • Install: pip3 install scrapy_redis
    • Builds on Scrapy with extra features: request deduplication (for incremental crawling), crawl persistence, and distributed crawling
    • Workflow: the scheduler queue and the fingerprint set live in Redis; a fingerprint is computed for every request, and before a request is pushed to Redis the scheduler first checks whether that fingerprint already exists, storing the request only if it does not.
    • Configuration:
      • # Make sure every spider deduplicates requests through Redis
        DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
        # Enable the Redis-backed scheduler to store the request queue
        SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
        # Do not clear the Redis queues, so a crawl can be paused and resumed
        SCHEDULER_PERSIST = True
        # Save scraped items to Redis
        ITEM_PIPELINES = {
            'scrapy_redis.pipelines.RedisPipeline': 400
        }
        # Redis server address
        REDIS_URL = 'redis://192.168.3.20:6379'
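
        If the Redis instance requires a password or a specific logical database, the redis:// URL understood by redis-py can carry both; a sketch with placeholder values:

        # ':mypassword' and the trailing '/0' (database number) are placeholders
        REDIS_URL = 'redis://:mypassword@192.168.3.20:6379/0'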

    • What ends up in Redis:
      • keys * shows three keys (a short redis-py sketch for inspecting them follows):
        <spider name>:requests    the scheduler queue holding pending request objects; fetching is a pop, so each request is removed as it is handed out
        <spider name>:dupefilter  the fingerprint set, holding fingerprints of requests that have already entered the scheduler queue; by default a fingerprint is built from the request method, URL and request body
        <spider name>:items       the scraped items; only written if RedisPipeline is enabled in ITEM_PIPELINES
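
        A minimal redis-py sketch for inspecting those keys, assuming the spider is named dmoz and the Redis address from the configuration above (the key types match the default scrapy_redis queue classes):

        import redis

        r = redis.from_url('redis://192.168.3.20:6379')
        # requests: a sorted set when the default PriorityQueue is used
        print(r.zcard('dmoz:requests'))      # pending requests
        # dupefilter: a plain set of request fingerprints
        print(r.scard('dmoz:dupefilter'))    # fingerprints seen so far
        # items: a list that RedisPipeline appends to
        print(r.llen('dmoz:items'))          # items scraped so far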

    • When does a request object get enqueued?
      • 1. dont_filter=True: if the request is built with dont_filter set to True, its URL will be fetched again and again (useful when the content behind a URL keeps changing, as on Baidu Tieba); a usage sketch follows the code below
        2. A brand-new URL is seen and a request is built for it
        3. The URL is listed in start_urls; it is enqueued whether or not it was requested before (reason: requests built from start_urls are created with dont_filter=True)
        4. The code, the enqueue_request method of scheduler.py:

        def enqueue_request(self, request):
            # Drop the request if it is filterable and its fingerprint was already seen
            if not request.dont_filter and self.df.request_seen(request):
                self.df.log(request, self.spider)
                return False
            # Otherwise push it onto the Redis-backed queue
            self.queue.push(request)
            return True
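
        For case 1 above, a minimal sketch of issuing a non-filtered request from a spider callback (the spider name and URL are illustrative):

        import scrapy

        class RefreshSpider(scrapy.Spider):
            name = 'refresh_demo'
            start_urls = ['http://example.com/changing-page']

            def parse(self, response):
                # dont_filter=True bypasses the dupefilter, so the same URL can be
                # re-enqueued even though its fingerprint is already in Redis
                yield scrapy.Request(response.url, callback=self.parse,
                                     dont_filter=True)
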
    • How scrapy_redis deduplicates
      • Hash the request with SHA-1 to get a fingerprint
        Store the fingerprint in a Redis set
        When a new request comes in, build its fingerprint the same way and check whether it is already in the Redis set

    • Generating the fingerprint
      • import hashlib
        from scrapy.utils.python import to_bytes
        from w3lib.url import canonicalize_url

        def request_fingerprint(request):
            fp = hashlib.sha1()
            fp.update(to_bytes(request.method))
            fp.update(to_bytes(canonicalize_url(request.url)))
            fp.update(request.body or b'')
            return fp.hexdigest()
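
        As a quick check of the canonicalization step, two requests whose query parameters differ only in order produce the same fingerprint (a sketch reusing the function above; the URLs are placeholders):

        from scrapy import Request

        r1 = Request('http://example.com/page?a=1&b=2')
        r2 = Request('http://example.com/page?b=2&a=1')
        # canonicalize_url sorts the query string, so both fingerprints match
        assert request_fingerprint(r1) == request_fingerprint(r2)
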
    • Checking whether the fingerprint is already in the Redis set, inserting it if not
      • The request_seen method of dupefilter.py:
        # SADD returns 1 if the fingerprint was newly added, 0 if it was already present
        added = self.server.sadd(self.key, fp)
        # So the request counts as "seen" exactly when nothing was added
        return added == 0
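
        The SADD return value carries the whole decision; a tiny redis-py sketch (key and fingerprint values are illustrative):

        import redis

        r = redis.from_url('redis://192.168.3.20:6379')
        fp = 'example-fingerprint'
        print(r.sadd('dmoz:dupefilter', fp))  # 1 -> new, the request gets scheduled
        print(r.sadd('dmoz:dupefilter', fp))  # 0 -> already seen, the request is dropped
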
    • Before scrapy_redis, how did Scrapy deduplicate on its own?
      • RFPDupeFilter in scrapy/dupefilters.py, the request_seen method:
        # It also builds a fingerprint first and checks it against an in-memory set;
        # if the fingerprint is there, the request has already been crawled.
        # The set is optionally persisted to a requests.seen file (when a JOBDIR is set).
        def request_seen(self, request):
            fp = self.request_fingerprint(request)
            if fp in self.fingerprints:
                return True
            self.fingerprints.add(fp)
            if self.file:
                self.file.write(fp + os.linesep)

    Site to crawl: dmoz (dmoztools.net)

    Full code: https://files.cnblogs.com/files/bookwed/dmoztools.zip

    Main code:

    dmoz.py

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class DmozSpider(CrawlSpider):
        name = 'dmoz'
        allowed_domains = ['dmoztools.net']
        start_urls = ['http://www.dmoztools.net']
    
        # Follow top-level category, sub-category, and category-item links
        rules = (
            Rule(LinkExtractor(
                restrict_css=('.top-cat', '.sub-cat', '.cat-item')
            ), callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            for div in response.css(".title-and-desc"):
                yield {
                    'name': div.css(".site-title::text").extract_first(),
                    'desc': div.css(".site-descr::text").extract_first().strip(),
                    'link': div.css("a::attr(href)").extract_first(),
                }
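
    The spider above still hard-codes start_urls, so every worker seeds itself. To feed start URLs in through Redis instead, scrapy_redis also provides spider base classes; a minimal sketch using RedisCrawlSpider (the spider name and redis_key are illustrative):

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule
    from scrapy_redis.spiders import RedisCrawlSpider


    class DmozRedisSpider(RedisCrawlSpider):
        name = 'dmoz_redis'
        allowed_domains = ['dmoztools.net']
        # Workers block until URLs are pushed onto this Redis list
        redis_key = 'dmoz_redis:start_urls'

        rules = (
            Rule(LinkExtractor(
                restrict_css=('.top-cat', '.sub-cat', '.cat-item')
            ), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            for div in response.css(".title-and-desc"):
                yield {
                    'name': div.css(".site-title::text").extract_first(),
                    'desc': div.css(".site-descr::text").extract_first().strip(),
                    'link': div.css("a::attr(href)").extract_first(),
                }

    Seeding then becomes a single Redis command from any client, e.g. LPUSH dmoz_redis:start_urls http://www.dmoztools.net, and every worker running the spider pulls from the same shared queue.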

    settings.py

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for dmoztools project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'dmoztools'
    
    SPIDER_MODULES = ['dmoztools.spiders']
    NEWSPIDER_MODULE = 'dmoztools.spiders'
    
    # Make sure every spider deduplicates requests through Redis
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
    # Enable the Redis-backed scheduler to store the request queue
    SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
    # Do not clear the Redis queues, so a crawl can be paused and resumed
    SCHEDULER_PERSIST = True
    # Schedule requests with a priority queue (the default)
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
    # Other available queues
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'
    # Max idle time, to keep a distributed spider from closing while it waits for work
    # SCHEDULER_IDLE_BEFORE_CLOSE = 10
    
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'dmoztools (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'dmoztools.middlewares.DmoztoolsSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'dmoztools.middlewares.DmoztoolsDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'dmoztools.pipelines.DmoztoolsPipeline': 300,
       'scrapy_redis.pipelines.RedisPipeline': 400
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    REDIS_URL = 'redis://192.168.3.20:6379'
    Original article: https://www.cnblogs.com/bookwed/p/10648568.html