  • The scrapy-redis component

    Core idea: a shared crawl queue

    Purpose: distributed crawling

    I. Installation

    pip3 install -i https://pypi.douban.com/simple scrapy-redis

    II. Deduplication

    1. Configuration file

    Scrapy deduplication

    DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
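Conceptually, a Redis-backed dupefilter keeps one fingerprint per request in a shared set, so every worker can see what the others have already requested. A minimal sketch of that idea, assuming a plain Python set as a stand-in for the Redis set and a simplified fingerprint (the class name `InMemoryDupeFilter` is hypothetical, not part of scrapy-redis):

```python
import hashlib


class InMemoryDupeFilter:
    """Stand-in for a Redis-backed filter: a Python set plays the Redis set."""

    def __init__(self):
        self.seen = set()

    @staticmethod
    def fingerprint(method, url):
        # Simplified fingerprint for illustration only; Scrapy's real
        # fingerprint also canonicalizes the URL and includes the body.
        raw = f"{method.upper()} {url}".encode("utf-8")
        return hashlib.sha1(raw).hexdigest()

    def request_seen(self, method, url):
        """Return True if this request was recorded before, else record it."""
        fp = self.fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


dupefilter = InMemoryDupeFilter()
results = [
    dupefilter.request_seen("GET", "https://example.com/a"),
    dupefilter.request_seen("GET", "https://example.com/b"),
    dupefilter.request_seen("get", "https://example.com/a"),  # same request again
]
print(results)
```

Only the third call reports a duplicate. With a shared Redis set in place of `self.seen`, the same check would hold across processes and machines.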

    Connecting Scrapy to Redis

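    REDIS_HOST = 'ip'                          # placeholder: Redis server IP
    REDIS_PORT = 6379                          # placeholder: your port number (6379 is the Redis default)
    REDIS_PARAMS = {'password': 'password'}    # placeholder: your password
    REDIS_ENCODING = "utf-8"
    # REDIS_URL = 'redis://user:password@ip:port'   (takes precedence over the settings above)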
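Since REDIS_URL takes precedence over the separate host and port settings, it can be convenient to assemble it in one place. A minimal sketch, assuming a hypothetical helper `build_redis_url` (not a scrapy-redis function) and placeholder credentials; the password is percent-encoded so that characters such as `@` or `:` cannot be mistaken for URL delimiters:

```python
from urllib.parse import quote


def build_redis_url(host, port, password=None, user="", db=0):
    """Assemble a redis:// URL (hypothetical helper, not a scrapy-redis API)."""
    auth = ""
    if password is not None:
        # Percent-encode both parts of the credentials.
        auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    return f"redis://{auth}{host}:{port}/{db}"


# Placeholder values; replace them with your own server details.
REDIS_URL = build_redis_url("127.0.0.1", 6379, password="p@ss:word")
print(REDIS_URL)
```

The resulting string can be pasted into the settings file as the value of REDIS_URL.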

    2. Custom class

    Subclass RFPDupeFilter and override the from_settings method to set a fixed default key

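    from scrapy_redis import defaults
    from scrapy_redis.connection import get_redis_from_settings
    from scrapy_redis.dupefilter import RFPDupeFilter

    class RedisDupeFilter(RFPDupeFilter):
        @classmethod
        def from_settings(cls, settings):
            server = get_redis_from_settings(settings)
            # Fill the template with a fixed string (placeholder) instead of a timestamp
            key = defaults.DUPEFILTER_KEY % {'timestamp': 'fixed_key'}
            debug = settings.getbool('DUPEFILTER_DEBUG')
            return cls(server, key=key, debug=debug)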

    Then change the DUPEFILTER_CLASS path in the settings file to point at this class
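Why a fixed key helps can be seen from the key template alone. A minimal sketch, assuming the stock from_settings fills `%(timestamp)s` with the current Unix time (the helper names `timestamp_key` and `fixed_key` are hypothetical):

```python
import time

DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'


def timestamp_key(now=None):
    """Key built from the clock: differs between runs and between workers."""
    now = int(time.time()) if now is None else int(now)
    return DUPEFILTER_KEY % {'timestamp': now}


def fixed_key(name):
    """Key built from a constant: every run and every worker shares it."""
    return DUPEFILTER_KEY % {'timestamp': name}


# Two runs started a minute apart (fixed example timestamps).
run_1 = timestamp_key(now=1700000000)
run_2 = timestamp_key(now=1700000060)
print(run_1, run_2)          # two different Redis keys
print(fixed_key('shared'))   # one stable Redis key
```

With clock-based keys each run writes its fingerprints to a different Redis key, so nothing is shared; a constant gives all workers the same dedup record.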

    III. Scheduler

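    # Redis connection
    REDIS_HOST = 'ip'        # placeholder: Redis server IP
    REDIS_PORT = 6379        # placeholder: your port number (6379 is the Redis default)
    REDIS_PARAMS = {'password': 'password'}  # placeholder: your password
    REDIS_ENCODING = "utf-8"        # Redis encoding, default: 'utf-8'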
    
    # Deduplication
    DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
    
    # Scheduler
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    
    DEPTH_PRIORITY = 1  # breadth-first
    # DEPTH_PRIORITY = -1 # depth-first
    SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  
    
    # Breadth-first, first in first out
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'  
    # Depth-first, last in first out
    # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'  
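    SCHEDULER_QUEUE_KEY = '%(spider)s:requests'  # Redis key under which the scheduler stores pending requests
    
    SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"  # serializer for data saved to Redis, pickle by default
    
    SCHEDULER_PERSIST = False  # keep the scheduler queue and dedup record on close? True = keep, False = clear
    SCHEDULER_FLUSH_ON_START = True  # clear the scheduler queue and dedup record before starting? True = clear, False = keep
    # SCHEDULER_IDLE_BEFORE_CLOSE = 10  # maximum time to wait when the scheduler queue is empty before giving up with no data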
    
    # DUPEFILTER_CLASS takes priority; if it is not set, SCHEDULER_DUPEFILTER_CLASS is used
    SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'  # Redis key under which the dedup record is stored
    SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule
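The three queue classes differ only in the order in which requests come back out. A minimal sketch of those orderings, assuming in-memory stand-ins (a deque and a heap) instead of the Redis lists and sorted set, and assuming that Scrapy's depth handling lowers a request's priority by depth × DEPTH_PRIORITY; all function names here are hypothetical:

```python
import heapq
from collections import deque


def fifo_order(items):
    """First in, first out: push on the left, pop from the right."""
    q = deque()
    for item in items:
        q.appendleft(item)
    return [q.pop() for _ in range(len(q))]


def lifo_order(items):
    """Last in, first out: push on the left, pop from the left."""
    q = deque()
    for item in items:
        q.appendleft(item)
    return [q.popleft() for _ in range(len(q))]


def priority_order(pages, depth_priority):
    """Highest priority first; priority is derived from the crawl depth."""
    heap = []
    for seq, (name, depth) in enumerate(pages):
        priority = -depth * depth_priority
        # heapq pops the smallest entry, so store the negated priority;
        # seq keeps insertion order for equal priorities.
        heapq.heappush(heap, (-priority, seq, name))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


pages = [("root", 0), ("child", 1), ("grandchild", 2)]
print(fifo_order(["a", "b", "c"]))
print(lifo_order(["a", "b", "c"]))
print(priority_order(pages, depth_priority=1))    # shallow pages first
print(priority_order(pages, depth_priority=-1))   # deep pages first
```

Under these assumptions, DEPTH_PRIORITY = 1 with the priority queue serves shallow pages first (breadth-first) and DEPTH_PRIORITY = -1 serves deep pages first (depth-first), matching the comments in the settings above.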

    IV. Start URLs

    Purpose: the spider process stays alive and blocks, waiting for new tasks (that is, start URLs) to arrive

    from scrapy_redis.spiders import RedisSpider

    The spider class inherits from RedisSpider

    Then, in a separate .py file, push the start URL

    import redis
    
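    conn = redis.Redis(host='ip', port=6379, password='password')  # placeholders: your host, port and password
    conn.lpush('spider_name:start_urls', 'url')  # key is '<spider name>:start_urls'; 'url' is the start URL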
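The producer and consumer sides of this handoff can be tried without a Redis server. A minimal sketch, assuming the `<spider name>:start_urls` key pattern used above; `FakeRedis`, `seed_start_urls` and `drain` are hypothetical stand-ins, and the polling loop only illustrates the waiting idea, it is not how RedisSpider is implemented:

```python
from collections import defaultdict, deque


class FakeRedis:
    """Tiny in-memory stand-in for the two list commands used below."""

    def __init__(self):
        self.lists = defaultdict(deque)

    def lpush(self, name, *values):
        for value in values:
            self.lists[name].appendleft(value)
        return len(self.lists[name])

    def lpop(self, name):
        return self.lists[name].popleft() if self.lists[name] else None


def seed_start_urls(conn, spider_name, urls):
    """Producer side: push start URLs under '<spider name>:start_urls'."""
    key = "%(name)s:start_urls" % {"name": spider_name}
    return conn.lpush(key, *urls)


def drain(conn, spider_name, max_idle_polls=3):
    """Consumer side: keep polling, give up after several empty polls."""
    key = "%(name)s:start_urls" % {"name": spider_name}
    fetched, idle = [], 0
    while idle < max_idle_polls:
        url = conn.lpop(key)
        if url is None:
            idle += 1   # nothing queued yet; a real worker would sleep here
        else:
            idle = 0
            fetched.append(url)
    return fetched


conn = FakeRedis()
queued = seed_start_urls(
    conn, "myspider",
    ["https://example.com/page1", "https://example.com/page2"],
)
fetched = drain(conn, "myspider")
print(queued, fetched)
```

Passing a real `redis.Redis(...)` connection to `seed_start_urls` instead of `FakeRedis` should work the same way, since it only relies on `lpush`.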
  • Original post: https://www.cnblogs.com/wt7018/p/11756393.html