22 - Crawlers: distributed crawling with the scrapy framework (09)

    Distributed crawling


    • How distribution is implemented: scrapy + redis (scrapy combined with the scrapy-redis component)
    • The native scrapy framework cannot be distributed on its own
      • What is distributed crawling?
        • Build a cluster of machines and have every machine in the cluster run the same program, so that they crawl the same set of resources jointly, with the work spread across the cluster.
        • Because the scheduler and the pipelines cannot be shared across a cluster, the native scrapy architecture cannot be distributed.
        • The scrapy-redis component gives native scrapy a shared scheduler and a shared pipeline, which makes distributed crawling possible.
          • pip install scrapy-redis (a quick import check is sketched after this list)
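
    Not in the original post, but a minimal way to confirm the component is importable in the environment that will run the Scrapy project:

    # sanity check, assuming scrapy-redis was installed with pip into the active environment
    import scrapy_redis

    print(scrapy_redis.__file__)  # prints the install location of the component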

    Implementation workflow

    Create the project

    Create a crawler project: scrapy startproject proName
    Enter the project directory and create a CrawlSpider-based spider file:
    scrapy genspider -t crawl spiderName www.xxx.com
    Run the project: scrapy crawl spiderName

    1. Modify the spider file

    • 1.1 Import: from scrapy_redis.spiders import RedisCrawlSpider
    • 1.2 Change the spider class's parent class to RedisCrawlSpider
    • 1.3 Replace start_urls with a redis_key attribute; its value can be any string
      • redis_key = 'xxxx'  # the name of the shared scheduler queue; later we will manually push a start URL into the queue that redis_key names
    • 1.4 Fill in the data-parsing logic

    fbs.py, the spider source file

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy_redis.spiders import RedisCrawlSpider
    from fbsPro.items import FbsproItem
    
    class FbsSpider(RedisCrawlSpider):
        name = 'fbs'
        #allowed_domains = ['www.xxx.com']
        #start_urls = ['http://www.xxx.com/']
        redis_key = 'sunQueue'  # name of the shared scheduler queue
        # later we will manually push a start URL into the queue that redis_key names

        rules = (
            Rule(LinkExtractor(allow=r'id=1&page=\d+'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # extract the titles from every listing page of the site
            li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
            for li in li_list:
                title = li.xpath('./span[3]/a/text()').extract_first()
                item = FbsproItem()
                item['title'] = title
                yield item
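
    The spider imports FbsproItem from fbsPro/items.py, which the post does not show. A minimal sketch of that file, consistent with the single item['title'] assignment above (the field name comes from the spider; the rest is standard Scrapy item boilerplate):

    # fbsPro/items.py -- minimal sketch matching the spider above
    import scrapy

    class FbsproItem(scrapy.Item):
        title = scrapy.Field()  # the only field the spider populates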
    
    
    

    2. Configure settings.py

    • Specify the scheduler
    # use the de-duplication filter provided by the scrapy-redis component
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # use the scheduler provided by the scrapy-redis component
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # persist the request queue in redis so a crawl can be paused and resumed
    SCHEDULER_PERSIST = True

    • Specify the pipeline
    # enable the pipeline bundled with the scrapy-redis component
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }  # this pipeline can only write items to redis

    • Specify redis
    # the crawler's redis connection settings, added to the config file:

    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 6379
    # REDIS_ENCODING = 'utf-8'
    # REDIS_PARAMS = {'password':'123456'}
    

    The complete settings.py

    # Scrapy settings for fbsPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'fbsPro'
    
    SPIDER_MODULES = ['fbsPro.spiders']
    NEWSPIDER_MODULE = 'fbsPro.spiders'
    #LOG_LEVEL = 'ERROR'  # log errors only (leave this commented out here; see step 5)
    
    # set a User-Agent so requests look like they come from a normal browser
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
    
    # Obey robots.txt rules
    # set to False so the crawler does not obey the robots.txt protocol
    ROBOTSTXT_OBEY = False
    
    # use the de-duplication filter provided by the scrapy-redis component
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # use the scheduler provided by the scrapy-redis component
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # persist the request queue in redis so a crawl can be paused and resumed
    SCHEDULER_PERSIST = True

    # enable the pipeline bundled with the scrapy-redis component
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }

    # the crawler's redis connection settings:
    
    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 6379
    # REDIS_ENCODING = 'utf-8'
    # REDIS_PARAMS = {'password':'123456'}
    
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    CONCURRENT_REQUESTS = 5
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'fbsPro.middlewares.FbsproSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'fbsPro.middlewares.FbsproDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    # ITEM_PIPELINES = {
    #    'fbsPro.pipelines.FbsproPipeline': 300,
    # }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    

    3. Configure the redis configuration file, redis.windows.conf

    • Remove the default binding

      • comment out the bind directive, i.e. change it to #bind 127.0.0.1 (line 56 of the file)
    • Turn off protected mode

      • protected-mode no -- change yes to no (line 75 of the file)
    • If redis reports this error at startup:
      Creating Server TCP listening socket 127.0.0.1:6379: bind: No error

    • Fix: run the following commands in order

      • redis-cli.exe
      • shutdown
      • exit
      • redis-server redis.windows.conf

    4. Start the redis server and the client

    • Start the server: redis-server redis.windows.conf
    • Start the client: redis-cli
    • A quick connectivity check from Python is sketched below.
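
    Not from the original post: a quick connectivity check with the redis-py client (assumed installed via pip install redis) to confirm the server is reachable and protected mode is off:

    # connectivity check against the locally started redis server
    import redis

    r = redis.Redis(host='127.0.0.1', port=6379, decode_responses=True)
    print(r.ping())                        # True if the server answers
    print(r.config_get('protected-mode'))  # expect {'protected-mode': 'no'}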

    5. Run the scrapy project

    • Do not add LOG_LEVEL = 'ERROR' to the configuration file (it would hide the output described below)
    • After the project starts, the program will pause at the "listening" stage and wait for a start URL to be added to the queue

    6. Push a start URL into the queue that redis_key names

    • Run the following command in the redis client (the scheduler queue lives in redis); an equivalent using redis-py is sketched after this list
    • lpush sunQueue http://wz.sun0769.com/political/index/politicsNewest?id=1&page=
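
    The same push can be done from Python with the redis-py client (an illustrative equivalent of the redis-cli command above, not part of the original post):

    # push the start URL into the queue named by redis_key = 'sunQueue'
    import redis

    r = redis.Redis(host='127.0.0.1', port=6379)
    r.lpush('sunQueue',
            'http://wz.sun0769.com/political/index/politicsNewest?id=1&page=')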

    Looking at the redis database, we can see the crawled data.
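
    By default, scrapy-redis's RedisPipeline stores serialized items in a redis list named <spider_name>:items, which here would be fbs:items (that key pattern is the component's default, not something shown in the post). A quick way to peek at the stored items:

    # inspect a few items written by scrapy_redis.pipelines.RedisPipeline
    import json
    import redis

    r = redis.Redis(host='127.0.0.1', port=6379)
    for raw in r.lrange('fbs:items', 0, 4):  # first five stored items
        print(json.loads(raw))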

Original article: https://www.cnblogs.com/gemoumou/p/13635323.html