  • 22 - Crawlers: distributed crawling with the Scrapy framework 09

    Distributed crawling


    • How to implement distribution: scrapy + redis (Scrapy combined with the scrapy-redis component)
    • Native Scrapy cannot be distributed on its own
      • What is distributed crawling?
        • Build a cluster of machines and have every machine in the cluster run the same program, so that they jointly crawl the same set of resources with the work divided among them.
        • Because the scheduler and the pipeline cannot be shared across a cluster, the native Scrapy architecture cannot be distributed.
        • The scrapy-redis component provides a shared scheduler and a shared pipeline for native Scrapy, which makes distribution possible.
          • pip install scrapy-redis

    Implementation workflow

    Create the project

    Create a Scrapy project: scrapy startproject proName
    Enter the project and create a CrawlSpider-based spider file:
    scrapy genspider -t crawl spiderName www.xxx.com
    Run the project: scrapy crawl spiderName

    1. Modify the spider file

    • 1.1 Import: from scrapy_redis.spiders import RedisCrawlSpider
    • 1.2 Change the spider class's parent class to RedisCrawlSpider
    • 1.3 Replace start_urls with a redis_key attribute whose value can be any string
      • redis_key = 'xxxx'  # name of the shared scheduler queue; later we will manually push a start URL into the queue identified by redis_key
    • 1.4 Fill in the data-parsing logic as usual

    fbs.py (spider source file)

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy_redis.spiders import RedisCrawlSpider
    from fbsPro.items import FbsproItem
    
    class FbsSpider(RedisCrawlSpider):
        name = 'fbs'
        #allowed_domains = ['www.xxx.com']
        #start_urls = ['http://www.xxx.com/']
        redis_key = 'sunQueue'  # name of the shared scheduler queue
        # later we will manually push a start URL into the queue identified by redis_key
    
        rules = (
            Rule(LinkExtractor(allow=r'id=1&page=\d+'), callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            # grab the titles from every page of the site
            li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
            for li in li_list:
                title = li.xpath('./span[3]/a/text()').extract_first()
                item = FbsproItem()
                item['title'] = title
                yield item
    
    
    
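    items.py

    fbs.py imports FbsproItem from fbsPro.items, so items.py needs a matching item class. A minimal sketch covering only the title field the spider actually fills:

    import scrapy

    class FbsproItem(scrapy.Item):
        # the only field the spider populates
        title = scrapy.Field()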

    2. Configure settings.py

    • Specify the scheduler
    # use the request de-duplication filter provided by scrapy-redis (shared via redis)
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # use the scheduler provided by scrapy-redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # keep the redis queue and fingerprints when the spider closes (allows pause/resume)
    SCHEDULER_PERSIST = True
    
    • Specify the pipeline
    # enable the pipeline that ships with scrapy-redis
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }  # this pipeline can only write items into redis
    
    • Specify redis
    # configure how the crawler connects to redis:
    
    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 6379
    # REDIS_ENCODING = 'utf-8'
    # REDIS_PARAMS = {'password':'123456'}
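    Note that 127.0.0.1 only works when the Redis server runs on the same machine as the crawler. On every other node in the cluster, REDIS_HOST must point at the machine that actually hosts Redis, for example (the address below is hypothetical):
    
    # hypothetical LAN address of the machine running redis-server
    REDIS_HOST = '192.168.1.100'
    REDIS_PORT = 6379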
    

    Complete settings.py

    # Scrapy settings for fbsPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'fbsPro'
    
    SPIDER_MODULES = ['fbsPro.spiders']
    NEWSPIDER_MODULE = 'fbsPro.spiders'
    #LOG_LEVEL = 'ERROR'  # restrict log output to errors only
    
    # set a User-Agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'
    
    # Obey robots.txt rules
    # set to False so robots.txt is ignored
    ROBOTSTXT_OBEY = False
    
    # use the request de-duplication filter provided by scrapy-redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # use the scheduler provided by scrapy-redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # keep the redis queue and fingerprints when the spider closes (allows pause/resume)
    SCHEDULER_PERSIST = True
    
    # enable the pipeline that ships with scrapy-redis
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }
    
    # configure how the crawler connects to redis:
    
    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 6379
    # REDIS_ENCODING = 'utf-8'
    # REDIS_PARAMS = {'password':'123456'}
    
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    CONCURRENT_REQUESTS = 5
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'fbsPro.middlewares.FbsproSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'fbsPro.middlewares.FbsproDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    # ITEM_PIPELINES = {
    #    'fbsPro.pipelines.FbsproPipeline': 300,
    # }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    

    3. Configure the Redis configuration file redis.windows.conf

    • Remove the default bind
    
      • Comment out #bind 127.0.0.1 (line 56 of the config file)
    • Turn off protected mode
    
      • protected-mode no — change yes to no (line 75 of the config file)
    • If Redis reports an error on startup
      Creating Server TCP listening socket 127.0.0.1:6379: bind: No error
    
    • Fix: run the following commands in order
    
      • redis-cli.exe
      • shutdown
      • exit
      • redis-server redis.windows.conf
    4. Start the Redis server and client
    
    • Start the server: redis-server redis.windows.conf
    • Start the client: redis-cli

    5. Run the Scrapy project

    • Do not enable LOG_LEVEL = 'ERROR' in the settings file, otherwise you will not see the log output described below
    • Once the project starts, the program stops at the listening stage and waits for a start URL to be pushed into the queue
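    Each node in the cluster starts the same project with the spider name defined in fbs.py:
    
    scrapy crawl fbs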

    6. Push a start URL into the queue identified by redis_key

    • Run the following command in the redis client (the scheduler queue lives inside redis):
    • lpush sunQueue http://wz.sun0769.com/political/index/politicsNewest?id=1&page=

    Checking the Redis database afterwards, we can see the scraped data.
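    By default, RedisPipeline serializes each item and pushes it onto a Redis list named <spider>:items, so here the data ends up in fbs:items. A minimal sketch (assuming the default key and JSON serialization) for reading the items back out with redis-py:

    import json
    import redis

    conn = redis.Redis(host='127.0.0.1', port=6379)

    # RedisPipeline pushes JSON-serialized items onto the 'fbs:items' list by default
    for raw in conn.lrange('fbs:items', 0, -1):
        item = json.loads(raw)
        print(item['title'])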
