  • Distributed crawlers

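    Distributed deployment with Redis

    Can the Scrapy framework implement a distributed crawl on its own?

    No, for two reasons:
      First, each machine running Scrapy has its own scheduler, so the URLs in the start_urls list cannot be divided among the machines (the machines cannot share one scheduler).
      Second, the data scraped on different machines cannot be persisted through a single pipeline (the machines cannot share one pipeline).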

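    Distributed crawlers based on the scrapy-redis component

    The scrapy-redis component provides a scheduler and a pipeline that multiple machines can share, so we can use them directly to crawl data in a distributed way.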
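
    To make the idea of a shared scheduler concrete, here is a minimal sketch that runs without Redis. The SharedScheduler class and the crawl function are hypothetical stand-ins written for illustration: a deque plays the role of the shared request queue and a set plays the role of the shared fingerprint set. It is not how scrapy-redis is implemented internally.

```python
from collections import deque


class SharedScheduler:
    """Hypothetical in-memory stand-in for a Redis-backed queue plus dedupe set."""

    def __init__(self):
        self.queue = deque()  # plays the role of the shared request queue
        self.seen = set()     # plays the role of the shared fingerprint set

    def enqueue(self, url):
        # Only accept URLs that no worker has scheduled before.
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        # Hand out the oldest pending URL, or None when the queue is empty.
        return self.queue.popleft() if self.queue else None


def crawl(worker_names, scheduler):
    """Let the workers take turns pulling URLs; return which worker got which URL."""
    fetched = {name: [] for name in worker_names}
    while True:
        progressed = False
        for name in worker_names:
            url = scheduler.next_url()
            if url is not None:
                fetched[name].append(url)
                progressed = True
        if not progressed:
            return fetched


scheduler = SharedScheduler()
for url in ["/page/1", "/page/2", "/page/3", "/page/1"]:  # the last one is a duplicate
    scheduler.enqueue(url)

result = crawl(["worker_a", "worker_b"], scheduler)
print(result)
```

    Because both workers pull from the same queue, every URL is fetched by exactly one of them, and the duplicate is dropped before it is ever queued. With one private queue per worker, each worker would fetch all three pages.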

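    Setup process

            - Create the project
            - Create the spider file
            - Modify the spider file:
                - Import: from scrapy_redis.spiders import RedisCrawlSpider
                - Change the parent class of the spider class to RedisCrawlSpider
                - Delete allowed_domains and start_urls, and add a new attribute redis_key (the name of the scheduler queue)
                - Parse the data, wrap the parsed data in an item, and submit it to the pipeline
            - Write the settings file:
                - Specify the pipeline:
                    ITEM_PIPELINES = {
                        'scrapy_redis.pipelines.RedisPipeline': 400
                    }
                - Specify the scheduler:
                    # Dedupe filter class: stores request fingerprints in a Redis set, so that request deduplication is persistent
                    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
                    # Use the scheduler provided by scrapy-redis
                    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
                    # Whether the scheduler persists its state: when the crawl ends, should the request queue and the fingerprint set in Redis be kept? True keeps the data; False clears it
                    SCHEDULER_PERSIST = True
                - Specify the Redis instance:
                    REDIS_HOST = 'IP address of the Redis server'
                    REDIS_PORT = 6379
                    REDIS_ENCODING = 'utf-8'
                    REDIS_PARAMS = {'password': '123456'}
                - Start the Redis server (passing its config file: redis-server ./redis.windows.conf) and the client:
                    - Adjust the Redis config file first:
                            - #bind 127.0.0.1
                            - protected-mode no
                    - Start the server and the client
                - Run the program: scrapy runspider xxx.py
                - Push a start URL into the scheduler queue (from the Redis client): lpush xxx www.xxx.com
                    - xxx is the value of the redis_key attribute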
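
    The settings from the steps above can also be collected in one place. The following is only a sketch: the function name build_scrapy_redis_settings and its parameters are hypothetical, and the IP address is a made-up example. Only the setting names and values come from the list above.

```python
def build_scrapy_redis_settings(host, port=6379, password=None, persist=True):
    """Return the scrapy-redis settings listed above as a plain dict.

    The function name and parameters are hypothetical; the setting
    names and values are the ones from the setup steps.
    """
    settings = {
        # Shared pipeline: items are stored in Redis
        "ITEM_PIPELINES": {"scrapy_redis.pipelines.RedisPipeline": 400},
        # Shared dedupe filter and scheduler
        "DUPEFILTER_CLASS": "scrapy_redis.dupefilter.RFPDupeFilter",
        "SCHEDULER": "scrapy_redis.scheduler.Scheduler",
        # True keeps the queue and fingerprints in Redis after the crawl ends
        "SCHEDULER_PERSIST": persist,
        # Connection details of the Redis server
        "REDIS_HOST": host,
        "REDIS_PORT": port,
        "REDIS_ENCODING": "utf-8",
    }
    if password is not None:
        settings["REDIS_PARAMS"] = {"password": password}
    return settings


# Example values only: replace the host and password with your own.
settings = build_scrapy_redis_settings("192.168.0.10", password="123456")
for name, value in sorted(settings.items()):
    print(f"{name} = {value!r}")
```

    The printed lines can be pasted into settings.py. Assigning the dict to a spider's custom_settings attribute should work as well, assuming the settings only need to apply to that one spider.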

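    Implementation options:

    1. Based on the component's RedisSpider class
    2. Based on the component's RedisCrawlSpider class

    Distributed workflow: the workflow is the same for both options.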

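    Install the scrapy-redis component: pip install scrapy-redis

    Settings in the Redis config file:

    - Comment out the line bind 127.0.0.1, so that other IPs can connect to Redis
    - Change protected-mode from yes to no (protected-mode no), so that other IPs can operate on Redis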
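
    Before starting the workers, it can help to check that both config changes were actually made. Below is a small sketch using only the standard library. The function name remote_access_problems is hypothetical, and it only inspects explicit bind and protected-mode lines in the text it is given; it does not know about Redis defaults or other directives.

```python
def remote_access_problems(conf_text):
    """Return the reasons this redis.conf text would block remote workers."""
    loopback = {"127.0.0.1", "::1", "-::1"}
    problems = []
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and commented-out directives have no effect
        parts = line.split()
        directive, args = parts[0].lower(), parts[1:]
        # An active bind line that lists only loopback addresses keeps other machines out.
        if directive == "bind" and args and set(args) <= loopback:
            problems.append("bind restricts Redis to localhost")
        # protected-mode yes rejects remote clients when no password is set.
        if directive == "protected-mode" and args[:1] == ["yes"]:
            problems.append("protected-mode is still yes")
    return problems


before = "bind 127.0.0.1\nprotected-mode yes\n"
after = "#bind 127.0.0.1\nprotected-mode no\n"
print(remote_access_problems(before))
print(remote_access_problems(after))
```

    Note that opening Redis to other IPs without a password is only reasonable on a trusted network.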

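    Modify the spider file:

    - Change the parent class of the spider to RedisSpider or RedisCrawlSpider. Note: if the original spider is based on Spider, change the parent class to RedisSpider; if it is based on CrawlSpider, change it to RedisCrawlSpider.
    - Comment out or delete the start_urls list and add a redis_key attribute, whose value is the name of the scheduler queue in the scrapy-redis component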

    In the settings file, enable the pipeline provided by the scrapy-redis component:

    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }

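    In the settings file, enable the scheduler provided by the scrapy-redis component:

    # Use the dedupe filter provided by scrapy-redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scheduler provided by scrapy-redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Whether the crawl may be paused and resumed: True keeps the request queue and fingerprints in Redis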
    SCHEDULER_PERSIST = True

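    In the settings file, configure how the crawler connects to Redis:

    REDIS_HOST = 'IP address of the Redis server'
    REDIS_PORT = 6379
    REDIS_ENCODING = 'utf-8'
    REDIS_PARAMS = {'password': '123456'}

    Start the Redis server: redis-server <config file>
    Start the Redis client: redis-cli
    Run the spider file: scrapy runspider SpiderFile
    Push a start URL into the scheduler queue (from the Redis client): lpush <redis_key value> <start URL>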
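
    The start URL can also be pushed from Python instead of redis-cli. The sketch below is written against any client object that has an lpush(key, value) method, which redis-py's Redis class provides. The FakeRedis class and the seed_start_urls function are hypothetical; FakeRedis exists only so that the example runs without a server. The scheme check is an assumption based on Scrapy normally rejecting request URLs without http:// or https://.

```python
class FakeRedis:
    """Hypothetical in-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, *values):
        items = self.lists.setdefault(key, [])
        for value in values:
            items.insert(0, value)  # LPUSH adds to the head of the list
        return len(items)


def seed_start_urls(client, redis_key, urls):
    """Push start URLs onto the list named by the spider's redis_key."""
    pushed = 0
    for url in urls:
        # Assumption: the spider needs a full URL including the scheme.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"start URL needs a scheme: {url!r}")
        client.lpush(redis_key, url)
        pushed += 1
    return pushed


client = FakeRedis()
count = seed_start_urls(client, "xxx", ["https://www.xxx.com"])
print(count, client.lists["xxx"])
```

    With a real server, a redis.Redis(host=..., port=6379) instance would be passed in place of FakeRedis, and "xxx" would be the spider's redis_key value.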

    Example 1

    Spider file

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from redis import Redis
    from moviePro.items import MovieproItem
    class MovieSpider(CrawlSpider):
        # Redis connection used to remember which detail URLs were already requested
        conn = Redis(host='127.0.0.1',port=6379)
        name = 'movie'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.4567tv.tv/frim/index1.html']
    
        rules = (
            # Follow the numbered pagination links (index1-2.html, index1-3.html, ...)
            Rule(LinkExtractor(allow=r'/frim/index1-\d+\.html'), callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            # Extract the detail-page URLs of the movies listed on the current page
            li_list = response.xpath('//div[@class="stui-pannel_bd"]/ul/li')
            for li in li_list:
                # Extract the detail-page URL; skip entries that have no link
                href = li.xpath('./div/a/@href').extract_first()
                if href is None:
                    continue
                detail_url = 'https://www.4567tv.tv'+href
                # ex == 1: this URL has not been requested before; ex == 0: it has already been requested
                ex = self.conn.sadd('movie_detail_urls',detail_url)
                if ex == 1:
                    print('New data to crawl......')
                    yield scrapy.Request(url=detail_url,callback=self.parse_detail)
                else:
                    print('No new data to crawl!')
        def parse_detail(self,response):
            # Extract the movie name and type from the detail page
            name = response.xpath('/html/body/div[1]/div/div/div/div[2]/h1/text()').extract_first()
            m_type = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[1]/a[1]/text()').extract_first()
            item = MovieproItem()
            item['name'] = name
            item['m_type'] = m_type
    
            yield item
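
    The allow pattern in the rule is meant to match the numbered pagination pages, which requires \d+ (one or more digits). A backslash is easily lost when code is copied from a web page, and a bare d+ matches the letter d instead. The check below uses only the re module; the URL list is made up for illustration.

```python
import re

# Intended pattern: "index1-" followed by one or more digits and ".html"
PAGINATION = re.compile(r'/frim/index1-\d+\.html')
# The same pattern with the backslash lost: "d+" now means one or more letters "d"
BROKEN = re.compile(r'/frim/index1-d+.html')

urls = [
    'https://www.4567tv.tv/frim/index1-2.html',
    'https://www.4567tv.tv/frim/index1-15.html',
    'https://www.4567tv.tv/frim/index1.html',    # first page, no page number
    'https://www.4567tv.tv/frim/index1-d.html',  # made-up URL, not a real page
]

matches = [u for u in urls if PAGINATION.search(u)]
broken_matches = [u for u in urls if BROKEN.search(u)]
print(matches)
print(broken_matches)
```

    With the broken pattern none of the real pagination pages match, so the crawl would stop after the first page without raising any error.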

    Items file

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class MovieproItem(scrapy.Item):
        # define the fields for your item here like:
        name = scrapy.Field()
        m_type = scrapy.Field()

    Pipeline file

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
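    import json
    
    
    class MovieproPipeline(object):
        def process_item(self, item, spider):
            # Reuse the Redis connection created on the spider
            conn = spider.conn
            dic = {
                'name':item['name'],
                'm_type':item['m_type']
            }
            # Redis stores strings or bytes, and recent redis-py versions reject a dict,
            # so serialize the dict to JSON before pushing it
            conn.lpush('movie_data',json.dumps(dic, ensure_ascii=False))
            return item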
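
    A Redis list holds strings or bytes, so a dict has to be serialized before it is pushed and parsed again when it is read back. The sketch below shows that round trip with JSON. A plain Python list stands in for the Redis list movie_data, and the helper names push_item and read_items are hypothetical.

```python
import json

stored = []  # stand-in for the Redis list 'movie_data'


def push_item(storage, item):
    # Like LPUSH: the newest entry goes to the head of the list.
    # ensure_ascii=False keeps non-ASCII titles readable in the stored string.
    storage.insert(0, json.dumps(item, ensure_ascii=False))


def read_items(storage):
    # Like LRANGE 0 -1 followed by json.loads on every entry.
    return [json.loads(raw) for raw in storage]


push_item(stored, {'name': 'Movie A', 'm_type': 'Drama'})
push_item(stored, {'name': '电影B', 'm_type': 'Comedy'})
items = read_items(stored)
print(items)
```

    Because entries are pushed to the head, the most recently scraped item comes back first.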

    Settings file

    # Scrapy settings for moviePro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'moviePro'
    
    SPIDER_MODULES = ['moviePro.spiders']
    NEWSPIDER_MODULE = 'moviePro.spiders'
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'moviePro (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'moviePro.middlewares.MovieproSpiderMiddleware': 543,
    #}
    LOG_LEVEL = 'ERROR'
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'moviePro.middlewares.MovieproDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'moviePro.pipelines.MovieproPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    Example 2: incremental crawler

    Spider file

    import scrapy
    from qiubaiPro.items import QiubaiproItem
    import hashlib
    from redis import Redis
    class QiubaiSpider(scrapy.Spider):
        name = 'qiubai'
        # Redis connection that stores the fingerprints of records already scraped
        conn = Redis(host='127.0.0.1',port=6379)
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.qiushibaike.com/text/']
    
        def parse(self, response):
            div_list = response.xpath('//div[@id="content-left"]/div')
            for div in div_list:
                # Data fingerprint: a unique identifier for one scraped record
                # extract_first() returns None when nothing matches, so fall back to ''
                author = div.xpath('./div/a[2]/h2/text() | ./div/span[2]/h2/text()').extract_first() or ''
                content = div.xpath('./a/div/span//text()').extract()
                content = ''.join(content)
    
                item = QiubaiproItem()
                item['author'] = author
                item['content'] = content
    
                # Build the data fingerprint
                data = author+content
                hash_key = hashlib.sha256(data.encode()).hexdigest()
                # sadd returns 1 if the fingerprint is new and 0 if it already exists
                ex = self.conn.sadd('hash_keys',hash_key)
                if ex == 1:
                    print('New data found......')
                    yield item
                else:
                    print('No new data!')
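
    The fingerprint idea can be tried out without Redis or Scrapy. In the sketch below a Python set stands in for the Redis set hash_keys, and the function names fingerprint and filter_new are hypothetical. One deliberate difference from the spider: a separator is placed between author and content, so that different records such as ('ab', 'c') and ('a', 'bc') do not produce the same fingerprint. The sample records are made up.

```python
import hashlib


def fingerprint(author, content):
    # The NUL separator keeps ('ab', 'c') and ('a', 'bc') from hashing identically.
    data = f"{author}\x00{content}"
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def filter_new(records, seen):
    """Return only the records whose fingerprint is not yet in `seen`."""
    fresh = []
    for author, content in records:
        key = fingerprint(author, content)
        if key not in seen:  # same role as sadd returning 1
            seen.add(key)
            fresh.append((author, content))
    return fresh


seen = set()  # stand-in for the Redis set 'hash_keys'
first_run = filter_new([("alice", "joke 1"), ("bob", "joke 2")], seen)
second_run = filter_new([("alice", "joke 1"), ("carol", "joke 3")], seen)
print(first_run)
print(second_run)
```

    The second run only returns the record that was not seen in the first run, which is what makes the crawl incremental.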

    Items file

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    class QiubaiproItem(scrapy.Item):
        # define the fields for your item here like:
        author = scrapy.Field()
        content = scrapy.Field()

    Settings file

    # -*- coding: utf-8 -*-
    
    # Scrapy settings for qiubaiPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'qiubaiPro'
    
    SPIDER_MODULES = ['qiubaiPro.spiders']
    NEWSPIDER_MODULE = 'qiubaiPro.spiders'
    
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
    
    LOG_LEVEL = 'ERROR'
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'qiubaiPro (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'qiubaiPro.middlewares.QiubaiproSpiderMiddleware': 543,
    #}
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'qiubaiPro.middlewares.QiubaiproDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'qiubaiPro.pipelines.QiubaiproPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
  • Original article: https://www.cnblogs.com/wanglan/p/10840918.html