  • How do you run multiple spiders at the same time in Scrapy?

      By default, when you run the scrapy crawl command, Scrapy runs only a single spider per process. Besides the command line, however, Scrapy can also be driven through its API, and the API does support running multiple spiders in the same process.

      The following example runs multiple spiders:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    process = CrawlerProcess()  # CrawlerProcess manages the Twisted reactor for you
    process.crawl(MySpider1)    # schedule the first spider in this process
    process.crawl(MySpider2)    # schedule the second spider in this process
    process.start()             # the script will block here until all crawling jobs are finished
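
      Each crawl() call also forwards extra keyword arguments to the spider's constructor, and CrawlerProcess accepts a settings dict that applies to every crawl it starts. A minimal sketch of both (the TaggedSpider class and the quotes.toscrape.com URL are illustrative, not part of the original post):

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class TaggedSpider(scrapy.Spider):
        # hypothetical spider parameterized by a tag passed in at crawl time
        name = 'tagged'

        def __init__(self, tag=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.tag = tag
            self.start_urls = [f'http://quotes.toscrape.com/tag/{tag}/']

        def parse(self, response):
            for text in response.css('div.quote span.text::text').getall():
                yield {'tag': self.tag, 'text': text}

    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})  # settings shared by all crawls
    process.crawl(TaggedSpider, tag='humor')  # extra kwargs go to the spider's __init__
    process.crawl(TaggedSpider, tag='books')
    process.start()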
    

      Using CrawlerRunner is also possible; here the script has to manage the Twisted reactor itself:

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()  # CrawlerRunner does not set up logging, so do it explicitly
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()                    # fires when all scheduled crawls have finished
    d.addBoth(lambda _: reactor.stop())  # then stop the reactor
    
    reactor.run() # the script will block here until all crawling jobs are finished
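
      When the script lives inside a Scrapy project, the runner can also be handed the project settings so that the pipelines and middlewares from settings.py take effect, and spiders can be referenced by their name attribute. A minimal sketch, assuming a project whose spiders are registered under the hypothetical names 'myspider1' and 'myspider2':

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()    # reads the project's settings.py
    configure_logging(settings)
    runner = CrawlerRunner(settings)
    runner.crawl('myspider1')            # spiders can be looked up by name
    runner.crawl('myspider2')
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()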
    

      By chaining the deferreds, the spiders can also be run sequentially, one after the other:

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    
    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(MySpider1)  # wait until the first spider finishes
        yield runner.crawl(MySpider2)  # only then start the second one
        reactor.stop()
    
    crawl()
    reactor.run() # the script will block here until the last crawl call is finished
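
      The same chaining pattern extends to a loop, which is handy for running one spider repeatedly with different arguments. A sketch reusing the hypothetical TaggedSpider from the first example:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        # each yield waits for the previous crawl to finish before starting the next
        for tag in ['humor', 'books', 'life']:
            yield runner.crawl(TaggedSpider, tag=tag)  # TaggedSpider as sketched above
        reactor.stop()

    crawl()
    reactor.run()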
    

      For more details, see:

           the Scrapy documentation

  • Original article: https://www.cnblogs.com/renshaoqi/p/11177166.html