  • How do you run multiple spiders at the same time in Scrapy?

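      By default, the scrapy crawl command runs a single spider per process. Besides the command line, however, Scrapy can also be driven through its API, and spiders started through the API can run several at a time in the same process.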

      The following example runs multiple spiders:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
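    process = CrawlerProcess()  # create the process that manages the Twisted reactor (event loop)
    process.crawl(MySpider1)  # schedule the first spider on the event loop
    process.crawl(MySpider2)  # schedule the second spider on the event loop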
    process.start() # the script will block here until all crawling jobs are finished
    

      Alternatively, you can use CrawlerRunner, in which case you start and stop the Twisted reactor yourself:

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    
    reactor.run() # the script will block here until all crawling jobs are finished
    

      To run the spiders one after another instead, chain the deferreds:

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    
    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()
    
    crawl()
    reactor.run() # the script will block here until the last crawl call is finished
    

      For more details, see:

           Scrapy documentation

  • Original article: https://www.cnblogs.com/renshaoqi/p/11177166.html