zoukankan      html  css  js  c++  java
  • Scrapy怎样同时运行多个爬虫?

      默认情况下,当你运行 scrapy crawl 命令的时候,scrapy只能在单个进程里面运行一个爬虫。然后Scrapy运行方式除了采用命令行式的运行方式以外还可以使用API的方式来运行爬虫,而采用API的方式运行的爬虫是支持运行多个爬虫的。

      下面的案例是运行多个爬虫:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    process = CrawlerProcess() # 初始化事件循环
    process.crawl(MySpider1) # 将爬虫类方式事件循环
    process.crawl(MySpider2) # 将爬虫类方式事件循环
    process.start() # the script will block here until all crawling jobs are finished
    

      此外采用 CrawlerRunner 也是可行的:

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    
    reactor.run() # the script will block here until all crawling jobs are finished
    

      deferreds的方式来运行:

    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    
    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()
    
    crawl()
    reactor.run() # the script will block here until the last crawl call is finished
    

      更多细节参考:

           Scrapy文档

  • 相关阅读:
    #include <utility>
    Html的空格显示
    ExtJs自学教程(1):一切从API開始
    天黑的时候,我又想起那首歌
    citrix协议ICA技术原理
    约瑟夫环问题
    数据结构和算法设计专题之---八大内部排序
    HDU
    深入分析C++引用
    八大排序算法总结
  • 原文地址:https://www.cnblogs.com/renshaoqi/p/11177166.html
Copyright © 2011-2022 走看看