  • Starting multiple spiders in Scrapy

    The usual way to start a spider

    scrapy crawl spider_name
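
    If several spiders need to be started this way, one option is to invoke the command once per spider from a small Python script. The following is a minimal sketch rather than an official recipe: the helper names `build_crawl_command` and `crawl_sequentially` are hypothetical, and it assumes the `scrapy` executable is on the PATH and the script is run from the project directory.

```python
import subprocess


def build_crawl_command(spider_name, spider_args=None):
    """Return the argv list for `scrapy crawl <spider_name>`."""
    command = ["scrapy", "crawl", spider_name]
    # Each spider argument is passed as a separate `-a NAME=VALUE` option
    for key, value in (spider_args or {}).items():
        command += ["-a", f"{key}={value}"]
    return command


def crawl_sequentially(spider_names, spider_args=None):
    """Run one `scrapy crawl` process per spider, one after another.

    Returns the names of the spiders whose process exited with a
    non-zero status.
    """
    failed = []
    for name in spider_names:
        result = subprocess.run(build_crawl_command(name, spider_args))
        if result.returncode != 0:
            failed.append(name)
    return failed
```

    Calling `crawl_sequentially(["spider_one", "spider_two"])` (hypothetical spider names) starts one process per spider, so every crawl gets its own reactor at the cost of repeated start-up time.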
    

    Launch methods from the official documentation

    Launching from a script

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider(scrapy.Spider):
        # Your spider definition
        ...
    
    process = CrawlerProcess(settings={
        "FEEDS": {
            "items.json": {"format": "json"},
        },
    })
    
    process.crawl(MySpider)
    process.start() # the script will block here until the crawling is finished
    
    python run.py
    

    Launching with CrawlerRunner (recommended)

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    
    reactor.run() # the script will block here until all crawling jobs are finished
    

    Launching with a custom command

    Add the following to settings.py:

    COMMANDS_MODULE = "commands"
    

    Create the file commands/startall.py in the same directory as scrapy.cfg

    The code below targets Scrapy 2.2.0, which should be the version I am using; for 1.8.0, adapt it with reference to scrapy/commands/crawl.py.

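    from scrapy.commands import BaseRunSpiderCommand
    
    
    class Command(BaseRunSpiderCommand):
        # BaseRunSpiderCommand parses the -a NAME=VALUE options into opts.spargs
        requires_project = True
    
        def syntax(self):
            # No <spider> argument: the command runs every spider in the project
            return "[options]"
    
        def short_desc(self):
            return "Run all spiders"
    
        def run(self, args, opts):
            # Schedule every spider that the project's spider loader can find
            for spider_name in sorted(self.crawler_process.spider_loader.list()):
                self.crawler_process.crawl(spider_name, **opts.spargs)
            # Blocks until all scheduled spiders have finished
            self.crawler_process.start()
            if self.crawler_process.bootstrap_failed:
                self.exitcode = 1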
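
    With the file in place, the command should be available as `scrapy startall`, and the `opts.spargs` dictionary handed to every spider is built from the `-a NAME=VALUE` options given on the command line. The stand-alone sketch below only illustrates that conversion; `parse_spider_args` is a hypothetical helper, not part of Scrapy, which does its own parsing internally.

```python
def parse_spider_args(pairs):
    """Turn ["name=value", ...] into a dict, similar to what -a produces."""
    spargs = {}
    for pair in pairs:
        # Split on the first "=" only, so values may contain "=" themselves
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"Invalid -a value {pair!r}, use -a NAME=VALUE")
        spargs[name] = value
    return spargs


print(parse_spider_args(["category=books", "query=a=b"]))
# prints {'category': 'books', 'query': 'a=b'}
```

    For instance, `scrapy startall -a category=books` would pass `category="books"` as a keyword argument to every spider in the project, so the argument has to make sense for all of them.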
    
  • Original article: https://www.cnblogs.com/iFanLiwei/p/13257462.html