  • Run Scrapy from a script

    Documentation: https://www.osgeo.cn/scrapy/topics/practices.html

    1. scrapy.crawler.CrawlerProcess

      Scrapy is built on top of the Twisted asynchronous networking framework, so it needs to run inside the Twisted reactor.

      You can run your spiders with the scrapy.crawler.CrawlerProcess class. It starts a Twisted reactor for you, and configures logging and shutdown handlers. All Scrapy commands use this class.

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    process = CrawlerProcess(get_project_settings())
    
    # 'followall' is the name of one of the spiders of the project.
    process.crawl('followall', domain='scrapinghub.com')
    process.start() # the script will block here until the crawling is finished

      Use get_project_settings to get a Settings instance populated with your project's settings.
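
      If you are not inside a Scrapy project, CrawlerProcess also accepts a plain settings dict instead of a Settings object. A minimal stand-alone sketch under that assumption (the spider, the target site, and the items.json feed path are illustrative, and the FEEDS setting requires Scrapy 2.1+):

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class QuotesSpider(scrapy.Spider):
        # Hypothetical self-contained spider; any spider class works here.
        name = 'quotes'
        start_urls = ['https://quotes.toscrape.com/']
    
        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {'text': quote.css('span.text::text').get()}
    
    # Settings are passed directly instead of via get_project_settings();
    # the FEEDS setting exports scraped items to a local JSON file.
    process = CrawlerProcess(settings={
        'FEEDS': {'items.json': {'format': 'json'}},
    })
    process.crawl(QuotesSpider)
    process.start()  # blocks until the crawl is finished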

    2. scrapy.crawler.CrawlerRunner

      With this class, the reactor should be run explicitly after scheduling your spiders. Using CrawlerRunner instead of CrawlerProcess is recommended if your application is already using Twisted and you want to run Scrapy in the same reactor.

    from twisted.internet import reactor
    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider(scrapy.Spider):
        # Your spider definition
        ...
    
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    
    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run() # the script will block here until the crawling is finished
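
      crawl() returns a Twisted Deferred, so a failed crawl can be inspected like any other Deferred result before stopping the reactor. A hedged variation of the example above (the spider body and the error reporting are illustrative, not part of the original post):

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['https://example.com']
    
        def parse(self, response):
            yield {'url': response.url}
    
    configure_logging()
    # CrawlerRunner also accepts a Settings object or a settings dict.
    runner = CrawlerRunner()
    
    d = runner.crawl(MySpider)
    # The errback only fires if the crawl fails; report the failure first ...
    d.addErrback(lambda failure: print('crawl failed:', failure))
    # ... then stop the reactor whether the crawl succeeded or not.
    d.addBoth(lambda _: reactor.stop())
    reactor.run()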

    3. Running multiple spiders in the same process

      By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through the internal API, as the two examples below show.

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start() # the script will block here until all crawling jobs are finished

    The same example, using CrawlerRunner:

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    
    reactor.run() # the script will block here until all crawling jobs are finished

    Run the spiders sequentially by chaining the Deferreds:

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    
    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()
    
    crawl()
    reactor.run() # the script will block here until the last crawl call is finished
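
      Keyword arguments passed to crawl() are forwarded to the spider's constructor, and the default Spider.__init__ stores them as instance attributes, so each crawl in the chain can be parameterized. A sketch under that assumption (the spider, the URL pattern, and the category values are made up):

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class CategorySpider(scrapy.Spider):
        # Hypothetical spider that builds its start URL from a crawl() argument.
        name = 'category'
    
        def start_requests(self):
            # self.category was set from the keyword argument passed to crawl().
            yield scrapy.Request(f'https://example.com/{self.category}')
    
        def parse(self, response):
            yield {'category': self.category, 'url': response.url}
    
    configure_logging()
    runner = CrawlerRunner()
    
    @defer.inlineCallbacks
    def crawl():
        # Each iteration waits for the previous crawl to finish.
        for category in ('books', 'music'):
            yield runner.crawl(CategorySpider, category=category)
        reactor.stop()
    
    crawl()
    reactor.run()  # blocks until the last crawl call is finished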
  • Original post: https://www.cnblogs.com/Mint-diary/p/14507583.html