  • Running Scrapy from a script

    Documentation: https://www.osgeo.cn/scrapy/topics/practices.html

    1. scrapy.crawler.CrawlerProcess

      Scrapy is built on top of the Twisted asynchronous networking framework, so you need to run it inside the Twisted reactor.

      You can use the scrapy.crawler.CrawlerProcess class to run your spider. This class starts a Twisted reactor for you, and configures logging and shutdown handlers. It is the class used by all Scrapy commands.

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    process = CrawlerProcess(get_project_settings())
    
    # 'followall' is the name of one of the spiders of the project.
    process.crawl('followall', domain='scrapinghub.com')
    process.start() # the script will block here until the crawling is finished

      Use get_project_settings to obtain a Settings instance populated with your project's settings.
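
      If your spider is defined in the script itself rather than inside a project, CrawlerProcess also accepts a plain settings dict in place of get_project_settings(). A minimal sketch (the spider, start URL, and FEEDS output path are illustrative assumptions; the FEEDS setting requires Scrapy 2.1+):

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        # hypothetical inline spider, not part of any project
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com']

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {'text': quote.css('span.text::text').get()}

    process = CrawlerProcess(settings={
        'LOG_LEVEL': 'INFO',
        'FEEDS': {'items.json': {'format': 'json'}},  # write items to a JSON file
    })
    process.crawl(QuotesSpider)
    process.start()  # blocks until the crawl is finished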

    2. scrapy.crawler.CrawlerRunner

      With this class, you must run the reactor explicitly after scheduling your spiders. CrawlerRunner is recommended over CrawlerProcess if your application is already using Twisted and you want to run Scrapy in the same reactor.

    from twisted.internet import reactor
    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider(scrapy.Spider):
        # Your spider definition
        ...
    
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    
    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run() # the script will block here until the crawling is finished
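
      If your application already runs its own Twisted reactor, you do not have to start or stop the reactor on Scrapy's behalf: schedule the crawl and let the host application manage the reactor's lifetime. A minimal sketch under that assumption (reusing MySpider from above; the reactor.run() call here stands in for whatever already drives your application):

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()
    runner = CrawlerRunner()

    def schedule_crawl():
        # schedule the crawl on the reactor the host application runs
        d = runner.crawl(MySpider)
        # log failures rather than stopping the shared reactor
        d.addErrback(lambda failure: failure.printTraceback())

    reactor.callWhenRunning(schedule_crawl)
    reactor.run()  # owned by the host application in a real deployment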

    3. Running multiple spiders in the same process

      By default, Scrapy runs a single spider per process when you run scrapy crawl. However, running multiple spiders in the same process is supported, using the internal API.

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start() # the script will block here until all crawling jobs are finished

    The same example, using CrawlerRunner:

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    
    reactor.run() # the script will block here until all crawling jobs are finished
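
    Here runner.join() returns a deferred that fires only after every crawler started by the runner has finished, which is what lets a single reactor.stop() cleanly end both crawls.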

    To run the spiders sequentially instead, chain their deferreds:

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...
    
    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...
    
    configure_logging()
    runner = CrawlerRunner()
    
    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()
    
    crawl()
    reactor.run() # the script will block here until the last crawl call is finished
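
    Note that a Twisted reactor cannot be restarted: reactor.run() (and likewise process.start()) may be called at most once per process, so every crawl has to be scheduled before or while the single reactor is running, as all of the examples above do.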