  • Starting one or more Scrapy spiders through the core API

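    1. You can run Scrapy from a script through its API instead of the usual scrapy crawl command. Scrapy is built on the Twisted asynchronous networking library, so it has to run inside the Twisted reactor. Two APIs can run one or more spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.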

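    2. The first utility for starting spiders is scrapy.crawler.CrawlerProcess. This class starts a Twisted reactor for you, configures logging, and sets shutdown handlers. It is the class used by all Scrapy commands.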

    Example that runs a single spider:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start() # the script will block here until the crawling is finished

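    When the script runs inside a Scrapy project, pass CrawlerProcess a Settings instance obtained from get_project_settings so that the project settings apply. Spiders can then be referenced by name, and keyword arguments given to crawl are forwarded to the spider.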

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    
    process = CrawlerProcess(get_project_settings())
    
    # 'followall' is the name of one of the spiders of the project.
    
    process.crawl('followall', domain='scrapinghub.com')
    
    process.start() # the script will block here until the crawling is finished
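    1. Another Scrapy utility gives more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper around some simple helpers for running multiple crawlers, but it will not start or interfere with existing reactors in any way.
    2. With this class, the reactor must be run explicitly after scheduling your spiders. If your application already uses Twisted and you want to run Scrapy in the same reactor, use CrawlerRunner instead of CrawlerProcess.
    3. Note that you also have to shut down the Twisted reactor yourself after the spider finishes, by adding callbacks to the deferred returned by the CrawlerRunner.crawl method.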

      Here is an example of its usage, with a callback that manually stops the reactor after MySpider has finished running.

    from twisted.internet import reactor
    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider(scrapy.Spider):
    
        # Your spider definition
    
        ...
    
    
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    
    runner = CrawlerRunner()
    
    
    d = runner.crawl(MySpider)
    
    d.addBoth(lambda _: reactor.stop())
    
    reactor.run() # the script will block here until the crawling is finished

    Running multiple spiders in the same process

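    By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process using the internal API.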

    Here is an example that runs multiple spiders simultaneously:

    import scrapy
    from scrapy.crawler import CrawlerProcess
    
    class MySpider1(scrapy.Spider):
    
        # Your first spider definition
    
        ...
    
    class MySpider2(scrapy.Spider):
    
        # Your second spider definition
    
        ...
    
    
    process = CrawlerProcess()
    
    process.crawl(MySpider1)
    
    process.crawl(MySpider2)
    
    process.start() # the script will block here until all crawling jobs are finished

    The same example using CrawlerRunner:

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
    
        # Your first spider definition
    
        ...
    
    class MySpider2(scrapy.Spider):
    
        # Your second spider definition
    
        ...
    
    
    configure_logging()
    
    runner = CrawlerRunner()
    
    runner.crawl(MySpider1)
    
    runner.crawl(MySpider2)
    
    d = runner.join()
    
    d.addBoth(lambda _: reactor.stop())
    
    
    reactor.run() # the script will block here until all crawling jobs are finished

    The same example, but running the spiders sequentially by chaining the deferreds:

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    
    class MySpider1(scrapy.Spider):
    
        # Your first spider definition
    
        ...
    
    class MySpider2(scrapy.Spider):
    
        # Your second spider definition
    
        ...
    
    
    configure_logging()
    
    runner = CrawlerRunner()
    
    @defer.inlineCallbacks
    def crawl():
    
        yield runner.crawl(MySpider1)
    
        yield runner.crawl(MySpider2)
    
        reactor.stop()
    
    
    crawl()
    
    reactor.run() # the script will block here until the last crawl call is finished
  • Original article: https://www.cnblogs.com/pypypy/p/12207716.html