  • How to start a Scrapy spider from a script

    As everyone knows, running `scrapy crawl yourspidername` on the command line starts the spider named yourspidername in the current project. From a Python script, you can invoke the same command line through the cmdline module:

    $ cat yourspider1start.py
    import os
    import sys
    import subprocess
    from scrapy import cmdline
    
    # Method 1
    # Note: cmdline.execute() does not return; it exits the process once the crawl finishes.
    cmdline.execute('scrapy crawl yourspidername'.split())
    
    # Method 2: set sys.argv and let execute() parse it
    sys.argv = ['scrapy', 'crawl', 'yourspidername']
    cmdline.execute()
    
    # Method 3: spawn a child process to run an external program.
    # The call only returns the program's exit status; 0 means success.
    os.system('scrapy crawl yourspidername')
    
    # Method 4
    subprocess.Popen(['scrapy', 'crawl', 'yourspidername'])
    

    Of methods 3 and 4, subprocess is the recommended choice:

    subprocess module intends to replace several other, older modules and functions, such as:

    os.system
    os.spawn*
    os.popen*
    popen2.*
    commands.*

    The poll() method of the Popen object it returns lets you check whether the child process has finished.
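
    For example, here is a minimal sketch of waiting for a spider started via subprocess (the spider name yourspidername is just the placeholder used above):

    import subprocess
    import time
    
    # Launch the spider as a child process
    proc = subprocess.Popen(['scrapy', 'crawl', 'yourspidername'])
    
    # poll() returns None while the child is still running,
    # and the exit code once it has finished
    while proc.poll() is None:
        time.sleep(1)
    
    print('spider finished with exit code', proc.returncode)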

    We can also start all the spiders directly from a shell script, launching them 2 seconds apart:

    $ cat startspiders.sh
    #!/usr/bin/env bash
    # $1: number of launch rounds; each round starts both spiders in the background
    count=0
    while [ $count -lt $1 ];
    do
      sleep 2
      nohup python yourspider1start.py >/dev/null 2>&1 &
      nohup python yourspider2start.py >/dev/null 2>&1 &
      let count+=1
    done
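
    For example, `bash startspiders.sh 5` launches each of the two spiders five times, two seconds apart.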
    

    All the methods above ultimately just invoke the scrapy command line. How can we start spiders programmatically, by calling Scrapy's internal APIs instead?

    The official documentation provides two Scrapy utilities:

    1. scrapy.crawler.CrawlerRunner, runs crawlers inside an already setup Twisted reactor
    2. scrapy.crawler.CrawlerProcess, whose parent class is CrawlerRunner and which also manages the Twisted reactor for you

    Scrapy is built on the Twisted asynchronous networking library; CrawlerRunner and CrawlerProcess help us start Scrapy from inside a Twisted reactor.

    Using CrawlerRunner directly gives you finer-grained control over the crawl, but you must manually register a callback that stops the Twisted reactor after the spiders finish. If you do not intend to run any other Twisted reactor in your application, the subclass CrawlerProcess is the more convenient choice.

    Below are simple usage examples taken from the documentation:

    # encoding: utf-8
    __author__ = 'fengshenjie'
    from twisted.internet import reactor
    from scrapy.utils.project import get_project_settings
    
    def run1_single_spider():
        '''Running spiders outside a project.
        Only the spider itself runs; the project's pipelines are not used.'''
        from scrapy.crawler import CrawlerProcess
        from scrapy_test1.spiders import myspider1  # myspider1 is a Spider subclass from the project
        process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
        })
    
        process.crawl(myspider1)
        process.start()  # the script will block here until the crawling is finished
    
    def run2_inside_scrapy():
        '''Uses the project settings, so the pipelines are enabled.'''
        from scrapy.crawler import CrawlerProcess
        process = CrawlerProcess(get_project_settings())
        process.crawl('spidername') # the `name` attribute of a spider in the Scrapy project
        process.start()
    
    def spider_closing(arg):
        print('spider close')
        reactor.stop()
    
    def run3_crawlerRunner():
        '''If your application already uses Twisted, CrawlerRunner is recommended instead of CrawlerProcess.
        Note that you will also have to shutdown the Twisted reactor yourself after the spider is finished. This can be achieved by adding callbacks to the deferred returned by the CrawlerRunner.crawl method.
        '''
        from scrapy.crawler import CrawlerRunner
        runner = CrawlerRunner(get_project_settings())
    
        # 'spidername' is the name of one of the spiders of the project.
        d = runner.crawl('spidername')
        
        # stop reactor when spider closes
        # d.addBoth(lambda _: reactor.stop())
        d.addBoth(spider_closing) # equivalent to the lambda above
    
        reactor.run()  # the script will block here until the crawling is finished
    
    def run4_multiple_spider():
        from scrapy.crawler import CrawlerProcess
        process = CrawlerProcess()
    
        from scrapy_test1.spiders import myspider1, myspider2
        for s in [myspider1, myspider2]:
            process.crawl(s)
        process.start()
    
    def run5_multiplespider():
        '''using CrawlerRunner'''
        from twisted.internet import reactor
        from scrapy.crawler import CrawlerRunner
        from scrapy.utils.log import configure_logging
    
        configure_logging()
        runner = CrawlerRunner()
        from scrapy_test1.spiders import myspider1, myspider2
        for s in [myspider1, myspider2]:
            runner.crawl(s)
    
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
    
        reactor.run()  # the script will block here until all crawling jobs are finished
    
    def run6_multiplespider():
        '''Run the spiders sequentially by chaining the deferreds.'''
        from twisted.internet import reactor, defer
        from scrapy.crawler import CrawlerRunner
        from scrapy.utils.log import configure_logging
        configure_logging()
        runner = CrawlerRunner()
    
        @defer.inlineCallbacks
        def crawl():
            from scrapy_test1.spiders import myspider1, myspider2
            for s in [myspider1, myspider2]:
                yield runner.crawl(s)
            reactor.stop()
    
        crawl()
        reactor.run()  # the script will block here until the last crawl call is finished
    
    
    if __name__=='__main__':
        # run4_multiple_spider()
        # run5_multiplespider()
        run6_multiplespider()
    
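    As a further note, CrawlerProcess.crawl() (like CrawlerRunner.crawl()) forwards extra positional and keyword arguments to the spider's constructor, so a launcher script can parameterize its spiders. Below is a minimal sketch of that; the spider name 'spidername' and the category argument are placeholders, not part of the examples above:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    def run7_spider_with_arguments():
        process = CrawlerProcess(get_project_settings())
        # Keyword arguments are forwarded to the spider's __init__,
        # equivalent to `scrapy crawl spidername -a category=books`
        process.crawl('spidername', category='books')
        process.start()

    With the default Spider.__init__, the value is then available inside the spider as self.category.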

    References

    1. Running a Scrapy spider programmatically (based on Scrapy 1.0)