  • 5 Ways to Run a Scrapy Spider from a Script

    I. Running a spider from the command line

    1. Running a spider (two ways)
    Run a spider from within a project
    $ scrapy crawl spidername

    Run a spider without creating a project
    $ scrapy runspider spidername.py
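
    For the second form, the Python file itself must contain the spider class, because there is no project for Scrapy to search. A minimal sketch of such a standalone file (the file name and URL are placeholders):

    # -*- coding: utf-8 -*-
    # standalone.py -- run with: scrapy runspider standalone.py

    from scrapy import Spider


    class StandaloneSpider(Spider):
        name = 'standalone'

        start_urls = ['http://example.com/']

        def parse(self, response):
            # log the page title as a quick sanity check
            self.log(response.xpath('//title/text()').get())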

    II. Running a spider from a Python file


    1. Running a spider with cmdline

    # -*- coding: utf-8 -*-
    
    from scrapy import cmdline, Spider
    
    
    class BaiduSpider(Spider):
        name = 'baidu'
    
        start_urls = ['http://baidu.com/']
    
        def parse(self, response):
            self.log("run baidu")
    
    
    if __name__ == '__main__':
        cmdline.execute("scrapy crawl baidu".split())
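
    The command string passed to cmdline.execute accepts the same flags as the shell command. A sketch that lowers the log level and passes a spider argument (the category argument is hypothetical and only works if the spider's __init__ accepts it):

    # -*- coding: utf-8 -*-

    from scrapy import cmdline

    # -s overrides a setting for this run, -a passes a keyword argument
    # to the spider's __init__ (the 'category' argument is hypothetical)
    cmdline.execute("scrapy crawl baidu -s LOG_LEVEL=INFO -a category=news".split())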


    2. Running a spider with CrawlerProcess

    # -*- coding: utf-8 -*-
    
    from scrapy import Spider
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    class BaiduSpider(Spider):
        name = 'baidu'
    
        start_urls = ['http://baidu.com/']
    
        def parse(self, response):
            self.log("run baidu")
    
    
    if __name__ == '__main__':
        # get_project_settings() reads the configuration from the project's settings.py
        process = CrawlerProcess(get_project_settings())
        process.crawl(BaiduSpider)
        process.start()
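
    If the script lives outside a Scrapy project, there is no settings.py for get_project_settings() to read; a plain settings dict can be passed instead. A minimal sketch with illustrative values:

    # -*- coding: utf-8 -*-

    from scrapy.crawler import CrawlerProcess

    if __name__ == '__main__':
        # standalone usage: pass settings inline instead of get_project_settings()
        process = CrawlerProcess(settings={
            'LOG_LEVEL': 'INFO',       # illustrative values only
            'USER_AGENT': 'Mozilla/5.0',
        })
        process.crawl(BaiduSpider)     # BaiduSpider defined as in the example above
        process.start()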

    3. Running a spider with CrawlerRunner

    # -*- coding: utf-8 -*-
    
    from scrapy import Spider
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import reactor
    
    
    class BaiduSpider(Spider):
        name = 'baidu'
    
        start_urls = ['http://baidu.com/']
    
        def parse(self, response):
            self.log("run baidu")
    
    
    if __name__ == '__main__':
        # without configure_logging() the console shows no log output when the script is run directly
        configure_logging(
            {
                'LOG_FORMAT': '%(message)s'
            }
        )
    
        runner = CrawlerRunner()
    
        d = runner.crawl(BaiduSpider)
        d.addBoth(lambda _: reactor.stop())
        reactor.run()

    III. Running multiple spiders from a file


    Create a new spider in the project, SinaSpider:

    # -*- coding: utf-8 -*-
    
    from scrapy import Spider
    
    
    class SinaSpider(Spider):
        name = 'sina'
    
        start_urls = ['https://www.sina.com.cn/']
    
        def parse(self, response):
            self.log("run sina")


    1. cmdline cannot run multiple spiders
    If the two calls are placed back to back, the process exits as soon as the first crawl finishes (cmdline.execute ends by calling sys.exit), so the second call is never reached.

    # -*- coding: utf-8 -*-
    
    from scrapy import cmdline
    
    cmdline.execute("scrapy crawl baidu".split())
    cmdline.execute("scrapy crawl sina".split())


    A script that uses cmdline to run multiple spiders, one child process per spider:

    from multiprocessing import Process
    from scrapy import cmdline
    import time
    import logging
    
    # just configure these parameters: spider name and run interval (seconds)
    confs = [
        {
            "spider_name": "unit42",
            "frequency": 2,
        },
        {
            "spider_name": "cybereason",
            "frequency": 2,
        },
        {
            "spider_name": "Securelist",
            "frequency": 2,
        },
        {
            "spider_name": "trendmicro",
            "frequency": 2,
        },
        {
            "spider_name": "yoroi",
            "frequency": 2,
        },
        {
            "spider_name": "weibi",
            "frequency": 2,
        },
    ]
    
    
    def start_spider(spider_name, frequency):
        args = ["scrapy", "crawl", spider_name]
        while True:
            start = time.time()
            p = Process(target=cmdline.execute, args=(args,))
            p.start()
            p.join()
            logging.debug("### use time: %s" % (time.time() - start))
            time.sleep(frequency)
    
    
    if __name__ == '__main__':
        for conf in confs:
        # start_spider loops forever, so each child process keeps
        # re-running its spider on its own interval
        process = Process(target=start_spider,
                          args=(conf["spider_name"], conf["frequency"]))
            process.start()
            time.sleep(10)


    The two approaches below are more elegant replacements.

    2. Running multiple spiders with CrawlerProcess
    Note: the spider files in the project are:
    scrapy_demo/spiders/baidu.py
    scrapy_demo/spiders/sina.py

    # -*- coding: utf-8 -*-
    
    from scrapy.crawler import CrawlerProcess
    
    from scrapy_demo.spiders.baidu import BaiduSpider
    from scrapy_demo.spiders.sina import SinaSpider
    
    process = CrawlerProcess()
    process.crawl(BaiduSpider)
    process.crawl(SinaSpider)
    process.start()

    Running this way, the log shows the middleware starting only once, and the requests go out almost simultaneously, so the two spiders are not running independently and may interfere with each other.
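
    If the spiders sharing one process need different behaviour, per-spider settings can be moved into the spider's custom_settings class attribute, which applies only to the crawler created for that spider. A minimal sketch (the DOWNLOAD_DELAY value is only an illustration):

    # -*- coding: utf-8 -*-

    from scrapy import Spider


    class SinaSpider(Spider):
        name = 'sina'

        # applied only to the crawler created for this spider
        custom_settings = {
            'DOWNLOAD_DELAY': 1,
        }

        start_urls = ['https://www.sina.com.cn/']

        def parse(self, response):
            self.log("run sina")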

    3. Running multiple spiders with CrawlerRunner

    # -*- coding: utf-8 -*-
    
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import reactor
    
    from scrapy_demo.spiders.baidu import BaiduSpider
    from scrapy_demo.spiders.sina import SinaSpider
    
    
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(BaiduSpider)
    runner.crawl(SinaSpider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    
    reactor.run()

    This approach also loads the middleware only once, but the spiders run one after another, which reduces interference; the official documentation also recommends this method for running multiple spiders.
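
    If the spiders must run strictly one after the other, the Scrapy documentation's pattern chains the crawls with Twisted deferreds. A sketch assuming the same project layout as above:

    # -*- coding: utf-8 -*-

    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from twisted.internet import defer, reactor

    from scrapy_demo.spiders.baidu import BaiduSpider
    from scrapy_demo.spiders.sina import SinaSpider


    configure_logging()
    runner = CrawlerRunner()


    @defer.inlineCallbacks
    def crawl():
        # each yield waits for the previous crawl to finish before the next starts
        yield runner.crawl(BaiduSpider)
        yield runner.crawl(SinaSpider)
        reactor.stop()


    crawl()
    reactor.run()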

    Summary


    Method                         Reads settings.py    Spiders per run
    $ scrapy crawl baidu           yes                  single
    $ scrapy runspider baidu.py    yes                  single
    cmdline.execute                yes                  single (recommended)
    CrawlerProcess                 no                   single or multiple
    CrawlerRunner                  no                   single or multiple (recommended)

    cmdline.execute is the simplest way to configure a single spider file: set it up once, run it many times.
