  • Python Scrapy: running multiple spiders at the same time

    Suppose the spiders folder contains several files:

    name.py     name = 'name'

    name1.py    name = 'name1'

    name2.py    name = 'name2'

    ...

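    This builds on the previous article, http://www.cnblogs.com/chaihy/p/9044574.html

    That article queries a list by condition. The WHERE clause of the query can select the first 1000 rows, rows 1000-2000, rows 2000-3000 ... so each spider file crawls its own slice and all of them run at the same time. The effect is much like multi-process handling, although the crawlall command below runs the spiders concurrently inside a single process.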
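
The slicing idea can be sketched in plain Python. This is only an illustration under assumptions: the helper names `batch_windows` and `build_query`, the table name `items`, the `id` column, and the LIMIT/OFFSET form of the query are all hypothetical and are not taken from the previous article.

```python
def batch_windows(total_rows, batch_size=1000):
    """Yield (offset, limit) pairs that cover total_rows in chunks of batch_size."""
    for offset in range(0, total_rows, batch_size):
        # The last window may be shorter than batch_size.
        yield offset, min(batch_size, total_rows - offset)


def build_query(table, offset, limit):
    """Build the query for one slice (hypothetical table and column names)."""
    return f"SELECT * FROM {table} ORDER BY id LIMIT {limit} OFFSET {offset}"


# One query per spider: each spider would be given a different window.
queries = [build_query("items", offset, limit)
           for offset, limit in batch_windows(2500)]
```

Each spider (name, name1, name2, ...) would then run one of these queries, so no two spiders fetch the same rows.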

    First create a commands folder at the same level as the spiders folder.

    Create these files in the commands folder:

            crawlall.py

            __init__.py (an empty file)
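
The layout can also be created from Python rather than by hand. A minimal sketch, assuming the project package directory is passed in; the function name `create_commands_package` is hypothetical.

```python
from pathlib import Path


def create_commands_package(project_package):
    """Create <project_package>/commands containing __init__.py and crawlall.py."""
    commands_dir = Path(project_package) / "commands"
    commands_dir.mkdir(parents=True, exist_ok=True)
    # An empty __init__.py makes the folder an importable package.
    (commands_dir / "__init__.py").touch()
    # Placeholder for the command; its contents are filled in afterwards.
    (commands_dir / "crawlall.py").touch()
    return commands_dir
```

Calling `create_commands_package("project")` from the directory that holds the project package leaves `project/commands` next to `project/spiders`.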

    The contents of crawlall.py are as follows (it picks up every spider in the spiders folder):

    from scrapy.commands import ScrapyCommand
    from scrapy.exceptions import UsageError
    from scrapy.utils.conf import arglist_to_dict


    class Command(ScrapyCommand):
        requires_project = True

        def syntax(self):
            return '[options]'

        def short_desc(self):
            return 'Runs all of the spiders'

        def add_options(self, parser):
            # optparse-style API; Scrapy releases whose commands use argparse
            # need parser.add_argument(...) here instead.
            ScrapyCommand.add_options(self, parser)
            parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                              help="set spider argument (may be repeated)")
            parser.add_option("-o", "--output", metavar="FILE",
                              help="dump scraped items into FILE (use - for stdout)")
            parser.add_option("-t", "--output-format", metavar="FORMAT",
                              help="format to use for dumping items with -o")

        def process_options(self, args, opts):
            ScrapyCommand.process_options(self, args, opts)
            try:
                # Turn ["NAME=VALUE", ...] into {"NAME": "VALUE", ...}
                opts.spargs = arglist_to_dict(opts.spargs)
            except ValueError:
                # If this were ignored, spargs would stay a list and
                # **opts.spargs would fail in run()
                raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

        def run(self, args, opts):
            spider_loader = self.crawler_process.spider_loader
            # Schedule the spiders named on the command line, or all of them
            for spidername in args or spider_loader.list():
                print("*********crawlall spidername************" + spidername)
                self.crawler_process.crawl(spidername, **opts.spargs)
            # Start once; every scheduled spider then runs concurrently
            self.crawler_process.start()
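
To make the `-a NAME=VALUE` handling concrete, here is a standalone sketch of the kind of conversion `arglist_to_dict` performs. It is an illustration, not Scrapy's own code, and the function name `parse_spider_args` is hypothetical.

```python
def parse_spider_args(arglist):
    """Convert ["NAME=VALUE", ...] into {"NAME": "VALUE", ...}.

    Splitting on the first "=" only lets values contain "=" themselves.
    An item without "=" yields a single-element list, so dict() raises
    ValueError, which is the error process_options catches.
    """
    return dict(item.split("=", 1) for item in arglist)


spargs = parse_spider_args(["start=0", "limit=1000"])
```

The resulting dict is what gets unpacked as keyword arguments in `crawl(spidername, **opts.spargs)`, so every spider started by the command receives the same arguments. The values are always strings.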

    Settings (in settings.py; replace project with the name of your project package):

    COMMANDS_MODULE = 'project.commands'

    Run the command:

    scrapy crawlall
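
For comparison, the same effect can be had from a plain Python script instead of a custom command, using Scrapy's `CrawlerProcess` and `get_project_settings`. This is a sketch that assumes it is run from inside the project directory (where scrapy.cfg lives); the function name `run_all_spiders` is hypothetical, and the Scrapy imports sit inside the function only so that the module can be imported where Scrapy is not installed.

```python
def run_all_spiders(**spider_args):
    """Schedule every spider in the project, then run them in one process."""
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    for spider_name in process.spider_loader.list():
        # spider_args are passed to each spider, like -a NAME=VALUE
        process.crawl(spider_name, **spider_args)
    # Blocks until all spiders have finished
    process.start()
```

Calling `run_all_spiders()` at the bottom of a script and running it with `python` does what `scrapy crawlall` does, without the commands folder or the COMMANDS_MODULE setting.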
  • Original article: https://www.cnblogs.com/chaihy/p/9044792.html