The common way to start a spider
scrapy crawl spider_name
The officially documented approaches
Starting from a script
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
python run.py
Starting with CrawlerRunner (recommended)
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished
Starting with a custom command
Add the following to settings.py:

COMMANDS_MODULE = "commands"
Then create a commands/startall.py file in the same directory as scrapy.cfg (the package name must match COMMANDS_MODULE).
My Scrapy version here is 2.2.0; if you are on 1.8.0, adapt the command by referring to scrapy/commands/crawl.py.
from scrapy.commands import BaseRunSpiderCommand

class Command(BaseRunSpiderCommand):
    requires_project = True

    def syntax(self):
        return "[options]"

    def short_desc(self):
        return "Run all spiders"

    def run(self, args, opts):
        # queue every spider registered in the project
        for spider_name in sorted(self.crawler_process.spider_loader.list()):
            self.crawler_process.crawl(spider_name, **opts.spargs)
        # start all queued crawls; blocks until they finish
        self.crawler_process.start()
        if self.crawler_process.bootstrap_failed:
            self.exitcode = 1
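With COMMANDS_MODULE set and the file in place, every spider in the project can then be started with the new subcommand (run from the project root, next to scrapy.cfg):

```shell
scrapy startall
```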