  Scrapy: pausing and resuming crawls

    Jobs: pausing and resuming crawls [1]

    Sometimes, for big sites, it’s desirable to pause crawls and be able to resume them later.

    Scrapy supports this functionality out of the box by providing the following facilities:

    • a scheduler that persists scheduled requests on disk
    • a duplicates filter that persists visited requests on disk
    • an extension that keeps some spider state (key/value pairs) persistent between batches

    Job directory

    To enable persistence support you just need to define a job directory through the JOBDIR setting. This directory will be used for storing all required data to keep the state of a single job (i.e. a spider run). It's important to note that this directory must not be shared by different spiders, or even by different jobs/runs of the same spider, as it is meant to store the state of a single job.

    How to use it

    To start a spider with persistence support enabled, run it like this:

    scrapy crawl somespider -s JOBDIR=crawls/somespider-1
    

    Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:

    scrapy crawl somespider -s JOBDIR=crawls/somespider-1
    

    Keeping persistent state between batches

    Sometimes you’ll want to keep some persistent spider state between pause/resume batches. You can use the spider.state attribute for that, which should be a dict. There’s a built-in extension that takes care of serializing, storing and loading that attribute from the job directory, when the spider starts and stops.

    Here’s an example of a callback that uses the spider state (other spider code is omitted for brevity):

    def parse_item(self, response):
        # parse item here
        self.state['items_count'] = self.state.get('items_count', 0) + 1
    
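    A minimal sketch of how this persisted state can be read back across runs; the spider name, start URL and the pages_seen key below are illustrative assumptions, not part of the official example:

    import scrapy

    class StatefulSpider(scrapy.Spider):
        # hypothetical spider used only to illustrate spider.state persistence
        name = 'stateful'
        start_urls = ['http://www.example.com']

        def parse(self, response):
            # count responses seen across all pause/resume batches; the built-in
            # SpiderState extension reloads self.state from JOBDIR when resuming
            self.state['pages_seen'] = self.state.get('pages_seen', 0) + 1

        def closed(self, reason):
            # called at shutdown (including after Ctrl-C); logs the running total
            self.logger.info('pages seen so far: %d', self.state.get('pages_seen', 0))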

    Persistence gotchas

    There are a few things to keep in mind if you want to be able to use the Scrapy persistence support:

    Cookies expiration

    Cookies may expire. So, if you don't resume your spider quickly, the scheduled requests may no longer work. This won't be an issue if your spider doesn't rely on cookies.

    Request serialization

    Requests must be serializable by the pickle module, in order for persistence to work, so you should make sure that your requests are serializable.

    The most common issue here is using lambda functions as request callbacks, since lambdas cannot be persisted (pickled).

    So, for example, this won’t work:

    def some_callback(self, response):
        somearg = 'test'
        return scrapy.Request('http://www.example.com', callback=lambda r: self.other_callback(r, somearg))
    
    def other_callback(self, response, somearg):
        print("the argument passed is: %s" % somearg)
    

    But this will:

    def some_callback(self, response):
        somearg = 'test'
        return scrapy.Request('http://www.example.com', callback=self.other_callback, meta={'somearg': somearg})
    
    def other_callback(self, response):
        somearg = response.meta['somearg']
        print("the argument passed is: %s" % somearg)
    
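    On Scrapy 1.7 and later, cb_kwargs offers another way to pass arguments to a callback; the keyword arguments are stored as part of the request, so they are persisted to disk along with it as long as the values are picklable. A sketch adapted from the example above:

    def some_callback(self, response):
        somearg = 'test'
        # cb_kwargs travels with the request through the JOBDIR disk queues,
        # so no lambda is needed and the request stays serializable
        return scrapy.Request('http://www.example.com',
                              callback=self.other_callback,
                              cb_kwargs={'somearg': somearg})

    def other_callback(self, response, somearg):
        print("the argument passed is: %s" % somearg)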

    If you wish to log the requests that couldn't be serialized, you can set the SCHEDULER_DEBUG setting to True in the project's settings file. It is False by default.
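    For example, in the project's settings.py (a one-line illustration, not a complete settings file):

    # settings.py
    SCHEDULER_DEBUG = True  # logs scheduler debug info, including requests that could not be serialized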


    Note:
    To persist the intermediate crawl state while the spider is running, JOBDIR can be set in two ways:
    • Option 1: set JOBDIR = 'path' in the project's settings.py file.
    • Option 2: specify it in the spider file itself via custom_settings (a fuller sketch follows after this snippet):
    custom_settings = {
        "JOBDIR": "path"
    }
    
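    A minimal sketch of the second approach; the spider name, start URL and job directory below are placeholders chosen for illustration:

    import scrapy

    class SomeSpider(scrapy.Spider):
        # hypothetical spider that sets JOBDIR for itself via custom_settings
        name = 'somespider'
        start_urls = ['http://www.example.com']

        custom_settings = {
            # each spider, and each separate run of it, should use its own directory
            'JOBDIR': 'crawls/somespider-1',
        }

        def parse(self, response):
            # parse item here
            pass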

    On both Windows and Linux, pressing Ctrl+C once sends the process an interrupt signal; pressing Ctrl+C a second time force-kills the process.

    On Linux, terminating the process with a catchable signal, e.g. pkill -f main.py (which sends SIGTERM), gives Scrapy an interrupt it can act on, so it can finish its shutdown work and persist state. Sending SIGKILL instead, e.g. pkill -9 -f main.py, cannot be caught: the operating system kills the process immediately and no further processing takes place.

    Example:
    scrapy crawl jobbole -s JOBDIR=job_info/001
    
    • -s is short for "set": it sets a Scrapy setting from the command line.
    • Different spiders need different job directories, and runs of the same spider started at different times also need different directories.
    • After Ctrl-C, the pause state is saved to job_info/001; to resume, run scrapy crawl jobbole -s JOBDIR=job_info/001 again and the crawl will continue with whatever was left unfinished.



    [1] Scrapy official documentation, "Jobs: pausing and resuming crawls": https://doc.scrapy.org/en/latest/topics/jobs.html (accessed 2019-3-8)
