Scrapy Tips (1): Using Scrapy's built-in follow & follow_all to generate the next request elegantly

Preface

How do you elegantly take the links to crawl next on the same site and turn them into Scrapy requests?

Example
```
from urllib import parse

import scrapy


class SitoiSpider(scrapy.Spider):
    name = "sitoi"

    start_urls = [
        'https://sitoi.cn',
    ]

    def parse(self, response):
        href_list = response.xpath("//div[@class='card']/a/@href").extract()
        for href in href_list:
            url = parse.urljoin(response.url, href)
            yield scrapy.Request(url=url, callback=self.parse_next)

    def parse_next(self, response):
        print(response.url)
```
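Method 1: Join the URL with urllib

This approach uses urllib to turn each extracted href into an absolute URL, then creates a scrapy.Request to crawl the next page.

Pros
1. You can add custom logic while handling each href (for example, recording which page you are currently on).
Cons
1. It needs an extra import (urllib.parse, from the standard library).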
```
def parse(self, response):
    href_list = response.xpath("//div[@class='card']/a/@href").extract()
    for href in href_list:
        url = parse.urljoin(response.url, href)
        yield scrapy.Request(url=url, callback=self.parse_next)
```
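To make the joining step concrete, here is a small standalone sketch of how urljoin resolves different kinds of href against the page URL. The base path and the href values are made up for illustration; they are not taken from the real site.

```python
from urllib.parse import urljoin

# Hypothetical page URL; in a spider this would be response.url.
base = "https://sitoi.cn/posts/index.html"

# Hypothetical href values of the kinds commonly found in <a> tags.
hrefs = [
    "page2.html",             # relative to the current directory
    "/about",                 # relative to the site root
    "../tags/",               # relative, going up one directory
    "https://example.com/x",  # already absolute, returned unchanged
]

resolved = [urljoin(base, href) for href in hrefs]
for href, url in zip(hrefs, resolved):
    print(f"{href!r:26} -> {url}")
```

Because already-absolute links pass through unchanged, the same call is safe to use on every href the XPath returns.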
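Method 2: Use the response's own urljoin

This approach uses the urljoin method of the Scrapy response to build the absolute URL, then creates a scrapy.Request to crawl the next page. (It is essentially the same as Method 1.)

Pros
1. No extra import is needed in the spider file.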
```
def parse(self, response):
    href_list = response.xpath("//div[@class='card']/a/@href").extract()
    for href in href_list:
        url = response.urljoin(href)
        yield scrapy.Request(url=url, callback=self.parse_next)
```
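Method 3: Use the response's own follow

This approach uses the follow method of the Scrapy response to create the request for the next page.

Pros
1. No extra import is needed in the spider file.
2. You do not have to call extract() to get the href string; you can pass the href Selector directly (optional).
3. You do not have to join URLs yourself.
4. The XPath only needs to go as far as the a tag. You can omit @href and pass the Selector of the a element instead of the Selector of its href (optional).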
```
def parse(self, response):
    href_list = response.xpath("//div[@class='card']/a/@href").extract()
    for href in href_list:
        yield response.follow(url=href, callback=self.parse_next)
```
Variant 1
1. Skip extract() and pass the href Selector itself.
```
def parse(self, response):
    href_list = response.xpath("//div[@class='card']/a/@href")
    for href in href_list:
        yield response.follow(url=href, callback=self.parse_next)
```
Variant 2
1. Skip extract() and pass a Selector instead of the href string.
2. Leave @href out of the XPath and pass the Selector of the a element directly.
```
def parse(self, response):
    href_list = response.xpath("//div[@class='card']/a")
    for href in href_list:
        yield response.follow(url=href, callback=self.parse_next)
```
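Method 4: Use the response's own follow_all

This approach uses the follow_all method of the Scrapy response (available since Scrapy 2.0) to create the requests for the next pages.

Pros
1. No extra import is needed in the spider file.
2. You do not have to call extract() to get the href strings; you can pass the href Selectors directly (optional).
3. You do not have to join URLs yourself.
4. The XPath only needs to go as far as the a tag. You can omit @href and pass the SelectorList of the a elements instead of the SelectorList of their href attributes (optional).
5. You do not have to write a loop; just pass in the SelectorList that matched the links.
Cons
1. It is a poor fit if you need extra logic for each link (for example, recording which page you are currently on).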
```
def parse(self, response):
    href_list = response.xpath("//div[@class='card']/a")
    yield from response.follow_all(urls=href_list, callback=self.parse_next)
```
Variant

Note: brace yourself.

One line of code does it all.
```
def parse(self, response):
    yield from response.follow_all(xpath="//div[@class='card']/a", callback=self.parse_next)
```
You are welcome to visit my personal blog: https://sitoi.cn

Original article: https://www.cnblogs.com/sitoi/p/13056755.html
