- Deep crawling: the data to be scraped is not all on one page, e.g. a list (index) page plus its detail pages.
- In Scrapy, when one item's fields come from different pages, we cannot persist the item as a whole without passing data between requests (request meta passing).
- Implementation:
    - scrapy.Request(url, callback, meta)
    - meta is a dict; whatever you put in it travels with the request and is handed to the callback
    - the callback reads it back through response.meta (see the minimal sketch after this list)
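A minimal, self-contained sketch of this pattern, using placeholder URLs and a plain dict as the item (the full 5i5j spider is in the collapsed block below):

```
import scrapy


class DemoSpider(scrapy.Spider):
    # hypothetical spider, only to illustrate passing data through meta
    name = 'demo'
    start_urls = ['https://example.com/list']  # placeholder URL

    def parse(self, response):
        # fill in the fields that are available on the list page
        item = {'title': 'title scraped from the list page'}
        # attach the partially-filled item to the request; Scrapy delivers it to the callback
        yield scrapy.Request(
            url='https://example.com/detail/1',  # placeholder detail URL
            callback=self.parse_detail,
            meta={'item': item},
        )

    def parse_detail(self, response):
        # retrieve the item from response.meta and finish filling it
        item = response.meta['item']
        item['desc'] = 'description scraped from the detail page'
        yield item
```

The same item object travels from parse to parse_detail, so fields collected on both pages end up in a single item that is yielded once.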
<details> <summary>Click to view the code</summary>

```
import scrapy

from moviePro.items import MovieproItem  # adjust the module path to your own project's items.py


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://bj.5i5j.com/xiaoqu/xichengqu/']

    def parse(self, response):
        # each li on the list page is one community entry
        li_lst = response.xpath('/html/body/div[6]/div[1]/div[2]/ul/li')
        for li in li_lst:
            title = li.xpath('./div[2]/h3/a/text()').extract_first()
            detail_url = 'https://bj.5i5j.com' + li.xpath('./div[1]/a/@href').extract_first()

            item = MovieproItem()
            item['title'] = title

            # request the detail page; meta carries the partially-filled item to the callback
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

    # used to parse the data on the detail page
    def parse_detail(self, response):
        # take the item back out of the meta that was passed along
        item = response.meta['item']
        desc_lst = response.xpath('/html/body/div[5]/div[3]/div[3]/div[1]/div/ul/li')
        item['desc'] = [li.xpath('./span/text()').extract_first() for li in desc_lst]
        yield item
```

</details>
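The spider imports MovieproItem from the project's items module, which is not shown here. A sketch of what that items.py might look like; the field names title and desc are inferred from the spider code, and the package name depends on what your Scrapy project is called:

```
import scrapy


class MovieproItem(scrapy.Item):
    # title is filled on the list page, desc on the detail page
    title = scrapy.Field()
    desc = scrapy.Field()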