1. Code
import re

import scrapy

from Fang.items import esf_FangItem


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.fang.com']
    start_urls = ['https://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        trs = response.xpath('//div[@id="c02"]//tr')
        province = None
        for tr in trs:
            # Strip all whitespace from the province cell (r"\s", not r"s");
            # an empty cell means the row belongs to the previous province.
            province_f = re.sub(r"\s", "", tr.xpath('./td[2]//text()').get() or "")
            if province_f:
                province = province_f
            cities = tr.xpath('./td[3]/a')
            for i in cities:
                city = i.xpath('./text()').get()
                city_url = i.xpath('./@href').get()
                yield scrapy.Request(url=city_url, callback=self.parse_url,
                                     meta={'info': (province, city)})

    def parse_url(self, response):
        print(2)
2. Problem description
When the project runs, parse_url is never executed, i.e. 2 is never printed.
The log output is as follows:
2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'bj.fang.com': <GET http://bj.fang.com/>
2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sh.fang.com': <GET http://sh.fang.com/>
2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'tj.fang.com': <GET http://tj.fang.com/>
2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'cq.fang.com': <GET http://cq.fang.com/>
3. Solution
Searching (via Baidu) shows that the follow-up requests are being filtered by the offsite spider middleware: the city pages live on subdomains such as bj.fang.com, which do not match allowed_domains = ['www.fang.com'].
Solutions:
Option 1:
Remove the restriction entirely (delete allowed_domains = ['www.fang.com']),
or change it to: allowed_domains = ['fang.com'], so that all subdomains (bj.fang.com, sh.fang.com, …) pass the offsite check.
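To see why 'fang.com' admits the city subdomains while 'www.fang.com' does not, here is a minimal sketch of the suffix-matching rule the offsite middleware applies. This is a simplified illustration, not Scrapy's actual implementation (Scrapy uses a compiled regex internally); the function name is my own.

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Return True if the request would be dropped: its host is neither an
    allowed domain nor a subdomain of one. No allowed_domains means allow all.
    Simplified sketch of Scrapy's OffsiteMiddleware check (hypothetical helper)."""
    if not allowed_domains:
        return False
    host = urlparse(url).netloc.lower()
    for domain in allowed_domains:
        domain = domain.lower()
        if host == domain or host.endswith('.' + domain):
            return False
    return True

# bj.fang.com is not a subdomain of www.fang.com, so it gets filtered:
print(is_offsite('http://bj.fang.com/', ['www.fang.com']))  # → True (filtered)
# but it is a subdomain of fang.com, so broadening the entry fixes it:
print(is_offsite('http://bj.fang.com/', ['fang.com']))      # → False (allowed)
```

This is why the log shows "Filtered offsite request to 'bj.fang.com'": subdomain matching works downward from the allowed entry, and 'www.fang.com' is itself already a leaf subdomain.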
Option 2:
Add dont_filter=True to the request (not recommended, since it also bypasses the duplicate-request filter):
yield scrapy.Request(url=city_url, callback=self.parse_url, meta={'info': (province, city)}, dont_filter=True)