一.Scrapy的日志等级
- 在使用scrapy crawl spiderFileName运行程序时,在终端里打印输出的就是scrapy的日志信息。
- 日志信息的种类:
ERROR : 一般错误 WARNING : 警告 INFO : 一般的信息 DEBUG : 调试信息
- 设置日志信息指定输出:
在settings.py配置文件中,加入
LOG_LEVEL = ‘指定日志信息种类’即可。
LOG_FILE = 'log.txt'则表示将日志信息写入到指定文件中进行存储。
二.请求传参
- 在某些情况下,我们爬取的数据不在同一个页面中,例如,我们爬取一个电影网站,电影的名称,评分在一级页面,而要爬取的其他电影详情在其二级子页面中。这时我们就需要用到请求传参。
处理post请求的参数:
创建项目:
代码:
import scrapy class PostSpider(scrapy.Spider): name = 'post' # allowed_domains = ['www.xxx.com'] start_urls = ['https://fanyi.baidu.com/sug'] def start_requests(self): data = { 'kw':'dog' } for url in self.start_urls: yield scrapy.FormRequest(url=url,formdata=data,callback=self.parse) def parse(self, response): print(response.text)
settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36' # Obey robots.txt rules ROBOTSTXT_OBEY = False
查看请求的数据:
案例二:
# -*- coding: utf-8 -*- import scrapy from moviePro.items import MovieproItem class MovieSpider(scrapy.Spider): name = 'movie' # allowed_domains = ['www.xxx.com'] start_urls = ['https://www.4567tv.tv/frim/index1.html'] #解析详情页中的数据 def parse_detail(self,response): #response.meta返回接收到的meta字典 item = response.meta['item'] actor = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[3]/a/text()').extract_first() item['actor'] = actor yield item def parse(self, response): li_list = response.xpath('//li[@class="col-md-6 col-sm-4 col-xs-3"]') for li in li_list: item = MovieproItem() name = li.xpath('./div/a/@title').extract_first() detail_url = 'https://www.4567tv.tv'+li.xpath('./div/a/@href').extract_first() item['name'] = name #meta参数:请求传参.meta字典就会传递给回调函数的response参数 yield scrapy.Request(url=detail_url,callback=self.parse_detail,meta={'item':item})
settings.py LOG_LEVEL = "ERROE" LOG_FILE = './log.txt' #输出日志
items.py
# -*- coding: utf-8 -*- # Define here the models for your scraped items # # See documentation in: # https://doc.scrapy.org/en/latest/topics/items.html import scrapy class MoveproItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() name = scrapy.Field() actor = scrapy.Field()