The target site is encoded in gb2312, so the scraped text came out garbled. I fixed this with a middleware, specifically a DownloaderMiddleware:
    from scrapy.http import HtmlResponse

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        # Rebuild the response so that its declared encoding is utf-8.
        response = HtmlResponse(url=response.url, body=response.body, encoding='utf-8')
        return response
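In other words, the page is converted to utf-8 at the download stage. The middleware also has to be activated, which is done by adding the following line to the settings.py configuration file: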
    DOWNLOADER_MIDDLEWARES = {'news.middlewares.NewsDownloaderMiddleware': 1000}
With that in place, the spider itself needs no extra transcoding, and Chinese text displays correctly.
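
If a page's body really does consist of gb2312 bytes, declaring the response as utf-8 may not be enough on its own, because the bytes themselves are unchanged. A minimal sketch of a variant that transcodes the body explicitly is shown here. The helper `transcode_to_utf8` and the class name `Gb2312ToUtf8Middleware` are hypothetical names introduced for illustration, and decoding with gb18030 (a superset of gb2312 and GBK) is an assumption about what the target site serves, not something taken from the original project.

```python
def transcode_to_utf8(body: bytes, source_encoding: str = "gb18030") -> bytes:
    """Decode bytes from the source encoding and re-encode them as UTF-8."""
    # errors="replace" keeps the crawl going if a page contains stray bytes.
    return body.decode(source_encoding, errors="replace").encode("utf-8")


class Gb2312ToUtf8Middleware:
    """Hypothetical downloader middleware; the name is illustrative only."""

    def process_response(self, request, response, spider):
        # Plain (binary) responses have no `encoding` attribute; pass them through.
        if not hasattr(response, "encoding"):
            return response
        utf8_body = transcode_to_utf8(response.body)
        # replace() returns a copy of the response with the given fields changed.
        return response.replace(body=utf8_body, encoding="utf-8")
```

It would be registered in DOWNLOADER_MIDDLEWARES in the same way as the middleware above. Whether this variant or the simpler one is appropriate depends on what bytes the site actually returns, so it is worth checking `response.body` on a sample page first.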