By default Scrapy only supports HTTP and HTTPS downloads; it cannot fetch files over FTP. Real project requirements vary endlessly, though: HTTP and HTTPS may well cover 99% of cases, but what do you do when you hit the 1% that still has to be handled?
Fortunately Scrapy is pluggable here: write your own download handler and you're done.
First, the handler I wrote:
# -*- coding: utf-8 -*-
# file: ftp.py, importable as 'src.middleware.ftp.FtpDownloadHandler'
__author__ = 'C.L.TANG'

import urllib2

from scrapy.http import Response


class FtpDownloadHandler(object):

    def download_request(self, request, spider):
        """Blocking FTP download via urllib2; returns a Response
        (Scrapy accepts a plain Response here as well as a deferred)."""
        # urllib2's FTPHandler does the actual FTP work
        handler = urllib2.FTPHandler()
        req = urllib2.Request(url=request.url)
        opener = urllib2.build_opener(handler)
        f = opener.open(req)
        b = f.read()
        print len(b)  # debug: size of the downloaded body
        respcls = Response(url=request.url, body=b, request=request)
        return respcls
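One caveat: the urllib2 call above blocks until the whole file has transferred, which stalls Scrapy's Twisted reactor on large downloads. If that matters, a minimal sketch of a non-blocking variant (same class, same registration, my own refactoring rather than anything Scrapy ships) pushes the blocking I/O onto Twisted's thread pool:

# -*- coding: utf-8 -*-
# Sketch: same handler, but the blocking urllib2 call runs in a
# worker thread so the Twisted reactor is not stalled meanwhile.
import urllib2

from twisted.internet import threads
from scrapy.http import Response


class FtpDownloadHandler(object):

    def download_request(self, request, spider):
        # deferToThread returns a deferred that fires with the
        # return value of _download once the thread finishes
        return threads.deferToThread(self._download, request)

    def _download(self, request):
        opener = urllib2.build_opener(urllib2.FTPHandler())
        body = opener.open(urllib2.Request(url=request.url)).read()
        return Response(url=request.url, body=body, request=request)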
Then register it in your project's settings.py:
DOWNLOAD_HANDLERS = {'ftp': 'src.middleware.ftp.FtpDownloadHandler'}
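The dict key is the URL scheme: Scrapy merges DOWNLOAD_HANDLERS into its built-in handler map (which already covers http and https) and dispatches each request by the scheme of its URL. Roughly, a sketch of the idea, not Scrapy's actual code:

from urlparse import urlparse  # Python 2 stdlib

DOWNLOAD_HANDLERS = {'ftp': 'src.middleware.ftp.FtpDownloadHandler'}

def pick_handler(handlers, url):
    # 'ftp://b9:b9@ftp.958shop.com/...' has scheme 'ftp', so a
    # request for it is routed to FtpDownloadHandler
    return handlers[urlparse(url).scheme]

print pick_handler(DOWNLOAD_HANDLERS,
                   'ftp://b9:b9@ftp.958shop.com/2011/11/15/52076863815926.jar')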
And in the spider class:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request


class ShopSpider(CrawlSpider):
    name = '958shop'
    allowed_domains = ['958shop.com']

    def start_requests(self):
        request = Request(url='ftp://b9:b9@ftp.958shop.com/2011/11/15/52076863815926.jar')
        request.callback = self.down_debug_html
        return [request]

    def down_debug_html(self, response):
        # this is where you would call whatever records the downloaded link
        # file_name = response.meta['file_name']
        print response.url
        filename = 'debug.html'
        open(filename, 'wb').write(response.body)
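The commented-out response.meta['file_name'] line hints at the standard way to carry data (such as a target file name) into the callback: stash it in the request's meta dict, which travels with the request and reappears on the response. A small sketch; the 'file_name' key and the save_file name are my own, not anything Scrapy defines:

    def start_requests(self):
        url = 'ftp://b9:b9@ftp.958shop.com/2011/11/15/52076863815926.jar'
        # meta set on the request is available on the response
        yield Request(url=url, callback=self.save_file,
                      meta={'file_name': '52076863815926.jar'})

    def save_file(self, response):
        file_name = response.meta['file_name']
        open(file_name, 'wb').write(response.body)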
Run the spider and debug.html ends up holding the binary content of the downloaded file.
The actual run on my machine logged:
2012-01-12 15:04:09+0800 [958shop] DEBUG: Crawled (200) <GET ftp://b9:b9@ftp.958shop.com/2011/11/15/52076863815926.jar> (referer: None)
ftp://b9:b9@ftp.958shop.com/2011/11/15/52076863815926.jar
2012-01-12 15:04:09+0800 [958shop] INFO: Closing spider (finished)
2012-01-12 15:04:09+0800 [958shop] INFO: Dumping spider stats:
The "Crawled (200)" line means the FTP download succeeded.
There is one hidden gotcha here: when the response delivered to the callback is a binary file like this, you cannot run extraction on it; instantiating a selector on such a response raises an error.
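The reason is that the handler builds a plain scrapy.http.Response, which carries raw bytes and has no text/encoding interface (no body_as_unicode()), and that interface is what the selector classes need. If the FTP content actually is HTML, one workaround, sketched under that assumption, is to wrap the body in an HtmlResponse before selecting:

from scrapy.http import Response, HtmlResponse
from scrapy.selector import HtmlXPathSelector

def make_response(url, body):
    # crude sniff (assumption): treat bodies that look like markup as HTML
    if body.lstrip().startswith('<'):
        # HtmlResponse is a TextResponse subclass: it has an encoding
        # and body_as_unicode(), which selectors require
        return HtmlResponse(url=url, body=body)
    # binary payloads stay a plain Response: write response.body to
    # disk, but never feed them to a selector
    return Response(url=url, body=body)

hxs = HtmlXPathSelector(make_response('ftp://example.com/page.html',
                                      '<html><body>hi</body></html>'))
print hxs.select('//body/text()').extract()  # [u'hi']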