  • scrapy: saving images to a custom path and storing that path in the database

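    Scrapy ships with a built-in pipeline, ImagesPipeline, dedicated to saving images to local storage.

    However, the default file layout (full/<SHA-1 of the URL>.jpg under the storage directory) cannot be changed through settings, so we need to write our own pipeline to store the images.

    First, let's analyze the requirements:

    1. Change the image path so that it varies with the data in the scraped item;

    2. Replace the image URL saved in the database with our local file path.

    To begin, we inherit from the original pipeline (ImagesPipeline is importable from scrapy.pipelines.images):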
    class DownloadImagesPipeline(ImagesPipeline):
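
    For Scrapy to use the subclass, it also has to be enabled in the project settings. A minimal sketch, assuming a hypothetical module path `myproject.pipelines` and a storage directory named `images`; neither name comes from the original post:

```python
# settings.py (sketch; 'myproject' and 'images' are placeholder names)

# Root directory for downloaded images; paths returned by file_path()
# are interpreted relative to this directory.
IMAGES_STORE = 'images'

# Enable the custom pipeline; pipelines with lower numbers run earlier.
ITEM_PIPELINES = {
    'myproject.pipelines.DownloadImagesPipeline': 1,
}
```

    ImagesPipeline also needs the Pillow library installed in order to process images.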

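    Next, we can look at the source code to see which parts need changing:

    First is the file_path method, which returns the path where the image is stored: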

        def file_path(self, request, response=None, info=None):
            ## start of deprecation warning block (can be removed in the future)
            def _warn():
                from scrapy.exceptions import ScrapyDeprecationWarning
                import warnings
                warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                              'please use file_path(request, response=None, info=None) instead',
                              category=ScrapyDeprecationWarning, stacklevel=1)
     
            # check if called from image_key or file_key with url as first argument
            if not isinstance(request, Request):
                _warn()
                url = request
            else:
                url = request.url
     
            # detect if file_key() or image_key() methods have been overridden
            if not hasattr(self.file_key, '_base'):
                _warn()
                return self.file_key(url)
            elif not hasattr(self.image_key, '_base'):
                _warn()
                return self.image_key(url)
            ## end of deprecation warning block
     
            image_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
            return 'full/%s.jpg' % (image_guid)
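
    Stripped of the deprecation handling, the default naming rule is just a SHA-1 hash of the URL placed under `full/`. The sketch below reproduces that rule with the standard library only; `default_image_path` is a hypothetical helper name, and `url.encode('utf-8')` stands in for Scrapy's `to_bytes(url)`:

```python
import hashlib


def default_image_path(url):
    """Mimic the default naming: 'full/' + SHA-1 hex digest of the URL + '.jpg'."""
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid


path = default_image_path('https://example.com/a.jpg')
print(path)  # a path of the form full/<40 hex characters>.jpg
```

    Because the name depends only on the URL, every image ends up in the same `full/` folder, which is what the override later in this post changes.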
    

      

    Next is the item_completed method, which returns the item.

        def item_completed(self, results, item, info):
            if isinstance(item, dict) or self.images_result_field in item.fields:
                item[self.images_result_field] = [x for ok, x in results if ok]
            return item
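
    To see what the list comprehension in the default item_completed does, here is a small stand-alone sketch. The sample `results` values are made up, and a plain `Exception` stands in for the failure object Scrapy would pass for an unsuccessful download:

```python
# Sample results in the shape passed to item_completed:
# a list of (success, info_or_failure) tuples. All values are illustrative.
results = [
    (True, {'url': 'https://example.com/a.jpg', 'path': 'full/aaa.jpg', 'checksum': 'c1'}),
    (False, Exception('download failed')),
    (True, {'url': 'https://example.com/b.jpg', 'path': 'full/bbb.jpg', 'checksum': 'c2'}),
]

# Same filter as the default implementation: keep only the info dicts
# of downloads whose success flag is True.
downloaded = [x for ok, x in results if ok]
print([d['path'] for d in downloaded])  # ['full/aaa.jpg', 'full/bbb.jpg']
```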
    

      

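    Last is the method that builds the download requests, get_media_requests; we need to pass data from the item through it so that it can be used to name the folder: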

        def get_media_requests(self, item, info):
            return [Request(x) for x in item.get(self.images_urls_field, [])]
    

      

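    OK, now let's override these three methods:

    First, override get_media_requests so that it passes along the folder name. A check is added here to avoid errors, and return is changed to yield (return would also work). The main purpose of this step is to validate fetch_date and hand it to the request through meta: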

        def get_media_requests(self, item, info):
            if isinstance(item, LiveItem) and item.get('image') and item.get('fetch_date'):
                yield Request(item['image'].replace('\\', '/'), meta={'fetch_date': item.get('fetch_date')})
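
    The guard and the URL clean-up can be tried without Scrapy. In the sketch below, `build_request_args` is a hypothetical helper that returns what would be passed to `Request`, and a plain dict stands in for the `LiveItem` class from the author's project:

```python
def build_request_args(item):
    """Return (url, meta) for an item, or None when a required field is missing."""
    image = item.get('image')
    fetch_date = item.get('fetch_date')
    if not (image and fetch_date):
        return None  # nothing to download, so no request would be produced
    # '\\' is one backslash character; a bare '\' inside quotes is a syntax error.
    url = image.replace('\\', '/')
    return url, {'fetch_date': fetch_date}


ok = build_request_args({'image': 'https:\\\\example.com\\a.jpg',
                         'fetch_date': '2018-09-18 17:11:58'})
skipped = build_request_args({'image': 'https://example.com/a.jpg'})
print(ok)       # ('https://example.com/a.jpg', {'fetch_date': '2018-09-18 17:11:58'})
print(skipped)  # None
```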
    

      

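    Then comes file_path: we only need to copy the source over and change the storage path (the snippet relies on hashlib, time, Request and to_bytes being imported):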

        def file_path(self, request, response=None, info=None):
            ## start of deprecation warning block (can be removed in the future)
            def _warn():
                from scrapy.exceptions import ScrapyDeprecationWarning
                import warnings
                warnings.warn('ImagesPipeline.image_key(url) and file_key(url) methods are deprecated, '
                              'please use file_path(request, response=None, info=None) instead',
                              category=ScrapyDeprecationWarning, stacklevel=1)
     
            # check if called from image_key or file_key with url as first argument
            if not isinstance(request, Request):
                _warn()
                url = request
            else:
                url = request.url
     
            # detect if file_key() or image_key() methods have been overridden
            if not hasattr(self.file_key, '_base'):
                _warn()
                return self.file_key(url)
            elif not hasattr(self.image_key, '_base'):
                _warn()
                return self.image_key(url)
            ## end of deprecation warning block
     
            # File name: SHA-1 hash of the URL; folder name: fetch_date as a Unix timestamp
            image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
            folder = int(time.mktime(time.strptime(request.meta['fetch_date'], "%Y-%m-%d %H:%M:%S")))
            return '%s/%s.jpg' % (folder, image_guid)
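
    The path calculation at the end of file_path can be checked in isolation. A minimal sketch using only the standard library; `dated_image_path` is a hypothetical helper name, and `url.encode('utf-8')` again stands in for `to_bytes`:

```python
import hashlib
import time


def dated_image_path(url, fetch_date):
    """Build '<unix timestamp of fetch_date>/<sha1 of url>.jpg'."""
    parsed = time.strptime(fetch_date, '%Y-%m-%d %H:%M:%S')
    # mktime interprets the parsed time in the machine's local time zone,
    # so the folder name differs between machines in different zones.
    folder = int(time.mktime(parsed))
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return '%s/%s.jpg' % (folder, image_guid)


print(dated_image_path('https://example.com/a.jpg', '2018-09-18 17:11:58'))
```

    A fetch_date that does not match the format makes strptime raise ValueError, so the value should already be in this format by the time the request is created.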
    

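      Once the image has been downloaded, a list of (success, info) tuples, i.e. results, is passed to the item_completed method. It contains some information about the image, which we can print to inspect: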

    [(True, {'url': 'https://rpic.douyucdn.cn/asrpic/180918/5070841_1710.jpg/dy1', 'path': '1537261918/7ccaf3dbc7aef44c597cbd1ec4f01ca2fe1995c5.jpg', 'checksum': '92eeb26633a9631ba457f4f524b2d8c2'})]
    

      So here we can simply replace the URL in the item with the value of path:

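        def item_completed(self, results, item, info):
            # Collect the local path of every successfully downloaded image
            # (the loop variable is named result so it does not shadow the info argument)
            image_paths = [result.get('path') for success, result in results if success and result]
            if not image_paths:
                return item
            if isinstance(item, LiveItem):
                # Replace the remote URL with the local relative path
                item['image'] = u''.join(image_paths)
            return item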
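
    The post stops before the database step itself. As a rough sketch of how a later pipeline might persist the rewritten path, the example below uses an in-memory SQLite database. The class name `SaveToDatabasePipeline`, the table `live` and its columns are all assumptions made for illustration, not the author's schema:

```python
import sqlite3


class SaveToDatabasePipeline:
    """Hypothetical pipeline that stores the local image path of each item."""

    def open_spider(self, spider):
        # In-memory database for illustration; a real project would use its own DB.
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute('CREATE TABLE live (image TEXT, fetch_date TEXT)')

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # By this point item['image'] holds the local relative path, not the URL.
        self.conn.execute(
            'INSERT INTO live (image, fetch_date) VALUES (?, ?)',
            (item.get('image'), item.get('fetch_date')),
        )
        self.conn.commit()
        return item


# Usage with a plain dict standing in for the item; the values are illustrative.
pipeline = SaveToDatabasePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({'image': '1537261918/7ccaf3dbc7aef44c597cbd1ec4f01ca2fe1995c5.jpg',
                       'fetch_date': '2018-09-18 17:11:58'}, spider=None)
rows = pipeline.conn.execute('SELECT image FROM live').fetchall()
print(rows)  # [('1537261918/7ccaf3dbc7aef44c597cbd1ec4f01ca2fe1995c5.jpg',)]
```

    For this to work, the database pipeline needs a higher number in ITEM_PIPELINES than the image pipeline, so that it runs after item_completed has rewritten item['image'].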
    

      

  • Original article: https://www.cnblogs.com/pythonClub/p/9858872.html