  • Exporting to various formats with Scrapy pipelines

    When using pipelines in Scrapy, we often export to CSV, JSON, JSON Lines and other formats. Writing a separate class for every export is tedious.

    Here I have put together a single pipeline file that supports multiple formats.

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
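    import logging

    from scrapy import signals
    from scrapy.exporters import (
        CsvItemExporter,
        JsonItemExporter,
        JsonLinesItemExporter,
        MarshalItemExporter,
        PickleItemExporter,
        PprintItemExporter,
        XmlItemExporter,
    )

    logger = logging.getLogger(__name__)


    class BaseExportPipeLine(object):
        """Shared logic: open a file, feed every item to one exporter, close the file."""

        def __init__(self, **kwargs):
            self.files = {}
            # "exporter" is the exporter class and "dst" the output path;
            # everything left over is passed to the exporter's constructor.
            self.exporter_cls = kwargs.pop("exporter", None)
            self.dst = kwargs.pop("dst", None)
            self.option = kwargs
            self.exporter = None

        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline

        def spider_opened(self, spider):
            # Exporters write bytes, so the file is opened in binary mode.
            file = open(self.dst, 'wb')
            self.files[spider] = file
            self.exporter = self.exporter_cls(file, **self.option)
            self.exporter.start_exporting()

        def spider_closed(self, spider):
            self.exporter.finish_exporting()
            file = self.files.pop(spider)
            file.close()

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item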
    
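    # Options that can be put in the `option` dict of the pipelines below:
    # 'fields_to_export': ["url", "edit_url", "title"]  export only these fields; supported by every pipeline below
    # 'export_empty_fields': False  whether to export empty fields; supported by every pipeline below
    # 'encoding': 'utf-8'  output encoding; supported by every pipeline below
    # 'indent': 1  indentation width; mainly useful for JsonExportPipeline (XmlItemExporter honors it too)
    # "item_element": "item"  name of each item's XML element, XmlExportPipeline only; produces <item></item>
    # "root_element": "items"  name of the XML root element, XmlExportPipeline only; produces <items>...many item elements...</items>
    # "include_headers_line": True  whether to write the header row of field names, CsvExportPipeline only
    # "join_multivalued": ","  string used to join the values of a multi-valued field inside one cell, CsvExportPipeline only
    #     (this is not the column separator; for that, pass csv.writer's "delimiter" option instead)
    # 'protocol': 2  pickle protocol to use, PickleExportPipeline only
    # "dst": "items.json"  path of the output file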
    class JsonExportPipeline(BaseExportPipeLine):
        def __init__(self):
            option = {"exporter": JsonItemExporter, "dst": "items.json", "encoding": "utf-8", "indent": 4}
            super(JsonExportPipeline, self).__init__(**option)

    class JsonLinesExportPipeline(BaseExportPipeLine):
        def __init__(self):
            option = {"exporter": JsonLinesItemExporter, "dst": "items.jl", "encoding": "utf-8"}
            super(JsonLinesExportPipeline, self).__init__(**option)

    class XmlExportPipeline(BaseExportPipeLine):
        def __init__(self):
            option = {"exporter": XmlItemExporter, "dst": "items.xml", "item_element": "item", "root_element": "items", "encoding": "utf-8"}
            super(XmlExportPipeline, self).__init__(**option)

    class CsvExportPipeline(BaseExportPipeLine):
        def __init__(self):
            # In my tests, join_multivalued did not change the column separator.
            # It only joins the values of a multi-valued field inside one cell.
            option = {"exporter": CsvItemExporter, "dst": "items.csv", "encoding": "utf-8", "include_headers_line": True, "join_multivalued": ","}
            super(CsvExportPipeline, self).__init__(**option)

    class PickleExportPipeline(BaseExportPipeLine):
        def __init__(self):
            option = {"exporter": PickleItemExporter, "dst": "items.pickle", "protocol": 2}
            super(PickleExportPipeline, self).__init__(**option)

    class MarshalExportPipeline(BaseExportPipeLine):
        def __init__(self):
            option = {"exporter": MarshalItemExporter, "dst": "items.marshal"}
            super(MarshalExportPipeline, self).__init__(**option)

    class PprintExportPipeline(BaseExportPipeLine):
        def __init__(self):
            option = {"exporter": PprintItemExporter, "dst": "items.pprint.jl"}
            super(PprintExportPipeline, self).__init__(**option)
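
The separator test noted in CsvExportPipeline most likely failed because join_multivalued is not the column separator. As far as I know, CsvItemExporter uses it to join the values of a multi-valued field, and forwards leftover keyword arguments such as delimiter to csv.writer. The sketch below does not use Scrapy at all; it imitates both settings with the standard csv module to show what each one controls. The helper write_rows and the sample rows are hypothetical.

```python
import csv
import io


def write_rows(rows, delimiter=",", join_multivalued=","):
    """Write rows to CSV text, joining list-valued cells into one string first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        cells = [
            join_multivalued.join(cell) if isinstance(cell, (list, tuple)) else cell
            for cell in row
        ]
        writer.writerow(cells)
    return buffer.getvalue()


# Hypothetical sample data: a header row, then one item whose "tags" field has two values.
rows = [["title", "tags"], ["Scrapy tips", ["python", "scrapy"]]]

# join_multivalued only changes how the two tags are glued together inside their cell.
print(write_rows(rows, join_multivalued="|"))

# delimiter changes the separator between columns.
print(write_rows(rows, delimiter=";", join_multivalued="|"))
```

If the installed Scrapy version does forward extra keyword arguments to csv.writer, adding "delimiter": ";" to the option dict of CsvExportPipeline should produce semicolon-separated columns.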

    With the classes above defined, we can choose which export class to enable in settings.py.

    ITEM_PIPELINES = {
        'ScrapyCnblogs.pipelines.PprintExportPipeline': 300,
        #'ScrapyCnblogs.pipelines.JsonLinesExportPipeline': 302,
        #'ScrapyCnblogs.pipelines.JsonExportPipeline': 303,
        #'ScrapyCnblogs.pipelines.XmlExportPipeline': 304,
    }
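
After a crawl has run with one of these pipelines enabled, the output file can be loaded back for a quick check. This is a minimal sketch, assuming JsonLinesExportPipeline wrote one JSON object per line and PickleExportPipeline wrote one pickle per item, back to back (which is my understanding of how the two exporters behave). The names read_json_lines and read_pickles and the sample items are hypothetical.

```python
import io
import json
import pickle


def read_json_lines(stream):
    """Yield one object per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_pickles(stream):
    """Yield objects from a stream of back-to-back pickles until it is exhausted."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            return


# Hypothetical items, written here in the layout the two exporters are assumed to use,
# so the sketch runs without a crawl. In practice, open items.jl or items.pickle in "rb" mode.
items = [
    {"title": "first", "url": "http://example.com/1"},
    {"title": "second", "url": "http://example.com/2"},
]

jl_file = io.BytesIO(b"".join(json.dumps(i).encode("utf-8") + b"\n" for i in items))

pickle_file = io.BytesIO()
for item in items:
    pickle.dump(item, pickle_file, protocol=2)
pickle_file.seek(0)

print(list(read_json_lines(jl_file)))
print(list(read_pickles(pickle_file)))
```

A single pickle.load call would return only the first item of such a file, which is why the reader loops until EOFError.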

    Pretty powerful, isn't it? If you are interested, you can look up the source code for this part on GitHub: https://github.com/scrapy/scrapy/blob/master/scrapy/exporters.py

    The tests for the exporters are here: https://github.com/scrapy/scrapy/blob/master/tests/test_exporters.py . If you are interested, their source code is worth reading.

    For a complete usage example, see one of my GitHub projects: https://github.com/zhaojiedi1992/ScrapyCnblogs

  • Original article: https://www.cnblogs.com/zhaojiedi1992/p/zhaojiedi_python_005_scrapy.html