zoukankan      html  css  js  c++  java
  • scrapy之Pipeline

    官方文档:https://docs.scrapy.org/en/latest/topics/item-pipeline.html

      激活pipeline,需要在settings里配置,然而这里配置的pipeline会作用于所有的spider。加入项目中有很多spider在运行。item pipeline的处理就会很麻烦,你可以通过process_item(self,item,spider)中的spider参数来判断是来自哪个爬虫,但是这种方法很冗余。更好的做法是配置spider类中的custom_settings属性。为每一个spider配置不同的pipeline。示例如下:

      同时,这里你也会看到custom_settings的用法和用处。

    class XiaohuaSpider(scrapy.Spider):
        name = 'xiaohua'
        custom_settings = {
            'ITEM_PIPELINES ':{
                'TB.pipelines.TBMongoPipeline':300,
            }
        }
    

    一  method

      1 process_item(self,item,spider)

      This method is called for every item pipeline component

      2 open_spider(self,spider)

      This method is called when the spider is opened.

      3 close_spider(self,spider)

      4 from_crawler(cls,crawler)

      It must return a new instance of the pipeline

    二 Item Pipeline example

      1 write items to mongodb

    import pymongo
    
    class MongoPipeline(object):
    
        collection_name = 'scrapy_items'
    
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get('MONGO_URI'),
                mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
            )
    
        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]
    
        def close_spider(self, spider):
            self.client.close()
    
        def process_item(self, item, spider):
            self.db[self.collection_name].insert_one(dict(item))
            return item

      2 duplicates filter

    from scrapy.exceptions import DropItem
    
    class DuplicatesPipeline(object):
    
        def __init__(self):
            self.ids_seen = set()
    
        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem("Duplicate item found: %s" % item)
            else:
                self.ids_seen.add(item['id'])
                return item

      

  • 相关阅读:
    idea 中使用 svn
    [剑指offer] 40. 数组中只出现一次的数字
    [剑指offer] 39. 平衡二叉树
    [剑指offer] 38. 二叉树的深度
    [剑指offer] 37. 数字在排序数组中出现的次数
    [剑指offer] 36. 两个链表的第一个公共结点
    [剑指offer] 35. 数组中的逆序对
    vscode在win10 / linux下的.vscode文件夹的配置 (c++/c)
    [剑指offer] 34. 第一个只出现一次的字符
    [剑指offer] 33. 丑数
  • 原文地址:https://www.cnblogs.com/654321cc/p/8877079.html
Copyright © 2011-2022 走看看