  • How Scrapy decides whether to keep issuing requests or start saving the scraped data

    I. Distinguishing the two

    Scrapy decides based on whether the object produced by yield is a Request object or an Item object.
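    A minimal sketch of the idea (the spider name, URL and selectors below are illustrative, not from the original post): yielding a Request schedules another crawl, while yielding an item or dict sends the data to the item pipelines.

    import scrapy

    class ExampleSpider(scrapy.Spider):  # hypothetical spider, for illustration only
        name = 'example'
        start_urls = ['https://example.com/list']

        def parse(self, response):
            next_url = response.css('.next a::attr(href)').extract_first()
            if next_url:
                # yielding a Request: Scrapy keeps crawling and calls the callback with the new response
                yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)
            else:
                # yielding a dict/item: Scrapy stops requesting here and hands the data to the pipelines
                yield {'title': response.css('title::text').extract_first()}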

    II. Items

    1. Define the item class

    Define the class in the items.py file:

    class MyscrapyItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        price = scrapy.Field()
        prostatus = scrapy.Field()
    

    2. Import the class in the spider and write the corresponding callback

    from myscrapy.items import MyscrapyItem
    def get_info(self,response):
        elements_list = response.css('.product')
        for element in elements_list:
            title = element.css('.productTitle a::attr(title)').extract_first()  # CSS selector
            price = element.css('.productPrice em::attr(title)').extract_first()
            prostatus = element.css('.productStatus em::text').extract_first()
            item = MyscrapyItem()  # instantiate an item object
            item['title'] = title  # fill in the fields defined in items.py
            item['price'] = price
            item['prostatus'] = prostatus
            yield item
    

    III. After an item is yielded, Scrapy automatically runs the code in pipelines.py

    1. Register the pipeline in the settings file

    ITEM_PIPELINES = {
       'myscrapy.pipelines.MyscrapyPipeline': 300,   # lower number = higher priority (runs earlier)
       # 'myscrapy.pipelines.MyscrapyPipeline1': 500,
    }
    # same idea as registering middleware
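    As a rough illustration of how the priorities chain (the original post leaves MyscrapyPipeline1 commented out, so this class is only a sketch): a pipeline with a larger number runs later, and it only receives the item if the earlier pipeline returned it.

    class MyscrapyPipeline1(object):
        # runs after MyscrapyPipeline because 500 > 300
        def process_item(self, item, spider):
            print(item['prostatus'])  # e.g. log one field, then pass the item along
            return item  # returning the item hands it to the next pipeline (if any)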
    

    2. Implement the MyscrapyPipeline methods

    # Two of the pipeline methods are used very often:
    # def open_spider(self, spider): runs when the spider starts; typically used to open a database connection
    # def close_spider(self, spider): runs when the spider finishes; typically used to close the database connection
    
    # A simple MongoDB example
    from pymongo import MongoClient
    
    class MyscrapyPipeline(object):
    
        def __init__(self,HOST,PORT,USER,PWD,DB,TABLE):
            self.HOST = HOST
            self.PORT = PORT
            self.USER = USER
            self.PWD = PWD
            self.DB = DB
            self.TABLE = TABLE
        # runs before __init__; Scrapy calls this classmethod to create the pipeline instance
        @classmethod
        def from_crawler(cls,crawler):
            HOST = crawler.settings.get('HOST')  # crawler.settings gives access to every value defined in the settings file
            PORT = crawler.settings.get('PORT')
            USER = crawler.settings.get('USER')
            PWD = crawler.settings.get('PWD')
            DB = crawler.settings.get('DB')
            TABLE = crawler.settings.get('TABLE')
            return cls(HOST,PORT,USER,PWD,DB,TABLE)
    
    
        def open_spider(self,spider):
            self.client = MongoClient(host=self.HOST,port=self.PORT,username=self.USER,password=self.PWD)
            print('Connected to MongoDB')
    
        def close_spider(self,spider):
            self.client.close()
            print('Closed the database connection')
    
    
        def process_item(self, item, spider):
            self.client[self.DB][self.TABLE].insert_one(dict(item))
            return item
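
    The from_crawler method above expects the connection details to be defined in settings.py. A sketch of those entries, with placeholder values (not taken from the original post):

    # settings.py -- placeholder values, adjust to your own MongoDB instance
    HOST = '127.0.0.1'
    PORT = 27017
    USER = 'admin'
    PWD = '123456'
    DB = 'scrapy_db'
    TABLE = 'products'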
    
  • Original post: https://www.cnblogs.com/pythonywy/p/11728456.html