  • Scrapy Persistent Storage - Escaping Scraped Data

    Scrapy Persistent Storage

    The escaping problem when storing scraped data

    Using the parameterized form below, the database driver escapes the values for us automatically:

    'insert into wen values(%s,%s)',(item['title'],item['content'])
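    A minimal sketch of what that buys you, reusing the connection details and the wen table that appear in the MySQL pipeline further down (the database and table themselves are assumptions outside this snippet):

    import pymysql

    conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                           password='zx125', db='zx', charset='utf8')
    cursor = conn.cursor()

    item = {'title': 'it\'s a "quoted" title', 'content': 'one line\nanother line'}

    # unsafe: quotes inside the values break the statement and invite SQL injection
    # cursor.execute("insert into wen values('%s','%s')" % (item['title'], item['content']))

    # safe: pymysql escapes each parameter before substituting it into the query
    cursor.execute('insert into wen values(%s,%s)', (item['title'], item['content']))
    conn.commit()

    cursor.close()
    conn.close()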

    Terminal-based export commands:

    This approach can only store the return value of the parse method in a local text file; the supported formats are json, jsonlines, jl, csv, xml, marshal and pickle.

    Export command

    scrapy crawl name -o xxx.csv

    Advantages: simple, efficient and convenient

    Drawbacks: fairly limited (it can only write to local files, not to a database)

    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class DuanziSpider(scrapy.Spider):
        name = 'duanzi'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['http://duanziwang.com/']
    
        def parse(self, response):
            div_list=response.xpath('//main/article')
            data=[]
            for i in div_list:
                title=i.xpath('.//h1/a/text()').extract_first()
                # xpath() returns a list of Selector objects; call extract() to pull out the data, or extract_first() when the list has a single element
                content=i.xpath('./div[@class="post-content"]/p/text()').extract_first()
                da={
                    'title':title,
                    'content':content
                }
                data.append(da)
            return data
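    With this spider, a typical export run looks like the following; the output format is inferred from the file extension, and duanzi is the spider name defined above:

    scrapy crawl duanzi -o duanzi.csv
    scrapy crawl duanzi -o duanzi.json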
    
    

    Pipeline-based persistent storage

    Coding workflow

    1. Data parsing

    # -*- coding: utf-8 -*-
    import scrapy
    from zx_spider.items import ZxSpiderItem
    
    
    class Duanzi2Spider(scrapy.Spider):
        name = 'duanzi2'
        start_urls = ['https://ishuo.cn']
    
        def parse(self, response):
            data_list=response.xpath('//div[@id="list"]/ul/li')
    
            for i in data_list:
                title=i.xpath('./div[2]/a/text()').extract_first()
                content=i.xpath('./div[1]/text()').extract_first()
                print(title)
                print(content)
                # create an item object and fill in the parsed values
                item=ZxSpiderItem()
                item['title']=title
                item['content']=content
    
                # yield the item to the pipeline
                yield item
    
    

    2. Wrap the parsed data in an item object (declare the corresponding fields in the Item class)

    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class ZxSpiderItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        content = scrapy.Field()
        # pass
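    A side note on the Item class: instances support dict-style access, but only for fields declared with scrapy.Field(); assigning to an undeclared key raises a KeyError. A rough illustration:

    item = ZxSpiderItem()
    item['title'] = 'some title'   # fine: 'title' is a declared field
    # item['author'] = 'x'         # would raise KeyError (field not declared)
    print(dict(item))              # {'title': 'some title'}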
    

    3. Yield the item object to the pipeline; in the pipeline class's process_item method, persist the data carried by each item it receives

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class ZxSpiderPipeline(object):
        fw = None

        # this method is called only once, when the spider is opened
        def open_spider(self, spider):
            print("start writing scraped data")
            # note: the ./zx directory must already exist, open() will not create it
            self.fw = open('./zx/duanzi2.csv', "w", encoding='utf8')

        # this method receives every item yielded by the spider
        def process_item(self, item, spider):
            title = item['title']
            content = item['content']

            self.fw.write(title + "\n" + content + "\n")
            return item

        def close_spider(self, spider):
            print("finished writing scraped data")
            self.fw.close()
    
    

    4. Enable the pipeline in the settings file

    ITEM_PIPELINES = {
       'zx_spider.pipelines.ZxSpiderPipeline': 300,
        # 300 is the priority; the smaller the number, the higher the priority
    }
    

    Storing the scraped data in multiple destinations (file and MySQL)

    The return statement in ZxSpiderPipeline is not useless: it hands the item on to the pipeline with the next priority for further processing (provided that pipeline is registered in settings).

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    import pymysql
    
    class ZxSpiderPipeline(object):
        fw = None

        # this method is called only once, when the spider is opened
        def open_spider(self, spider):
            print("start writing scraped data")
            self.fw = open('./zx/duanzi2.csv', "w", encoding='utf8')

        # this method receives every item yielded by the spider
        def process_item(self, item, spider):
            title = item['title']
            content = item['content']

            self.fw.write(title + "\n" + content + "\n")
            return item

        def close_spider(self, spider):
            print("finished writing scraped data")
            self.fw.close()


    class MysqlSpiderPipeline(object):
        conn = None
        cursor = None

        def open_spider(self, spider):
            print("start writing to the database")
            self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user="root",
                                        password='zx125', db="zx", charset='utf8')

        def process_item(self, item, spider):
            self.cursor = self.conn.cursor()
            try:
                self.cursor.execute('insert into wen values(%s,%s)', (item['title'], item['content']))
                self.conn.commit()
            except Exception as e:
                print(e)
                self.conn.rollback()
            return item

        def close_spider(self, spider):
            print("finished writing to the database")
            self.cursor.close()
            self.conn.close()
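    The insert above assumes a table named wen with two string columns already exists in the zx database; the post never shows its schema, so the snippet below is only a guessed, compatible definition created through pymysql:

    import pymysql

    conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root',
                           password='zx125', db='zx', charset='utf8')
    with conn.cursor() as cursor:
        # hypothetical schema matching 'insert into wen values(%s,%s)'
        cursor.execute('create table if not exists wen (title varchar(255), content text)')
    conn.commit()
    conn.close()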
    

    Settings

    ITEM_PIPELINES = {
       'zx_spider.pipelines.ZxSpiderPipeline': 300,
       'zx_spider.pipelines.MysqlSpiderPipeline': 301,
        # 300 is the priority; the smaller the number, the higher the priority
    }
    