  • day26 - Crawlers - A first look at the Scrapy framework

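    1. About the framework: high-performance asynchronous downloading, parsing, and persistent storage
    2. Download, install, and create a project
    pip install wheel
    Twisted: 5-step installation (the Windows steps are listed below)
    II. Installation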
    
      Linux:
    
          pip3 install scrapy
    
     
    
      Windows:
    
          a. pip3 install wheel
    
          b. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    
          c. Change into the download directory and run pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
    
          d. pip3 install pywin32
    
          e. pip3 install scrapy
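
After the install steps above, a quick way to confirm that the packages are visible to the interpreter is to ask Python whether it can locate them. This is only a minimal sketch using the standard library; `check_installed` is a hypothetical helper name, and the module names checked are the import names (`twisted`, `scrapy`, `wheel`), which are assumed to match your environment.

```python
import importlib.util


def check_installed(module_names):
    """Return {module name: True/False} depending on whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


if __name__ == "__main__":
    # Top-level import names only; nothing is actually imported here.
    for name, ok in check_installed(["wheel", "twisted", "scrapy"]).items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

If `scrapy` is reported as missing, repeat the install steps for your platform before creating a project.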

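    scrapy startproject <project name>
    3. Using a project: 5 steps (summarized from the video):
    1. Create the project: scrapy startproject fristBlood
    2. cd fristBlood, then create a spider file: scrapy genspider chouti www.chouti.com (this adds chouti.py under spiders; check name and start_urls, and comment out allowed_domains)
    3. Write the parse method in chouti.py
    4. Configure settings.py: spoof the User-Agent and set ROBOTSTXT_OBEY = False
    5. Once configured, run in cmd: scrapy crawl <spider name> (the value of the spider's name attribute)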

    1. Crawling chouti (project fristBlood)
    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class ChoutiSpider(scrapy.Spider):
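        # Spider name: used to select which spider to run
        name = 'chouti'
        # Allowed domains:
        #allowed_domains = ['www.chouti.com']
        # Start URL list: when the project runs, the pages at these URLs are fetched
        start_urls = ['https://dig.chouti.com/']
        
        # parse: parses the page data returned for each URL in start_urls
        # response: the response object for the request sent to a start URL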
        def parse(self, response):
            print(response)
    chouti.py

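    2. Crawling Qiubai (qiushibaike): note the parse method (project qiubaiPro)
    # extract() returns the text stored in a Selector object
    parse must return an iterable (here, a list of dicts)
    Terminal-based export: scrapy crawl qiubai -o data.csv --nolog (rarely used)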
    # -*- coding: utf-8 -*-
    import scrapy
    
    
    class QiubaiSpider(scrapy.Spider):
        name = 'qiubai'
        #allowed_domains = ['www.fdsfds.com']
        start_urls = ['https://www.qiushibaike.com/text/']
    
        def parse(self, response):
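            # xpath() returns a list whose elements are Selector objects
            div_list = response.xpath('//div[@id="content-left"]/div')
            # List that collects the parsed records
            all_data = []
            
            for div in div_list:
                # extract() returns the text stored in a Selector object
                # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
                author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()  # first match, no [0] needed; same result as the line above when there is a match
                content = div.xpath('.//div[@class="content"]/span//text()').extract()  # //text() matches several nodes, so extract() returns a list
                content = "".join(content)  # join the list into one string
                
                # Named record rather than dict, so the built-in dict is not shadowed
                record = {
                    'author': author,
                    'content': content
                }
                all_data.append(record)
                
            return all_data
            # Persistent storage options:
            #   1. Terminal command: parse must return an iterable object
            #   2. Item pipeline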
    qiubai.py
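
The terminal-based export only works because parse returns an iterable of dict-like records that can be serialized row by row. The sketch below illustrates that idea with the standard csv module; it is not Scrapy's feed exporter, and `records_to_csv` and the sample values are made up for illustration.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialize an iterable of dicts to CSV text (illustration only)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# extract() on //text() gives a list of fragments; join them as the spider does.
fragments = ["\n", "first line", "second line", "\n"]
content = "".join(fragments).strip()

# One dict per parsed entry, the same shape that parse returns in all_data.
records = [{"author": "example_user", "content": content}]
csv_text = records_to_csv(records, ["author", "content"])
print(csv_text)
```

Each key of the returned dicts becomes a column, which is why every record should carry the same keys.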

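    3. Crawling Qiubai with pipelines: note the item (project pipeLinePro)
    Write pipelines.py
    Enable ITEM_PIPELINES in settings.py (lines 67-69 of the generated file)
    In ITEM_PIPELINES, a smaller number means higher priority (that pipeline runs earlier)
    One pipeline writes to disk, the other writes to a database

    Suppress log output: scrapy crawl chouti --nolog
    cls clears the screen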
    import scrapy
    
    
    class PipelineproItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        author = scrapy.Field()
        content = scrapy.Field()
    items.py
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for pipeLinePro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'pipeLinePro'
    
    SPIDER_MODULES = ['pipeLinePro.spiders']
    NEWSPIDER_MODULE = 'pipeLinePro.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'pipeLinePro (+http://www.yourdomain.com)'
    USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'pipeLinePro.middlewares.PipelineproSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'pipeLinePro.middlewares.PipelineproDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'pipeLinePro.pipelines.PipelineproPipeline': 300,
        'pipeLinePro.pipelines.MyPipeline': 301,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    settings.py
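
The numbers in ITEM_PIPELINES decide the order in which pipelines see each item: lower values run first. A minimal sketch of that ordering rule, using the same dictionary as the settings file; `pipeline_order` is a hypothetical helper, not part of Scrapy.

```python
# Same mapping as in settings.py: class path -> priority number.
ITEM_PIPELINES = {
    "pipeLinePro.pipelines.PipelineproPipeline": 300,
    "pipeLinePro.pipelines.MyPipeline": 301,
}


def pipeline_order(pipelines):
    """Return the pipeline paths sorted so the lowest number comes first."""
    return [path for path, _priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
print(order)
```

With these values the file-writing pipeline handles each item before the MySQL pipeline, so it must return the item for the second pipeline to receive it.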
    # -*- coding: utf-8 -*-
    import scrapy
    from pipeLinePro.items import PipelineproItem
    
    class QiubaiSpider(scrapy.Spider):
        name = 'qiubai'
        #allowed_domains = ['www.ds.com']
        start_urls = ['https://www.qiushibaike.com/text/']
    
        def parse(self, response):
            # xpath() returns a list whose elements are Selector objects
            div_list = response.xpath('//div[@id="content-left"]/div')
        
            for div in div_list:
                # extract() returns the text stored in a Selector object
                # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
                author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
                content = div.xpath('.//div[@class="content"]/span//text()').extract()
                content = "".join(content)
                
                # Instantiate the item object
                item = PipelineproItem()
                # Store the parsed values in the item
                item['author'] = author
                item['content'] = content
                
                # Submit the item to the pipelines
                yield item
                
            # Persistent storage options:
            # 1. Terminal command: parse must return an iterable object
            # 2. Item pipeline:
                # 1. items.py: instantiate the class defined there (the item object holds the parsed values).
                # 2. pipelines.py: receives each item submitted by the spider and persists its values
            
    qiubai.py (pipeline version)
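
Both spider listings prefer extract_first() over [0].extract(). When a match exists the two give the same string; they differ when the XPath matches nothing, where extract_first() returns None and indexing raises IndexError. The sketch below mimics that difference with plain lists; `first_or_default` is a hypothetical stand-in, not a Scrapy function.

```python
def first_or_default(values, default=None):
    """Hypothetical helper: first element of a list, or `default` when it is empty."""
    return values[0] if values else default


matches = ["some_author"]  # stands in for an XPath query that matched one node
no_matches = []            # stands in for an XPath query that matched nothing

print(first_or_default(matches))
print(first_or_default(no_matches))

try:
    no_matches[0]
except IndexError:
    print("indexing an empty result raises IndexError")
```

Because the author value may therefore be None, code that concatenates it into a string should be prepared for that case.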
    # -*- coding: utf-8 -*-
    
    import pymysql
    
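    class PipelineproPipeline(object):
        fp = None
    
        # Called once, when the spider starts
        def open_spider(self, spider):  # hook method called by Scrapy
            print('spider started')
            self.fp = open('./qiubai_data.txt', 'w', encoding='utf-8')
    
        # Called once for every item the spider submits; item is the submitted item object
        def process_item(self, item, spider):  # hook method called by Scrapy
            author = item['author']
            content = item['content']
            self.fp.write(author + ":" + content + "\n")  # one record per line
    
            # Return the item so the next pipeline (MyPipeline) receives it
            return item
    
        # Called once, after the spider finishes
        def close_spider(self, spider):  # hook method called by Scrapy
            print('spider finished')
            self.fp.close()
    
    class MyPipeline(object):
        conn = None
        cursor = None
    
        # Called once, when the spider starts: open the MySQL connection
        def open_spider(self, spider):
            self.conn = pymysql.Connect(host="192.168.12.65", port=3306, db="scrapyDB", charset="utf8", user="root")
            self.cursor = self.conn.cursor()
            print('mysql')
    
        # Called once for every item the spider submits
        def process_item(self, item, spider):
            author = item['author']
            content = item['content']
    
            # qiubai is the table name; %s placeholders let pymysql escape the values
            sql = "insert into qiubai values(%s, %s)"
            try:
                self.cursor.execute(sql, (author, content))  # run the statement
                self.conn.commit()  # commit on success
            except Exception as e:
                print(e)
                self.conn.rollback()  # roll back on failure
            return item
    
        # Called once, after the spider finishes: release the database resources
        def close_spider(self, spider):
            self.cursor.close()
            self.conn.close()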
    pipelines.py
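
The order in which the three hook methods fire can be simulated without Scrapy. In this sketch `RecordingPipeline` and `run_crawl` are hypothetical stand-ins: the driver only imitates the call order, while the real framework schedules downloads and item processing asynchronously.

```python
import io


class RecordingPipeline:
    """Stand-in pipeline that records the order in which its hooks run."""

    def __init__(self):
        self.calls = []
        self.out = None
        self.text = ""

    def open_spider(self, spider):
        self.calls.append("open_spider")
        self.out = io.StringIO()  # stands in for the opened file

    def process_item(self, item, spider):
        self.calls.append("process_item")
        self.out.write(f"{item['author']}:{item['content']}\n")
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        self.calls.append("close_spider")
        self.text = self.out.getvalue()
        self.out.close()


def run_crawl(items, pipeline, spider=None):
    """Hypothetical driver: open once, process each yielded item, close once."""
    pipeline.open_spider(spider)
    for item in items:
        pipeline.process_item(item, spider)
    pipeline.close_spider(spider)


def parse():
    # Stands in for a spider's parse method yielding items one at a time.
    yield {"author": "a1", "content": "c1"}
    yield {"author": "a2", "content": "c2"}


pipeline = RecordingPipeline()
run_crawl(parse(), pipeline)
print(pipeline.calls)
```

This is why the file is opened in open_spider and closed in close_spider rather than inside process_item, which runs once per item.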

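    Pipeline workflow in 4 steps (my own summary from the video):
    Prerequisite: the parse method has already extracted the data.
    1. Store the parsed values in an item object (the fields must be declared in items.py first)
    2. Use the yield keyword to submit the item to the pipelines
    3. In pipelines.py, write the process_item method of the PipelineproPipeline class
    4. Enable the pipelines in the settings file

    1. # Instantiate the item object
    item = PipelineproItem()
    2. Declare the fields in items.py
    3. # Store the parsed values in the item
    item['author'] = author
    item['content'] = content
    4. # Submit the item to the pipelines
    yield item
    Writing to the database: create the database and the table first
    select * from qiubai shows what was written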
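
The database step can be tried locally without a MySQL server. The sketch below uses an in-memory sqlite3 database as a stand-in; the two TEXT columns are an assumed schema for the qiubai table, and `save_item` is a hypothetical helper. Note that sqlite3 uses ? as its placeholder where pymysql uses %s. Passing the values separately from the SQL text means quotes in scraped content cannot break the statement.

```python
import sqlite3

# In-memory database standing in for the MySQL server; schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qiubai (author TEXT, content TEXT)")


def save_item(conn, item):
    """Insert one item with placeholders, committing on success and rolling back on error."""
    try:
        conn.execute(
            "INSERT INTO qiubai VALUES (?, ?)",
            (item["author"], item["content"]),
        )
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise


# The quotes in these values would break a statement built with string formatting.
save_item(conn, {"author": "O'Brien", "content": "it's a quote test"})

rows = conn.execute("SELECT * FROM qiubai").fetchall()
print(rows)
conn.close()
```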
  • Original article: https://www.cnblogs.com/lijie123/p/9998441.html