  • 20 - Web scraping with the Scrapy framework: CrawlSpider (07)

    CrawlSpider

    CrawlSpider is a subclass of Spider; Spider is the base class of the spiders defined in spider files.
    - A subclass always extends the functionality of its parent class.

    • Purpose: designed specifically for full-site data crawling
      • i.e. crawling the data behind every page number linked from a page
    • Basic usage
      • Create a project: scrapy startproject proName
      • Inside the project, create a CrawlSpider-based spider file (a sketch of the generated skeleton follows this list)
        • scrapy genspider -t crawl spiderName www.xxx.com
      • Run the project: scrapy crawl spiderName
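
    For reference, this is roughly the skeleton that scrapy genspider -t crawl produces (the exact comments vary by Scrapy version); spiderName and www.xxx.com are the placeholder values used above:

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class SpidernameSpider(CrawlSpider):
        name = 'spiderName'
        allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.xxx.com/']

        # one (LinkExtractor, Rule) pair; edit the allow regex to match the links you want followed
        rules = (
            Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            item = {}
            # populate the item from the response here
            return item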

    Notes

    • Each link extractor is paired with exactly one rule parser (you can define multiple link extractor / rule parser pairs; see the sketch after this list)
    • To implement deep (detail-page) crawling, CrawlSpider has to be combined with scrapy.Request()
    • link = LinkExtractor(allow=r'')  # with an empty allow and follow=True, every link on the site gets extracted
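
    A minimal sketch of what those notes mean in code; the regexes, callback names, and URL here are placeholders rather than values taken from the example site:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class PairedRulesSpider(CrawlSpider):
        name = 'paired_rules'
        start_urls = ['http://www.xxx.com/']

        # each LinkExtractor is wired to exactly one Rule (and callback)
        page_link = LinkExtractor(allow=r'page=\d+')      # pagination links
        detail_link = LinkExtractor(allow=r'detail/\d+')  # detail-page links

        rules = (
            Rule(page_link, callback='parse_page', follow=True),      # keep following newly discovered page links
            Rule(detail_link, callback='parse_detail', follow=False),
            # Rule(LinkExtractor(allow=r''), follow=True),  # empty allow + follow=True -> every link on the site
        )

        def parse_page(self, response):
            pass  # parse list-page data here

        def parse_detail(self, response):
            pass  # parse detail-page data here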

    Basic crawl


    Spider source code

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    
    class TestSpider(CrawlSpider):
        name = 'test'
        #allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.521609.com/daxuemeinv/']
    
        # Link extractor: extracts links (urls) from the page according to the rule given by the allow parameter
        # allow = "regex": the rule used for extracting links
        link = LinkExtractor(allow=r'list8\d+\.html')  # instantiate a LinkExtractor object
        # link = LinkExtractor(allow=r'')  # empty allow + follow=True -> extract every link on the site
        rules = (
            # instantiates a Rule object
            # Rule parser: takes the links produced by the link extractor, sends requests to them,
            # and parses the responses according to the specified callback
            Rule(link, callback='parse_item', follow=True),
        )
        # follow = True
        # keeps applying the link extractor to the pages behind the page-number links it has already extracted
        def parse_item(self, response):
            print(response)
            # parse the data from the response here
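
    To check what a given allow regex actually matches before running the whole spider, the extractor can be exercised by hand in scrapy shell; a small sketch using the URL and regex from the spider above:

    # in a terminal:  scrapy shell http://www.521609.com/daxuemeinv/
    from scrapy.linkextractors import LinkExtractor

    le = LinkExtractor(allow=r'list8\d+\.html')
    for link in le.extract_links(response):  # `response` is provided by scrapy shell
        print(link.url)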
    
    
    


    Deep crawl

    Implementing deep crawling with CrawlSpider

    • The general approach: CrawlSpider plus Spider-style manual requests

    • Create a project: scrapy startproject proName

    • Inside the project, create a CrawlSpider-based spider file

      • scrapy genspider -t crawl spiderName www.xxx.com
    • Run the project: scrapy crawl spiderName


    settings.py

    # Scrapy settings for sunPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'sunPro'
    
    SPIDER_MODULES = ['sunPro.spiders']
    NEWSPIDER_MODULE = 'sunPro.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    LOG_LEVEL = "ERROR"
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'sunPro.middlewares.SunproSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'sunPro.middlewares.SunproDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'sunPro.pipelines.SunproPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
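
    Everything in the file above is the generated template default except the following four settings, which are the ones that matter for this example:

    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"  # identify as a normal browser
    ROBOTSTXT_OBEY = False   # do not honour robots.txt for this exercise
    LOG_LEVEL = "ERROR"      # only log errors, keeping console output readable
    ITEM_PIPELINES = {
        'sunPro.pipelines.SunproPipeline': 300,  # enable the project pipeline
    }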
    
    

    items.py

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class SunproItem(scrapy.Item):
        # fields parsed from the list pages
        title = scrapy.Field()
        status = scrapy.Field()

    class SunProItemDetail(scrapy.Item):
        # field parsed from the detail pages
        content = scrapy.Field()
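
    Two item classes are used here because the list page and the detail page are handled by two independent Rule callbacks, and each callback can only yield the fields it sees on its own page; the pipeline below therefore has to tell the two item types apart.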
    
    
    

    pipelines.py

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    # useful for handling different item types with a single interface
    from itemadapter import ItemAdapter
    
    
    class SunproPipeline:
        def process_item(self, item, spider):
            # two different item types reach this pipeline; dispatch on the class name
            if item.__class__.__name__ == 'SunproItem':
                title = item['title']
                status = item['status']
                print(title + ":" + status)
            else:
                content = item['content']
                print(content)

            return item
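
    Dispatching on __class__.__name__ works, but comparing against the item classes themselves avoids any risk of a typo in the string; a sketch of the same pipeline written with isinstance:

    from sunPro.items import SunproItem, SunProItemDetail


    class SunproPipeline:
        def process_item(self, item, spider):
            # dispatch on the item class instead of its name
            if isinstance(item, SunproItem):
                print(item['title'] + ":" + item['status'])
            elif isinstance(item, SunProItemDetail):
                print(item['content'])
            return item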
    
    

    sun.py (spider source file)

    With this approach the data does get crawled, but when persisting it the title and the content cannot be matched up one-to-one; to fix that, the detail-page requests have to be issued manually (see the next section).

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from sunPro.items import SunproItem,SunProItemDetail
    
    
    
    class TestSpider(CrawlSpider):
        name = 'sun'
        #allowed_domains = ['www.xxx.com']
        start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']
    
        # Link extractor: extracts links (urls) from the page according to the rule given by the allow parameter
        # allow = "regex": the rule used for extracting links
        link = LinkExtractor(allow=r'id=1&page=\d+')  # instantiate a LinkExtractor object (page-number links)
        link_detail = LinkExtractor(allow=r'index\?id=\d+')  # detail-page urls
        # link = LinkExtractor(allow=r'')  # empty allow + follow=True -> extract every link on the site
        rules = (
            # each entry instantiates a Rule object
            # Rule parser: takes the links produced by the link extractor, sends requests to them,
            # and parses the responses according to the specified callback
            Rule(link, callback='parse_item', follow=True),
            Rule(link_detail, callback='parse_detail'),
        )
        # follow = True
        # keeps applying the link extractor to the pages behind the page-number links it has already extracted

        # title & status
        def parse_item(self, response):
            li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
            for li in li_list:
                title = li.xpath('./span[3]/a/text()').extract_first()
                status = li.xpath('./span[2]/text()').extract_first()
                item = SunproItem()
                item['title'] = title
                item['status'] = status
                yield item
            # Deep crawl: scrape the data on the detail pages
            # 1. capture the detail-page urls
            # 2. send requests to the detail-page urls and extract their data

        def parse_detail(self, response):
            content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
            item = SunProItemDetail()
            item['content'] = content
            yield item
            # The spider hands two different kinds of item to the pipeline, so the pipeline
            # receives two different item types and has to determine which one it got.
            # With this approach the data is crawled, but title and content cannot be matched
            # one-to-one during persistence; the detail requests need to be issued manually.
    
    
    
    

    Full-site deep crawl with CrawlSpider + Spider

    items.py

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class SunproItem(scrapy.Item):
        # a single item now carries both the list-page fields and the detail-page field
        title = scrapy.Field()
        status = scrapy.Field()
        content = scrapy.Field()
    
    
    

    pipelines.py

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    # useful for handling different item types with a single interface
    from itemadapter import ItemAdapter
    
    
    class SunproPipeline:
        def process_item(self, item, spider):
            print(item)
            return item
    
    

    sun.py (spider source file)

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from sunPro.items import SunproItem
    
    
    
    class TestSpider(CrawlSpider):
        name = 'sun'
        #allowed_domains = ['www.xxx.com']
        start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']
    
        # Link extractor: extracts links (urls) from the page according to the rule given by the allow parameter
        # allow = "regex": the rule used for extracting links
        link = LinkExtractor(allow=r'id=1&page=\d+')  # instantiate a LinkExtractor object (page-number links)
        #link_detail = LinkExtractor(allow=r'index\?id=\d+')  # detail-page urls (not needed in this version)

        rules = (
            # each entry instantiates a Rule object
            # Rule parser: takes the links produced by the link extractor, sends requests to them,
            # and parses the responses according to the specified callback
            Rule(link, callback='parse_item', follow=True),
           # Rule(link_detail, callback='parse_detail'),
        )
        # follow = True
        # keeps applying the link extractor to the pages behind the page-number links it has already extracted

        # title & status
        def parse_item(self, response):
            li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
            for li in li_list:
                title = li.xpath('./span[3]/a/text()').extract_first()
                status = li.xpath('./span[2]/text()').extract_first()
                detail_url = "http://wz.sun0769.com" + li.xpath('./span[3]/a/@href').extract_first()  # detail-page url
                item = SunproItem()
                item['title'] = title
                item['status'] = status
                # issue the detail-page request manually and pass the item along via meta
                yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

        def parse_detail(self, response):
            content = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
            item = response.meta['item']  # pick the item back up from the request meta
            item['content'] = content
            yield item
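
    A small variation worth knowing: Scrapy 1.7 and later also allow passing the item through cb_kwargs instead of meta, so that parse_detail receives it as a named argument. Only the lines sketched below would change:

    # inside parse_item, instead of the meta-based request:
    yield scrapy.Request(url=detail_url, callback=self.parse_detail, cb_kwargs={'item': item})

    # parse_detail then takes the item as a keyword argument:
    def parse_detail(self, response, item):
        item['content'] = response.xpath('/html/body/div[3]/div[2]/div[2]/div[2]/pre/text()').extract_first()
        yield item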
    
    
    
    
    
    
    

