  • 21 - Crawlers: Using Selenium with the Scrapy Framework (08)

    Using Selenium in Scrapy
    Case study: crawl all the news data (title + content) from five sections of NetEase News (news.163.com): Domestic, International, Military, Aviation, and Drones.

    Basic usage

    Create a Scrapy project: scrapy startproject proName
    Enter the project directory and create a spider file (a plain scrapy.Spider in this case, not a CrawlSpider):
    scrapy genspider spiderName www.xxx.com
    Run the project: scrapy crawl spiderName
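
    For this particular project, the commands would look roughly like this (the project name SeleniumTest, the spider name test, and the start domain are taken from the files below):

    scrapy startproject SeleniumTest
    cd SeleniumTest
    scrapy genspider test news.163.com
    scrapy crawl test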

    Analysis

    • The data needed from the homepage is not dynamically loaded
      • Scrape the URL of each section from the homepage
    • The news list on each section page is dynamically loaded
      • Scrape the news title + detail-page URL
      • The data on each news detail page is not dynamically loaded
        • Scrape the news content from the detail page
    • Workflow for using Selenium in Scrapy (a minimal sketch follows this list)
      • 1. Instantiate a browser object in the spider class and keep it as a class attribute
      • 2. Perform the browser-automation operations in the downloader middleware
      • 3. Override closed(self, spider) in the spider class and close the browser object inside it
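
    A minimal sketch of this three-step pattern, assuming Chrome and a matching chromedriver are available locally; the class names and URL here are placeholders, and the complete project files for the NetEase case follow below.

    # spider: step 1 (browser as a class attribute) and step 3 (close it in closed())
    import scrapy
    from selenium import webdriver

    class DemoSpider(scrapy.Spider):
        name = 'demo'
        start_urls = ['https://example.com/']  # placeholder start URL
        bro = webdriver.Chrome()               # step 1: one shared browser instance

        def parse(self, response):
            pass                               # normal Scrapy parsing goes here

        def closed(self, spider):
            self.bro.quit()                    # step 3: quit the browser when the spider finishes

    # middleware: step 2 (use the spider's browser to render the responses that need it)
    from scrapy.http import HtmlResponse

    class DemoDownloaderMiddleware:
        def process_response(self, request, response, spider):
            if request.url in getattr(spider, 'model_urls', []):  # only the responses that need rendering
                spider.bro.get(request.url)
                return HtmlResponse(url=request.url, body=spider.bro.page_source,
                                    encoding='utf-8', request=request)
            return response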

    settings.py

    # Scrapy settings for SeleniumTest project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'SeleniumTest'
    
    SPIDER_MODULES = ['SeleniumTest.spiders']
    NEWSPIDER_MODULE = 'SeleniumTest.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    LOG_LEVEL = "ERROR"
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'SeleniumTest.middlewares.SeleniumtestSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
       'SeleniumTest.middlewares.SeleniumtestDownloaderMiddleware': 543,
    }
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       'SeleniumTest.pipelines.SeleniumtestPipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    

    middlewares.py

    from scrapy import signals
    from itemadapter import is_item, ItemAdapter
    from scrapy.http import HtmlResponse  # Scrapy's ready-made HTML response class
    import time
    class SeleniumtestDownloaderMiddleware:
        def process_request(self, request, spider):
    
            return None
    
        # Intercept every response.
        # The whole project issues 1 + 5 + n requests, so there are also 1 + 5 + n response objects.
        # Only the 5 section responses fail to meet our needs (their news lists are loaded dynamically).
        # Tamper with the response data of just those 5 responses.
        def process_response(self, request, response, spider):
            # Pick out the 5 section responses from all the intercepted responses
            if request.url in spider.model_urls:
                bro = spider.bro
                # `response` here is one of the 5 responses that do not meet our needs.
                # To tamper with its data, first obtain the fully rendered page with Selenium,
                # then wrap that data in a new response object.
                bro.get(request.url)  # request the section URL with the browser
                time.sleep(2)
                bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
                time.sleep(1)
                # page_source now holds everything rendered in the section page,
                # including the dynamically loaded news list
                page_text = bro.page_source

                # response.text is a read-only property, so it cannot simply be reassigned:
                # response.text = page_text
                # return response

                # Instead, return a new response object that replaces the old, unsatisfactory one
                return HtmlResponse(url=request.url, body=page_text, encoding="utf-8", request=request)  # the 5 tampered responses
            else:
                return response  # the other 1 + n responses
    
        def process_exception(self, request, exception, spider):
    
            pass
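
    The fixed time.sleep calls above are a blunt way to wait for the dynamically loaded news list. As a hedged alternative (not in the original post; the CSS selector below is a placeholder that would need to match the real section-page markup), Selenium's explicit waits could replace the sleeps:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def wait_for_news_list(bro, timeout=10):
        # block until at least one news entry is present in the DOM
        WebDriverWait(bro, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'div.ndi_main'))
        )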
    
    
    
    
    

    pipelines.py

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    # useful for handling different item types with a single interface
    from itemadapter import ItemAdapter
    
    
    class SeleniumtestPipeline:
        def process_item(self, item, spider):
            print(item)
            return item
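
    The pipeline above only prints each item. As a hedged sketch (not part of the original post), persisting the items to a local file could use Scrapy's standard open_spider/close_spider hooks; such a pipeline would also need to be registered in ITEM_PIPELINES in settings.py:

    import json

    class FileSavePipeline:
        def open_spider(self, spider):
            # called once when the spider starts
            self.fp = open('./news.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # write one JSON object per line
            self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
            return item

        def close_spider(self, spider):
            # called once when the spider finishes
            self.fp.close()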
    
    

    items.py

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class SeleniumtestItem(scrapy.Item):
        title = scrapy.Field()
        content = scrapy.Field()
    
    
    

    test.py (spider source file)

    import scrapy
    from selenium import webdriver
    from SeleniumTest.items import SeleniumtestItem
    
    class TestSpider(scrapy.Spider):
        name = 'test'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['http://news.163.com/']
        model_urls = []  # stores the URL of each target section

        # Instantiate one spider-wide browser object as a class attribute
        bro = webdriver.Chrome()
        # Parse the homepage: extract the URL of each section
        def parse(self, response):
            li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
            indexs = [3,4,6,7,8]  # li indexes of the Domestic, International, Military, Aviation and Drones sections
            for index in indexs:
                model_li = li_list[index]
                mode_url = model_li.xpath('./a/@href').extract_first()
                self.model_urls.append(mode_url)
            # Issue a request for each section URL
            for url in self.model_urls:
                yield scrapy.Request(url=url,callback=self.parse_model)
    
        # Parse a section page: news title + detail-page URL (dynamically loaded data)
        def parse_model(self,response):
            div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
            for div in div_list:
                title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
                new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
                if new_detail_url:
                    item = SeleniumtestItem()  # instantiate an item object
                    item['title'] = title
                    # request the news detail page, passing the item along via meta
                    yield scrapy.Request(url=new_detail_url,callback=self.parse_new_detail,meta={'item':item})
        def parse_new_detail(self,response):
            # parse the news content from the detail page
            content = response.xpath('//*[@id="endText"]/p/text()').extract()
            content = ''.join(content)
            item = response.meta['item']
            item['content'] = content
            yield item
    
        # Close the browser. Scrapy calls closed() once, at the very end, when the spider finishes.
        def closed(self,spider):
            self.bro.quit()
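
    Since bro = webdriver.Chrome() opens a visible browser window for the whole crawl, the browser could optionally be started headless. A hedged sketch using standard Selenium Chrome options (not part of the original post):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument('--headless')       # run Chrome without a visible window
    bro = webdriver.Chrome(options=options)  # drop-in replacement for the class attribute above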
    
    
    
  • Original article: https://www.cnblogs.com/gemoumou/p/13635324.html