  • scrapy middleware

    Applications of the downloader middleware

    • Scrapy has two kinds of middleware: spider middleware and downloader middleware
    • Downloader middleware is used far more often
    • Downloader middleware:
      • Purpose: intercept requests and responses in bulk
      • Intercepting requests:
      • UA spoofing: assign the requests as many different User-Agent identities as possible
        • request.headers['User-Agent'] = 'xxx'
      • Proxy switching: request.meta['proxy'] = 'http://ip:port'
      • Intercepting responses: tamper with the response data or replace the response object outright (a middleware skeleton follows below)
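
      For reference, a minimal skeleton of a downloader middleware and what each return value means, following Scrapy's documented contract (this skeleton is a sketch, not code from the original examples):

      class SketchDownloaderMiddleware(object):

          def process_request(self, request, spider):
              # return None: keep processing this request normally
              # return a Response: skip the download and hand this response to the engine
              # return a Request: drop this request and schedule the returned one instead
              return None

          def process_response(self, request, response, spider):
              # must return a Response (this one or a new one) or a Request
              return response

          def process_exception(self, request, exception, spider):
              # return None to let other middleware handle the exception,
              # or return a Request to resend a fixed-up request
              return None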

    1 Request-interception middleware

    • Purpose:

      • UA spoofing: assign the requests as many different User-Agent identities as possible
        • request.headers['User-Agent'] = 'xxx'
      • Proxy switching: request.meta['proxy'] = 'http://ip:port'
    • Example: crawling the 4567tv video site:

      • spider.py
      import scrapy
      from moviespider.items import MoviespiderItem
      
      class MovieSpiderSpider(scrapy.Spider):
          name = 'movie_spider'
          # allowed_domains = ['https://www.4567tv.tv/index.php/vod/show/class/动作/id/1.html']
          start_urls = ['https://www.4567tv.tv/index.php/vod/show/class/动作/id/1.html']
          url = 'https://www.4567tv.tv/index.php/vod/show/class/动作/id/1/page/%d.html'
          pageNum = 1
      
          def parse(self, response):
              li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
              for li in li_list:
                  title = li.xpath('./div[1]/a/@title').extract_first()
                  detail_url = 'https://www.4567tv.tv' + li.xpath('./div[1]/a/@href').extract_first()
                  item = MoviespiderItem()
                  item['title'] = title
                  # meta is a dict; it gets passed through to the callback specified below
                  yield scrapy.Request(detail_url, callback=self.parse_detail, meta={"item": item})
      
          def parse_detail(self, response):
              # receive meta via response.meta
              item = response.meta['item']
              desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
              item["desc"] = desc
              yield item
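
      The class attributes url and pageNum suggest pagination, but the parse() above never uses them. A minimal sketch of how the next pages would usually be requested at the end of parse(), after the for loop (the 5-page cutoff is an assumption, not stated in the original post):

              # inside parse(), after the for loop over li_list
              if self.pageNum < 5:  # assumed page limit
                  self.pageNum += 1
                  new_url = self.url % self.pageNum
                  yield scrapy.Request(new_url, callback=self.parse)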
      
      • items.py

        • define the title and desc fields (a minimal sketch follows)
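
        A minimal sketch of what items.py could contain, using the field names from the spider above:

        import scrapy

        class MoviespiderItem(scrapy.Item):
            title = scrapy.Field()  # movie title
            desc = scrapy.Field()   # movie description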
      • pipelines.py

        • persist the scraped items (a minimal sketch follows)
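
        A minimal sketch of a storage pipeline that writes each item to a local file (the file name movie.txt is an arbitrary choice); it must be enabled via ITEM_PIPELINES in settings.py:

        class MoviespiderPipeline(object):
            fp = None

            def open_spider(self, spider):
                # runs once when the spider starts
                self.fp = open('movie.txt', 'w', encoding='utf-8')

            def process_item(self, item, spider):
                self.fp.write(item['title'] + ':' + str(item['desc']) + '\n')
                return item

            def close_spider(self, spider):
                # runs once when the spider closes
                self.fp.close()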
      • middlewares.py

        • the downloader middleware file
        from scrapy import signals
        import random
        
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        ]  # common browser user agents
        
        PROXY_http = [
            '153.180.102.104:80',
        ]
        PROXY_https = [
            '120.83.49.90:9000',
        ]
        
        
        class MoviespiderDownloaderMiddleware(object):
        
            # intercepts normal requests; the request parameter is the intercepted request object
            def process_request(self, request, spider):
                print('i am process_request()')
                # assign the intercepted requests as many different User-Agent identities as possible
                request.headers['User-Agent'] = random.choice(user_agent_list)
                # proxy switching
                if request.url.split(':')[0] == 'http':
                    request.meta['proxy'] = 'http://' + random.choice(PROXY_http)  # http://ip:port
                else:
                    request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        
                return
        
            # intercepts responses; the response parameter is the intercepted response object
            def process_response(self, request, response, spider):
                print('i am process_response()')
                return response
        
            def process_exception(self, request, exception, spider):
                print('i am process_exception()')
                # fix up the request that raised an exception, then resend it
                # proxy switching
                if request.url.split(':')[0] == 'http':
                    request.meta['proxy'] = 'http://' + random.choice(PROXY_http)  # http://ip:port
                else:
                    request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        
                return request  # resend the corrected request
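
        The middleware above only takes effect once it is registered in settings.py. A minimal sketch, assuming the default layout generated by scrapy startproject moviespider (543 is Scrapy's conventional priority for this slot):

        # settings.py
        DOWNLOADER_MIDDLEWARES = {
            'moviespider.middlewares.MoviespiderDownloaderMiddleware': 543,
        }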
        
        
    • Common browser user_agent values

      user_agent_list = [
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
          "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
          "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
          "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
          "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
          "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
          "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
          "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
          "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
          "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
          "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
          "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
          "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
          "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
          "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
          "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
          "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
          "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
          "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
          "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
          "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
          "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
          "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
          "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
          "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
          "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
          "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
          "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
      ]
      

    2 Response-interception middleware

    • Purpose: intercepting responses lets you tamper with the response data or replace the response object outright

    • Using selenium inside scrapy:

      • Instantiate the browser object: in the spider class (its constructor, or a class attribute as in the example below)
      • Close the browser: quit it in the spider's closed(self, reason) method
      • Run the browser-automation steps inside the middleware
    • Example:

      • Requirement: crawl the news titles and content under the five NetEase News sections 国内 (domestic), 国际 (international), 军事 (military), 航空 (aviation), and 无人机 (drones); the data is loaded dynamically

      • spider.py

        # -*- coding: utf-8 -*-
        import scrapy
        from selenium import webdriver
        from wangyinews.items import WangyinewsItem
        
        class NewsSpider(scrapy.Spider):
            name = 'news'
            # allowed_domains = ['www.wangyi.com']
            start_urls = ['https://news.163.com']
            five_model_urls = []
            bro = webdriver.Chrome(executable_path=r'D:\教学视频\python 爬虫\tools\chromedriver.exe')
        
            # parses out the urls of the five sections, then sends requests for them manually
            def parse(self, response):
                model_index = [3, 4, 6, 7, 8]
                li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
                for index in model_index:
                    li = li_list[index]
                    # collect the url of each of the five sections
                    model_url = li.xpath('./a/@href').extract_first()
                    self.five_model_urls.append(model_url)
                    # manually send a request for each section url
                    yield scrapy.Request(model_url, callback=self.parse_model)
        
            # parse the news titles and the detail-page urls on each section page
            # problem: this response does not contain the dynamically loaded news data of the section, so it does not meet our needs
            def parse_model(self, response):
                div_list = response.xpath('/html/body/div[1]/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
                for div in div_list:
                    title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
                    detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
                    item = WangyinewsItem()
                    item['title'] = title
                    # request the detail page to parse out the news content
                    yield scrapy.Request(detail_url, callback=self.parse_new_content, meta={'item': item})
        
            # extract the news content from the detail page
            def parse_new_content(self, response):
                item = response.meta["item"]
                content = response.xpath('//*[@id="endText"]//text()').extract()
                item["content"] = content
                yield item
        
            # runs when the spider finishes; quit the browser
            def closed(self, reason):
                self.bro.quit()
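
        Because the middleware drives a real Chrome window, it is common to run the browser headless so that no window pops up while crawling. An optional sketch (the headless flags are an addition, not part of the original post):

        from selenium import webdriver

        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        bro = webdriver.Chrome(executable_path=r'...\chromedriver.exe', options=chrome_options)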
        
        
      • items.py

        • define the title and content fields
      • pipelines.py

        • persist the scraped items
      • middlewares.py:

        # -*- coding: utf-8 -*-
        
        # Define here the models for your spider middleware
        #
        # See documentation in:
        # https://docs.scrapy.org/en/latest/topics/spider-middleware.html
        from time import sleep
        
        from scrapy import signals
        from scrapy.http import HtmlResponse
        
        
        class WangyinewsDownloaderMiddleware(object):
        
            def process_request(self, request, spider):
                return None
        
            def process_response(self, request, response, spider):
                # spider is the instance of the spider class defined in the spider file
                # every response object is intercepted here
                # 1. find, among all responses, the five that do not meet our needs
                #    1. every response corresponds to exactly one request
                #    2. once we can locate the requests behind those five responses, we can locate the responses through them
                #    3. the requests can be located by the urls of the five sections
                #    in short: url ==> request ==> response
                # 2. fix up (replace) the five responses that do not meet our needs
                # spider.five_model_urls: the urls of the five sections
                bro = spider.bro
                if request.url in spider.five_model_urls:
                    bro.get(request.url)
                    sleep(1)
                    page_text = bro.page_source  # includes the dynamically loaded news data
                    # if the condition above holds, this response belongs to one of the five sections
                    new_response = HtmlResponse(url=request.url, body=page_text, encoding="utf-8", request=request)
                    return new_response
                return response
        
            def process_exception(self, request, exception, spider):
                pass
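
        As in the first example, the downloader middleware (and the pipeline) must be enabled in settings.py for the replacement responses and storage to take effect. A minimal sketch, assuming the class names generated by scrapy startproject wangyinews:

        # settings.py
        DOWNLOADER_MIDDLEWARES = {
            'wangyinews.middlewares.WangyinewsDownloaderMiddleware': 543,
        }
        ITEM_PIPELINES = {
            'wangyinews.pipelines.WangyinewsPipeline': 300,
        }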
        
        
