  • Python crawling: an introduction to the important components of the Scrapy framework

    1. The deduplication component (DupeFilter)

      

      By default, deduplication keeps the fingerprints of seen requests in an in-memory set(); in the custom filter below the fingerprints are stored in Redis instead, so they can be kept across runs and shared.

      The built-in filter is this class: from scrapy.dupefilters import RFPDupeFilter
         

              a. In the spider: yield Request(..., dont_filter=False)
    		
    		b. The filter class
    			from scrapy.dupefilters import BaseDupeFilter
    			import redis
    			from scrapy.utils.request import request_fingerprint
    
    			class XzxDupefilter(BaseDupeFilter):
    
    				def __init__(self,key):
    					self.conn = None
    					self.key = key
    
    				@classmethod
    				def from_settings(cls, settings):
    					key = settings.get('DUP_REDIS_KEY')
    					return cls(key)
    
    				def open(self):
    					self.conn = redis.Redis(host='127.0.0.1',port=6379)
    
    				def request_seen(self, request):
    					# compute the request's unique fingerprint
    					fp = request_fingerprint(request)
    					# sadd returns 0 if the fingerprint was already in the Redis set
    					added = self.conn.sadd(self.key, fp)
    					# True -> duplicate, the scheduler will drop the request
    					return added == 0
    		c. Configure it in settings
    			# the default dupefilter
    			# DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
    			DUPEFILTER_CLASS = 'xzx.dupfilter.XzxDupefilter'  # point it at the custom class
    		

      The unique identifier (fingerprint) for a URL/request is produced by this function:

         from scrapy.utils.request import request_fingerprint
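
         For example, request_fingerprint canonicalizes the URL, so reordering query parameters does not change the fingerprint. A minimal standalone sketch (the URL is illustrative):

            from scrapy import Request
            from scrapy.utils.request import request_fingerprint

            # Same parameters, different order -> identical fingerprint,
            # so the second request is treated as a duplicate.
            fp1 = request_fingerprint(Request('http://example.com/?k1=v1&k2=v2'))
            fp2 = request_fingerprint(Request('http://example.com/?k2=v2&k1=v1'))
            print(fp1 == fp2)  # True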

            Note: a snippet from the scheduler shows how dont_filter and the dupefilter work together:
    			    def enqueue_request(self, request):
    					# dont_filter=True  -> not request.dont_filter is False -> the dedup check is skipped
    					# dont_filter=False -> not request.dont_filter is True  -> request_seen() decides
    					if not request.dont_filter and self.df.request_seen(request):
    						# duplicate: the request is dropped
    						return False
    					# otherwise push the request onto the scheduler queue
    					dqok = self._dqpush(request)
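
    			In a spider this means a page you want to fetch again regardless of the dedup rules (a login or polling page, say) should carry dont_filter=True. A small sketch inside a callback; the URLs and callback names are made up for the example:

    			def parse(self, response):
    				# dont_filter=True: the dupefilter is bypassed, the request is always enqueued
    				yield Request(url='http://example.com/login', callback=self.parse_login, dont_filter=True)
    				# dont_filter=False (the default): request_seen() decides whether it is dropped
    				yield Request(url='http://example.com/item/1', callback=self.parse_item)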
    

     2. The scheduler

      1. Breadth-first (essentially a queue: first in, first out)

      2. Depth-first (essentially a stack: last in, first out)

      3. Priority queue (e.g. a Redis sorted set in scrapy-redis)
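
      Scrapy crawls depth-first by default because its scheduler queues are LIFO; switching to breadth-first is only a settings change. A sketch of the relevant standard settings in settings.py:

        # Breadth-first: positive DEPTH_PRIORITY plus FIFO queues
        DEPTH_PRIORITY = 1
        SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
        SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

        # Depth-first (the default): LIFO queues
        # DEPTH_PRIORITY = 0
        # SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
        # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'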

    3. Downloader middleware

      Downloader middleware sits between the engine and the downloader.

    		a. What is downloader middleware for in Scrapy?
    			It pre-processes every request object in one place before the request is handed to the downloader.

    		b. User-Agent: a built-in middleware, enabled by default, that picks up the USER_AGENT you configure in settings
    		
    			class UserAgentMiddleware(object):
    				"""This middleware allows spiders to override the user_agent"""
    
    				def __init__(self, user_agent='Scrapy'):
    					self.user_agent = user_agent # USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    
    				@classmethod
    				def from_crawler(cls, crawler):
    					o = cls(crawler.settings['USER_AGENT'])
    					crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
    					return o
    
    				def spider_opened(self, spider):
    					self.user_agent = getattr(spider, 'user_agent', self.user_agent)
    
    				def process_request(self, request, spider):
    					if self.user_agent:
    						request.headers.setdefault(b'User-Agent', self.user_agent)
    
    			
    		c. Redirects: a built-in middleware, enabled by default
    			
    			class BaseRedirectMiddleware(object):
    
    				enabled_setting = 'REDIRECT_ENABLED'
    
    				def __init__(self, settings):
    					if not settings.getbool(self.enabled_setting):
    						raise NotConfigured
    
    					self.max_redirect_times = settings.getint('REDIRECT_MAX_TIMES')
    					self.priority_adjust = settings.getint('REDIRECT_PRIORITY_ADJUST')
    
    				@classmethod
    				def from_crawler(cls, crawler):
    					return cls(crawler.settings)
    
    				def _redirect(self, redirected, request, spider, reason):
    					ttl = request.meta.setdefault('redirect_ttl', self.max_redirect_times)
    					redirects = request.meta.get('redirect_times', 0) + 1
    
    					if ttl and redirects <= self.max_redirect_times:
    						redirected.meta['redirect_times'] = redirects
    						redirected.meta['redirect_ttl'] = ttl - 1
    						redirected.meta['redirect_urls'] = request.meta.get('redirect_urls', []) + \
    							[request.url]
    						redirected.dont_filter = request.dont_filter
    						redirected.priority = request.priority + self.priority_adjust
    						logger.debug("Redirecting (%(reason)s) to %(redirected)s from %(request)s",
    									 {'reason': reason, 'redirected': redirected, 'request': request},
    									 extra={'spider': spider})
    						return redirected
    					else:
    						logger.debug("Discarding %(request)s: max redirections reached",
    									 {'request': request}, extra={'spider': spider})
    						raise IgnoreRequest("max redirections reached")
    
    				def _redirect_request_using_get(self, request, redirect_url):
    					redirected = request.replace(url=redirect_url, method='GET', body='')
    					redirected.headers.pop('Content-Type', None)
    					redirected.headers.pop('Content-Length', None)
    					return redirected
    
    
    			class RedirectMiddleware(BaseRedirectMiddleware):
    				"""
    				Handle redirection of requests based on response status
    				and meta-refresh html tag.
    				"""
    				def process_response(self, request, response, spider):
    					if (request.meta.get('dont_redirect', False) or
    							response.status in getattr(spider, 'handle_httpstatus_list', []) or
    							response.status in request.meta.get('handle_httpstatus_list', []) or
    							request.meta.get('handle_httpstatus_all', False)):
    						return response
    
    					allowed_status = (301, 302, 303, 307, 308)
    					if 'Location' not in response.headers or response.status not in allowed_status:
    						return response
    
    					location = safe_url_string(response.headers['location'])
    
    					redirected_url = urljoin(request.url, location)
    
    					if response.status in (301, 307, 308) or request.method == 'HEAD':
    						redirected = request.replace(url=redirected_url)
    						return self._redirect(redirected, request, spider, response.status)
    
    					redirected = self._redirect_request_using_get(request, redirected_url)
    					return self._redirect(redirected, request, spider, response.status)
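
    		The middleware can be steered per request through the meta keys it checks above (dont_redirect, handle_httpstatus_list) or globally via REDIRECT_ENABLED / REDIRECT_MAX_TIMES. A small sketch inside a spider callback; the URL and callback name are made up:

    			def parse(self, response):
    				# turn redirects off for this request and inspect the 3xx response yourself
    				yield Request(
    					url='http://example.com/old-page',
    					callback=self.parse_redirect,
    					meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302]},
    				)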
    
    			
    		d. Cookies: a built-in middleware, enabled by default

                Usage: in your own spider, add meta={"cookiejar": <key>} when yielding the request:

    			def start_requests(self):
    				for url in self.start_urls:
    					yield Request(url=url, callback=self.parse, meta={"cookiejar": 1})

    			class CookiesMiddleware(object):
    				"""This middleware enables working with sites that need cookies"""

    				def __init__(self, debug=False):
    					self.jars = defaultdict(CookieJar)
    					self.debug = debug

    				@classmethod
    				def from_crawler(cls, crawler):
    					if not crawler.settings.getbool('COOKIES_ENABLED'):
    						raise NotConfigured
    					return cls(crawler.settings.getbool('COOKIES_DEBUG'))

    				def process_request(self, request, spider):
    					if request.meta.get('dont_merge_cookies', False):
    						return

    					# cookiejarkey = 1
    					cookiejarkey = request.meta.get("cookiejar")
    					jar = self.jars[cookiejarkey]  # CookieJar object -> starts out as an empty container
    					cookies = self._get_request_cookies(jar, request)
    					for cookie in cookies:
    						jar.set_cookie_if_ok(cookie, request)

    					# set Cookie header
    					request.headers.pop('Cookie', None)
    					jar.add_cookie_header(request)
    					self._debug_cookie(request, spider)

    				def process_response(self, request, response, spider):
    					if request.meta.get('dont_merge_cookies', False):
    						return response

    					# extract cookies from Set-Cookie and drop invalid/expired cookies
    					cookiejarkey = request.meta.get("cookiejar")
    					jar = self.jars[cookiejarkey]
    					jar.extract_cookies(response, request)
    					self._debug_set_cookie(response, spider)

    					return response

    				def _debug_cookie(self, request, spider):
    					if self.debug:
    						cl = [to_native_str(c, errors='replace')
    							  for c in request.headers.getlist('Cookie')]
    						if cl:
    							cookies = "\n".join("Cookie: {}\n".format(c) for c in cl)
    							msg = "Sending cookies to: {}\n{}".format(request, cookies)
    							logger.debug(msg, extra={'spider': spider})

    				def _debug_set_cookie(self, response, spider):
    					if self.debug:
    						cl = [to_native_str(c, errors='replace')
    							  for c in response.headers.getlist('Set-Cookie')]
    						if cl:
    							cookies = "\n".join("Set-Cookie: {}\n".format(c) for c in cl)
    							msg = "Received cookies from: {}\n{}".format(response, cookies)
    							logger.debug(msg, extra={'spider': spider})

    				def _format_cookie(self, cookie):
    					# build cookie string
    					cookie_str = '%s=%s' % (cookie['name'], cookie['value'])

    					if cookie.get('path', None):
    						cookie_str += '; Path=%s' % cookie['path']

    					if cookie.get('domain', None):
    						cookie_str += '; Domain=%s' % cookie['domain']

    					return cookie_str

    				def _get_request_cookies(self, jar, request):
    					if isinstance(request.cookies, dict):
    						cookie_list = [{'name': k, 'value': v} for k, v in six.iteritems(request.cookies)]
    					else:
    						cookie_list = request.cookies

    					cookies = [self._format_cookie(x) for x in cookie_list]
    					headers = {'Set-Cookie': cookies}
    					response = Response(request.url, headers=headers)

    					return jar.make_cookies(response, request)

    		Default downloader middlewares:

    			DOWNLOADER_MIDDLEWARES_BASE = {
    				# Engine side
    				'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    				'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    				'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    				'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    				'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    				'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    				'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    				'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    				'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    				'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    				'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    				'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    				'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    				'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    				# Downloader side
    			}
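
    		Because self.jars is keyed by meta["cookiejar"], different keys keep completely separate cookie sessions within one spider. A minimal sketch (URLs and callback names are illustrative):

    			def start_requests(self):
    				# each start URL gets its own CookieJar, so the sessions never mix
    				for i, url in enumerate(self.start_urls):
    					yield Request(url=url, callback=self.parse, meta={"cookiejar": i})

    			def parse(self, response):
    				# pass the same key along so follow-up requests reuse that session's cookies
    				yield Request(
    					url='http://example.com/profile',
    					callback=self.parse_profile,
    					meta={"cookiejar": response.meta["cookiejar"]},
    				)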

        Notes:

        process_request normally returns nothing (None), and the request continues on toward the downloader.

            1. If it returns a Response, the download is skipped and that response is handed to process_response, starting from the last middleware.

            2. If it returns a Request, the new request is sent straight back to the scheduler.

        process_response: must have a return value (a Response, a Request, or raise IgnoreRequest).
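
    As a concrete illustration of those rules, here is a sketch of a custom downloader middleware that attaches a proxy in process_request and re-queues non-200 responses from process_response. The class name, proxy addresses and module path are made up for the example:

    			import random

    			class RandomProxyMiddleware(object):
    				"""Illustrative middleware, not part of Scrapy."""

    				PROXIES = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']  # made-up addresses

    				def process_request(self, request, spider):
    					# returning None lets the request continue on to the downloader
    					request.meta['proxy'] = random.choice(self.PROXIES)
    					return None

    				def process_response(self, request, response, spider):
    					# must return a Response or a Request (or raise IgnoreRequest)
    					if response.status != 200:
    						# returning a Request sends it back to the scheduler for another try
    						return request.replace(dont_filter=True)
    					return response

    Register it in settings, e.g. DOWNLOADER_MIDDLEWARES = {'xzx.middlewares.RandomProxyMiddleware': 543} (the path is illustrative).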

     

    4. Spider middleware

      Spider middleware sits between the engine (coming from the downloader) and the spider: it handles responses going into the spider and the requests/items coming back out.

      The defaults include the depth middleware, which tracks crawl depth and can also adjust request priority.

    Writing a spider middleware (the default project template):
    			
    			class XzxSpiderMiddleware(object):
    				# Not all methods need to be defined. If a method is not defined,
    				# scrapy acts as if the spider middleware does not modify the
    				# passed objects.
    
    				@classmethod
    				def from_crawler(cls, crawler):
    					# This method is used by Scrapy to create your spiders.
    					s = cls()
    					return s
    
    				def process_spider_input(self, response, spider):
    					# Called for each response that goes through the spider
    					# middleware and into the spider.
    
    					# Should return None or raise an exception.
    					return None
    
    				def process_spider_output(self, response, result, spider):
    					# Called with the results returned from the Spider, after
    					# it has processed the response.
    
    					# Must return an iterable of Request, dict or Item objects.
    					for i in result:
    						yield i
    
    				def process_spider_exception(self, response, exception, spider):
    					# Called when a spider or process_spider_input() method
    					# (from other spider middleware) raises an exception.
    
    					# Should return either None or an iterable of Response, dict
    					# or Item objects.
    					pass
    
    				def process_start_requests(self, start_requests, spider):
    					# Called with the start requests of the spider, and works
    					# similarly to the process_spider_output() method, except
    					# that it doesn’t have a response associated.
    
    					# Must return only requests (not items).
    					for r in start_requests:
    						yield r
    
    
    		Enable it in the settings file:
    			SPIDER_MIDDLEWARES = {
    			   'xzx.middlewares.XzxSpiderMiddleware': 543,
    			}
    						
    		Settings for the built-in spider middlewares:
    			Depth limit:
    				DEPTH_LIMIT = 8
    			Priority adjustment by depth:
    				DEPTH_PRIORITY = 1   -> request priorities become 0, -1, -2, -3, ... (deeper = lower priority)
    				DEPTH_PRIORITY = -1  -> request priorities become 0, 1, 2, 3, ... (deeper = higher priority)
    				
    			SPIDER_MIDDLEWARES_BASE = {
    				# Engine side
    				'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    				'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    				'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    				'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    				'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    				# Spider side
    			}
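
    		For intuition, here is a stripped-down sketch of what a depth-style spider middleware does in process_spider_output (an illustration, not Scrapy's actual DepthMiddleware source): read the parent response's depth, stamp the child requests, drop anything past the limit and nudge the priority.

    			from scrapy import Request

    			class SimpleDepthMiddleware(object):
    				"""Simplified illustration of depth tracking."""

    				def __init__(self, maxdepth=8, prio=1):
    					self.maxdepth = maxdepth  # plays the role of DEPTH_LIMIT
    					self.prio = prio          # plays the role of DEPTH_PRIORITY

    				def process_spider_output(self, response, result, spider):
    					depth = response.meta.get('depth', 0) + 1
    					for item in result:
    						if isinstance(item, Request):
    							if self.maxdepth and depth > self.maxdepth:
    								continue  # too deep: drop the request
    							item.meta['depth'] = depth
    							item.priority -= depth * self.prio  # with prio > 0, deeper requests get lower priority
    						yield item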
    

      

    Summary:

    1. DupeFilter

    - Fingerprints are kept in a set by default
    - Each URL/request is turned into a unique fingerprint
    - Why move the dedup fingerprints into Redis? (so they persist and can be shared by several crawler processes)
    - Deduplication works together with dont_filter

    2. Scheduler

    - What do depth-first and breadth-first mean for a crawler?
    - What data structures implement them?
    - A stack (LIFO) for depth-first
    - A queue (FIFO) for breadth-first
    - A priority queue (e.g. a sorted set)

    3. The open/closed principle:

      Closed to modification of the source code, open to configuration: you get the behaviour you want by changing the settings file.

