  • Scrapy extension middleware: re-requesting through a proxy for specific response status codes

    0. References

    https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect

    https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy

    1. Main logic

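    In real crawls, sending requests too frequently often gets them temporarily redirected to a login page (302) or refused outright (403). Such responses can be retried once through a proxy:

    (1) Following the built-in redirect.py module, pass the response through unchanged when dont_redirect, handle_httpstatus_list, or a similar condition applies.

    (2) If condition (1) does not apply and the response status code is 302 or 403, re-issue the request through the proxy.

    (3) If the status code is still 302 or 403 after going through the proxy, drop the request.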
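
    The three rules above can be sketched as a small pure function, independent of Scrapy. This is only an illustration: the name decide and its return labels are hypothetical, and the spider-level handle_httpstatus_list check from redirect.py is left out for brevity.

```python
def decide(status, meta, proxy_status=(302, 403)):
    """Return 'pass', 'retry_with_proxy', or 'drop' for one response."""
    # (1) The request asked to handle this status itself: pass it through.
    if (meta.get('dont_redirect', False)
            or status in meta.get('handle_httpstatus_list', [])
            or meta.get('handle_httpstatus_all', False)):
        return 'pass'
    # Statuses outside proxy_status are not this middleware's concern.
    if status not in proxy_status:
        return 'pass'
    # (2) First blocked response: retry once through the proxy.
    if not meta.get('auto_proxy'):
        return 'retry_with_proxy'
    # (3) Still blocked after the proxy retry: give up.
    return 'drop'


print(decide(200, {}))                      # pass
print(decide(302, {}))                      # retry_with_proxy
print(decide(403, {'auto_proxy': True}))    # drop
print(decide(302, {'dont_redirect': True})) # pass
```

    The auto_proxy flag in meta is what limits the retry to a single attempt: it is absent on the first response and present on the second.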

    2. Code

    Save as /site-packages/my_middlewares.py

    # Python 3 only: urljoin comes from the standard library instead of six.
    from urllib.parse import urljoin

    from scrapy.exceptions import IgnoreRequest
    from w3lib.url import safe_url_string
    
    
    class MyAutoProxyDownloaderMiddleware(object):
    
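        def __init__(self, settings):
            # Status codes that trigger one retry through the proxy.
            # getlist also accepts a comma-separated string (e.g. from -s on the command line).
            self.proxy_status = [int(x) for x in settings.getlist('PROXY_STATUS', [302, 403])]
            # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=proxy#module-scrapy.downloadermiddlewares.httpproxy
            # The default is only a placeholder; set PROXY_CONFIG to a real proxy URL.
            self.proxy_config = settings.get('PROXY_CONFIG', 'http://username:password@some_proxy_server:port')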
    
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(settings=crawler.settings)
    
    
        # See /site-packages/scrapy/downloadermiddlewares/redirect.py
        def process_response(self, request, response, spider):
            if (request.meta.get('dont_redirect', False) or
                    response.status in getattr(spider, 'handle_httpstatus_list', []) or
                    response.status in request.meta.get('handle_httpstatus_list', []) or
                    request.meta.get('handle_httpstatus_all', False)):
                return response
    
            if response.status in self.proxy_status:
                if 'Location' in response.headers:
                    location = safe_url_string(response.headers['location'])
                    redirected_url = urljoin(request.url, location)
                else:
                    redirected_url = ''
                    
                # First time: re-issue the same request through the proxy.
                if not request.meta.get('auto_proxy'):
                    # Copy meta instead of mutating the original request's dict.
                    new_meta = dict(request.meta, auto_proxy=True, proxy=self.proxy_config)
                    new_request = request.replace(meta=new_meta, dont_filter=True)
                    new_request.priority = request.priority + 2

                    spider.logger.debug('Will AutoProxy for <{} {}> {}'.format(
                        response.status, request.url, redirected_url))
                    return new_request

                # Second time: still blocked after going through the proxy, so drop it.
                else:
                    spider.logger.warning('Ignoring response <{} {}>: HTTP status code still in {} after AutoProxy'.format(
                        response.status, request.url, self.proxy_status))
                    raise IgnoreRequest
    
            return response
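
    The redirected_url in the middleware is computed only for the log message. As a minimal sketch of that step using the standard library alone, the helper below (resolve_location is a hypothetical name, and example.com is a stand-in URL) shows how a relative or absolute Location header is resolved against the request URL:

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Resolve a Location header value against the URL that was requested."""
    # An empty header yields an empty string, mirroring the middleware's else branch.
    if not location:
        return ''
    return urljoin(request_url, location)


request_url = 'http://httpbin.org/status/302'

# A relative Location is resolved against the request URL.
print(resolve_location(request_url, '/redirect/1'))
# -> http://httpbin.org/redirect/1

# An absolute Location replaces the request URL entirely.
print(resolve_location(request_url, 'https://example.com/login'))
# -> https://example.com/login
```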

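    3. Usage

    (1) Add the following to the project's settings.py. Note that the middleware's order value must fall between those of the default RedirectMiddleware (600) and HttpProxyMiddleware (750). Scrapy calls process_response in decreasing order of these values, so at 601 this middleware sees a 302 before RedirectMiddleware gets the chance to follow it.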

    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        # 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
        'my_middlewares.MyAutoProxyDownloaderMiddleware': 601,
        # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,    
    }
    PROXY_STATUS = [302, 403]
    PROXY_CONFIG = 'http://username:password@some_proxy_server:port'
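
    To see why 601 works, the sketch below imitates how Scrapy orders downloader middlewares: process_request is called in increasing order, process_response in decreasing order. It is a plain-Python simulation, not Scrapy code; request_path and response_path are hypothetical helper names, and the list only shows ordering (a middleware that lacks one of the two hooks is simply skipped in that direction).

```python
# Order values from the settings above and Scrapy's documented defaults.
ORDER = {
    'RedirectMiddleware': 600,
    'MyAutoProxyDownloaderMiddleware': 601,
    'HttpProxyMiddleware': 750,
}


def request_path(order):
    """Middleware names in the order process_request is called (ascending)."""
    return sorted(order, key=order.get)


def response_path(order):
    """Middleware names in the order process_response is called (descending)."""
    return sorted(order, key=order.get, reverse=True)


print(request_path(ORDER))
# ['RedirectMiddleware', 'MyAutoProxyDownloaderMiddleware', 'HttpProxyMiddleware']
print(response_path(ORDER))
# ['HttpProxyMiddleware', 'MyAutoProxyDownloaderMiddleware', 'RedirectMiddleware']
```

    On the response path the custom middleware comes before RedirectMiddleware, so a 302 can be intercepted and retried through the proxy instead of being followed.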

    4. Results

    2018-07-18 18:42:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)
    2018-07-18 18:42:38 [test] DEBUG: Will AutoProxy for <302 http://httpbin.org/status/302> http://httpbin.org/redirect/1
    2018-07-18 18:42:43 [test] DEBUG: Will AutoProxy for <403 https://httpbin.org/status/403>
    2018-07-18 18:42:51 [test] WARNING: Ignoring response <302 http://httpbin.org/status/302>: HTTP status code still in [302, 403] after AutoProxy
    2018-07-18 18:42:52 [test] WARNING: Ignoring response <403 https://httpbin.org/status/403>: HTTP status code still in [302, 403] after AutoProxy

    Proxy server log:

    squid [18/Jul/2018:18:42:53 +0800] "GET http://httpbin.org/status/302 HTTP/1.1" 302 310 "-" "Mozilla/5.0" TCP_MISS:HIER_DIRECT
    squid [18/Jul/2018:18:42:54 +0800] "CONNECT httpbin.org:443 HTTP/1.1" 200 3560 "-" "-" TCP_TUNNEL:HIER_DIRECT
  • Original article: https://www.cnblogs.com/my8100/p/scrapy_middleware_autoproxy.html