  • Scrapy spider middlewares: the offsite and referer middlewares

    The environment is a Python 3.6 environment created with Anaconda.


    source activate python36

    mac@macdeMacBook-Pro:~$ source activate python36
    (python36) mac@macdeMacBook-Pro:~$ cd /www
    (python36) mac@macdeMacBook-Pro:/www$ scrapy startproject testMiddlewile
    New Scrapy project 'testMiddlewile', using template directory '/Users/mac/anaconda3/envs/python36/lib/python3.6/site-packages/scrapy/templates/project', created in:
    You can start your first spider with:
        cd testMiddlewile
        scrapy genspider example example.com
    (python36) mac@macdeMacBook-Pro:/www$ cd testMiddlewile/
    (python36) mac@macdeMacBook-Pro:/www/testMiddlewile$ scrapy genspider -t crawl yeves yeves.cn
    Created spider 'yeves' using template 'crawl' in module:
    (python36) mac@macdeMacBook-Pro:/www/testMiddlewile$     



    scrapy crawl yeves


    (python36) mac@macdeMacBook-Pro:/www/testMiddlewile$     scrapy crawl yeves
    2019-11-10 09:10:27 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: testMiddlewile)
    2019-11-10 09:10:27 [scrapy.utils.log] INFO: Versions: lxml, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 13:42:17) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.7, Platform Darwin-17.7.0-x86_64-i386-64bit
    2019-11-10 09:10:27 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'testMiddlewile', 'NEWSPIDER_MODULE': 'testMiddlewile.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['testMiddlewile.spiders']}
    2019-11-10 09:10:27 [scrapy.extensions.telnet] INFO: Telnet Password: 29995a24067c48f8
    2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled extensions:
    2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
    2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled spider middlewares:
    2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled item pipelines:
    2019-11-10 09:10:27 [scrapy.core.engine] INFO: Spider opened
    2019-11-10 09:10:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2019-11-10 09:10:27 [scrapy.extensions.telnet] INFO: Telnet console listening on
    2019-11-10 09:10:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yeves.cn/robots.txt> from <GET http://yeves.cn/robots.txt>
    2019-11-10 09:10:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.yeves.cn/robots.txt> (referer: None)
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 9 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 10 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 14 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 15 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 21 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 22 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 27 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 28 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 29 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 30 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 31 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 32 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 36 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 37 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 39 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 41 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 42 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 43 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 47 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 48 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 49 without any user agent to enforce it on.
    2019-11-10 09:10:30 [protego] DEBUG: Rule at line 53 without any user agent to enforce it on.
    2019-11-10 09:10:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.yeves.cn/> from <GET http://yeves.cn/>
    2019-11-10 09:10:30 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.yeves.cn/robots.txt> (referer: None)
    2019-11-10 09:10:30 [protego] DEBUG: Rule at l


    As the output above shows, Scrapy enables five spider middlewares by default:

    2019-11-10 09:10:27 [scrapy.middleware] INFO: Enabled spider middlewares:


    To read their source code in PyCharm, import them first:

    from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
    from scrapy.spidermiddlewares.referer import RefererMiddleware
    from scrapy.spidermiddlewares.httperror import HttpErrorMiddleware
    from scrapy.spidermiddlewares.urllength import UrlLengthMiddleware
    from scrapy.spidermiddlewares.depth import DepthMiddleware
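These five classes correspond to the entries of Scrapy's built-in SPIDER_MIDDLEWARES_BASE setting. As a hedged illustration, the fragment below lists the defaults as they are commonly documented for Scrapy 1.8 (check them against your installed version) and shows how a project's settings.py could switch one of them off by mapping its path to None. Disabling the offsite middleware here is only an example, not something the original project does.

```python
# settings.py (fragment)

# Defaults for reference, as documented for Scrapy 1.8 (verify locally).
# Lower numbers sit closer to the engine, higher numbers closer to the spider.
SPIDER_MIDDLEWARES_BASE = {
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
}

# Illustrative override: mapping a middleware path to None disables it.
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
}
```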




    """
    Offsite Spider Middleware
    See documentation in docs/topics/spider-middleware.rst
    """
    import re
    import logging
    import warnings
    from scrapy import signals
    from scrapy.http import Request
    from scrapy.utils.httpobj import urlparse_cached
    logger = logging.getLogger(__name__)
    class OffsiteMiddleware(object):
        def __init__(self, stats):
            self.stats = stats
        @classmethod
        def from_crawler(cls, crawler):
            o = cls(crawler.stats)
            crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
            return o
        def process_spider_output(self, response, result, spider):
            for x in result:
                if isinstance(x, Request):
                    if x.dont_filter or self.should_follow(x, spider):
                        yield x
                    else:
                        # Offsite request: drop it, logging each new domain only once
                        domain = urlparse_cached(x).hostname
                        if domain and domain not in self.domains_seen:
                            self.domains_seen.add(domain)
                            logger.debug(
                                "Filtered offsite request to %(domain)r: %(request)s",
                                {'domain': domain, 'request': x}, extra={'spider': spider})
                            self.stats.inc_value('offsite/domains', spider=spider)
                        self.stats.inc_value('offsite/filtered', spider=spider)
                else:
                    # Items and other non-Request objects pass through untouched
                    yield x
        def should_follow(self, request, spider):
            regex = self.host_regex
            # hostname can be None for wrong urls (like javascript links)
            host = urlparse_cached(request).hostname or ''
            return bool(regex.search(host))
        def get_host_regex(self, spider):
            """Override this method to implement a different offsite policy"""
            allowed_domains = getattr(spider, 'allowed_domains', None)
            if not allowed_domains:
                return re.compile('')  # allow all by default
            url_pattern = re.compile("^https?://.*$")
            for domain in allowed_domains:
                if url_pattern.match(domain):
                    message = ("allowed_domains accepts only domains, not URLs. "
                               "Ignoring URL entry %s in allowed_domains." % domain)
                    warnings.warn(message, URLWarning)
            domains = [re.escape(d) for d in allowed_domains if d is not None]
            regex = r'^(.*\.)?(%s)$' % '|'.join(domains)
            return re.compile(regex)
        def spider_opened(self, spider):
            self.host_regex = self.get_host_regex(spider)
            self.domains_seen = set()
    class URLWarning(Warning):
        pass

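    __init__: initializes the class and stores the stats collector.

    from_crawler: called by Scrapy's middleware manager; it creates and returns the middleware object.

    process_spider_output: processes what the spider yields and drops offsite requests.

    should_follow: decides whether a request should be followed.

    get_host_regex: builds the regular expression that matches the allowed hosts.

    spider_opened: handler for the spider_opened signal; it calls get_host_regex and resets the set of seen domains.

    Call order: from_crawler -> __init__ -> spider_opened -> get_host_regex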
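The call order above can be traced without installing Scrapy. The sketch below is only an illustration: FakeSignals, FakeCrawler, FakeSpider and TracingOffsite are hypothetical stand-ins written for this example, not Scrapy classes, and the signal is represented by a plain string.

```python
import re


class FakeSignals:
    """Hypothetical stand-in for crawler.signals: stores handlers and fires them."""

    def __init__(self):
        self.handlers = []

    def connect(self, handler, signal):
        self.handlers.append((signal, handler))

    def send(self, signal, **kwargs):
        for registered, handler in self.handlers:
            if registered == signal:
                handler(**kwargs)


class FakeCrawler:
    """Hypothetical crawler exposing only the attributes the middleware touches."""

    def __init__(self):
        self.stats = {}
        self.signals = FakeSignals()


class FakeSpider:
    allowed_domains = ['yeves.cn']


calls = []  # records the order in which the methods run


class TracingOffsite:
    """Mirrors the shape of OffsiteMiddleware, but only records calls."""

    def __init__(self, stats):
        calls.append('__init__')
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        calls.append('from_crawler')
        o = cls(crawler.stats)
        crawler.signals.connect(o.spider_opened, signal='spider_opened')
        return o

    def spider_opened(self, spider):
        calls.append('spider_opened')
        self.host_regex = self.get_host_regex(spider)
        self.domains_seen = set()

    def get_host_regex(self, spider):
        calls.append('get_host_regex')
        domains = [re.escape(d) for d in spider.allowed_domains]
        return re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))


crawler = FakeCrawler()
mw = TracingOffsite.from_crawler(crawler)                    # from_crawler -> __init__
crawler.signals.send('spider_opened', spider=FakeSpider())   # spider_opened -> get_host_regex
print(' -> '.join(calls))
```

The host regex is built only when the spider_opened signal fires, which is why the middleware connects its handler inside from_crawler.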

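    The offsite middleware checks whether each URL about to be requested belongs to one of the domains the spider allows, which keeps the crawl from wandering off to other domains.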

    allowed_domains = ['yeves.cn']
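To see what this setting allows, here is a minimal standard-library sketch of the same matching logic used by get_host_regex and should_follow. The helper names build_host_regex and is_onsite are made up for this example, and urllib.parse.urlparse stands in for Scrapy's urlparse_cached.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Build a regex matching each allowed domain and any of its subdomains."""
    if not allowed_domains:
        return re.compile('')  # an empty pattern matches every host
    domains = [re.escape(d) for d in allowed_domains if d is not None]
    return re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))


def is_onsite(url, host_regex):
    """Return True when the URL's host is covered by the regex."""
    # hostname is None for links such as javascript:void(0)
    host = urlparse(url).hostname or ''
    return bool(host_regex.search(host))


host_regex = build_host_regex(['yeves.cn'])

for url in ('https://www.yeves.cn/page/1',
            'http://yeves.cn/',
            'https://notyeves.cn/',
            'https://example.com/'):
    print(url, is_onsite(url, host_regex))
```

The `(.*\.)?` prefix is what lets www.yeves.cn through while a look-alike such as notyeves.cn is rejected, because the optional part must end with a dot.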

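    The referer middleware exists mainly because some resources, images for example, can only be fetched when the request carries a Referer header naming the page it came from. One such case is the hotlink protection configured in the Alibaba Cloud OSS console.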
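The middleware is controlled by two settings that its source reads: REFERER_ENABLED and REFERRER_POLICY. The fragment below is a hedged sketch of how they might appear in settings.py; the values shown are the ones I understand to be the defaults, so confirm them against the documentation for your Scrapy version.

```python
# settings.py (fragment)

# RefererMiddleware raises NotConfigured, and is therefore skipped,
# when this is False.
REFERER_ENABLED = True

# Policy class (or a standard policy name) used when neither the request
# meta key 'referrer_policy' nor a Referrer-Policy response header applies.
REFERRER_POLICY = 'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'
```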



    class RefererMiddleware(object):
        def __init__(self, settings=None):
            self.default_policy = DefaultReferrerPolicy
            if settings is not None:
                self.default_policy = _load_policy_class(
                    settings.get('REFERRER_POLICY'))
        @classmethod
        def from_crawler(cls, crawler):
            if not crawler.settings.getbool('REFERER_ENABLED'):
                raise NotConfigured
            mw = cls(crawler.settings)
            # Note: this hook is a bit of a hack to intercept redirections
            crawler.signals.connect(mw.request_scheduled, signal=signals.request_scheduled)
            return mw
        def policy(self, resp_or_url, request):
            """
            Determine Referrer-Policy to use from a parent Response (or URL),
            and a Request to be sent.
            - if a valid policy is set in Request meta, it is used.
            - if the policy is set in meta but is wrong (e.g. a typo error),
              the policy from settings is used
            - if the policy is not set in Request meta,
              but there is a Referrer-policy header in the parent response,
              it is used if valid
            - otherwise, the policy from settings is used.
            """
            policy_name = request.meta.get('referrer_policy')
            if policy_name is None:
                if isinstance(resp_or_url, Response):
                    policy_header = resp_or_url.headers.get('Referrer-Policy')
                    if policy_header is not None:
                        policy_name = to_native_str(policy_header.decode('latin1'))
            if policy_name is None:
                return self.default_policy()
            cls = _load_policy_class(policy_name, warning_only=True)
            return cls() if cls else self.default_policy()
        def process_spider_output(self, response, result, spider):
            def _set_referer(r):
                if isinstance(r, Request):
                    referrer = self.policy(response, r).referrer(response.url, r.url)
                    if referrer is not None:
                        r.headers.setdefault('Referer', referrer)
                return r
            return (_set_referer(r) for r in result or ())
        def request_scheduled(self, request, spider):
            # check redirected request to patch "Referer" header if necessary
            redirected_urls = request.meta.get('redirect_urls', [])
            if redirected_urls:
                request_referrer = request.headers.get('Referer')
                # we don't patch the referrer value if there is none
                if request_referrer is not None:
                    # the request's referrer header value acts as a surrogate
                    # for the parent response URL
                    # Note: if the 3xx response contained a Referrer-Policy header,
                    #       the information is not available using this hook
                    parent_url = safe_url_string(request_referrer)
                    policy_referrer = self.policy(parent_url, request).referrer(
                        parent_url, request.url)
                    if policy_referrer != request_referrer:
                        if policy_referrer is None:
                            request.headers.pop('Referer')
                        else:
                            request.headers['Referer'] = policy_referrer
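The heart of process_spider_output is small: ask the policy for a referrer value, then set the header only if the request does not already have one. The sketch below imitates that with the standard library alone. It is a simplified illustration of a "no-referrer-when-downgrade" style policy, not Scrapy's implementation; strip_url, no_referrer_when_downgrade and set_referer are hypothetical helpers, and a plain dict stands in for the request headers.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_url(url):
    """Drop credentials and the fragment, keeping scheme, host, path and query."""
    parts = urlsplit(url)
    netloc = parts.hostname or ''
    if parts.port:
        netloc = '%s:%d' % (netloc, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path or '/', parts.query, ''))


def no_referrer_when_downgrade(parent_url, request_url):
    """Return a Referer value, or None when moving from HTTPS to plain HTTP."""
    parent_secure = urlsplit(parent_url).scheme == 'https'
    target_secure = urlsplit(request_url).scheme == 'https'
    if parent_secure and not target_secure:
        return None
    return strip_url(parent_url)


def set_referer(headers, parent_url, request_url):
    """Mimic _set_referer: never overwrite a Referer the spider set itself."""
    referrer = no_referrer_when_downgrade(parent_url, request_url)
    if referrer is not None:
        headers.setdefault('Referer', referrer)
    return headers


print(set_referer({}, 'https://www.yeves.cn/list#top',
                  'https://www.yeves.cn/img/a.png'))
print(set_referer({}, 'https://www.yeves.cn/', 'http://example.com/a.png'))
```

Because setdefault is used, a spider that needs a specific Referer for a hotlink-protected image can set the header on the request itself and the middleware will leave it alone.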


    A spider middleware can define several hook functions; the offsite middleware only uses process_spider_output.

    process_spider_input 3

    process_spider_output 2

    process_start_requests 1
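For orientation, here is a minimal pass-through middleware that defines all three hooks listed above with the signatures Scrapy's spider middleware interface uses. The class name and the `seen` list are invented for this sketch. Since a spider middleware is a plain Python class, the example runs without Scrapy installed.

```python
class PassThroughSpiderMiddleware:
    """Hypothetical middleware that changes nothing and records what it sees."""

    def __init__(self):
        self.seen = []

    def process_start_requests(self, start_requests, spider):
        # Receives the spider's start requests; must yield requests.
        for request in start_requests:
            self.seen.append(('start', request))
            yield request

    def process_spider_input(self, response, spider):
        # Called for each response on its way into the spider.
        # Returning None lets processing continue.
        self.seen.append(('input', response))
        return None

    def process_spider_output(self, response, result, spider):
        # Called with whatever the spider callback produced for a response.
        # OffsiteMiddleware does its filtering in this hook.
        for item_or_request in result:
            self.seen.append(('output', item_or_request))
            yield item_or_request


mw = PassThroughSpiderMiddleware()
starts = list(mw.process_start_requests(['http://yeves.cn/'], spider=None))
mw.process_spider_input('<response>', spider=None)
outputs = list(mw.process_spider_output('<response>', ['item-1', 'item-2'], spider=None))
print(starts, outputs, len(mw.seen))
```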


  • Original article: https://www.cnblogs.com/brady-wang/p/11828957.html