  • Scrapy spider middleware: UrlLength

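    Browsers limit the maximum length of a URL that can be entered.

    Safari allows the most, roughly ten thousand characters or a bit more.

    Chrome and Firefox sit in the middle, at roughly eight thousand.

    IE allows the least, 2083.

    Source code of the urllength middleware: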

    """
    Url Length Spider Middleware
    
    See documentation in docs/topics/spider-middleware.rst
    """
    
    import logging
    
    from scrapy.http import Request
    from scrapy.exceptions import NotConfigured
    
    logger = logging.getLogger(__name__)
    
    
    class UrlLengthMiddleware(object):
    
        def __init__(self, maxlength):
            self.maxlength = maxlength
    
        @classmethod
        def from_settings(cls, settings):
            maxlength = settings.getint('URLLENGTH_LIMIT')
            if not maxlength:
                raise NotConfigured
            return cls(maxlength)
    
        def process_spider_output(self, response, result, spider):
            def _filter(request):
                if isinstance(request, Request) and len(request.url) > self.maxlength:
                    logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
                                 {'maxlength': self.maxlength, 'url': request.url},
                                 extra={'spider': spider})
                    return False
                else:
                    return True
    
            return (r for r in result or () if _filter(r))
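
    The filtering step in process_spider_output can be tried without installing Scrapy. The following is only a minimal sketch: FakeRequest and filter_by_url_length are hypothetical names invented for illustration, and FakeRequest stands in for scrapy.http.Request. It shows that only request objects are checked, that the comparison is strictly "greater than", and that other output (such as scraped items) passes through untouched.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request; only .url is used."""
    url: str


def filter_by_url_length(result: Optional[Iterable[object]], maxlength: int) -> Iterator[object]:
    """Lazily yield everything except FakeRequest objects whose URL is too long."""
    for item in result or ():
        if isinstance(item, FakeRequest) and len(item.url) > maxlength:
            continue  # the real middleware logs a debug message at this point
        yield item


items = [
    FakeRequest("http://a.io/x"),             # 13 characters, kept
    FakeRequest("http://a.io/" + "x" * 60),   # 72 characters, dropped
    {"title": "a scraped item"},              # not a request, always kept
]
kept = list(filter_by_url_length(items, maxlength=60))
print(kept)
```

    Like the generator expression in the original source, the sketch is lazy: nothing is checked until the caller iterates over the result.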
    

      

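    Scrapy ships with a default limit (URLLENGTH_LIMIT defaults to 2083).

    To use your own limit, add the setting to the project's settings file: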

    URLLENGTH_LIMIT = 60
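
    How from_settings treats this value can be sketched in plain Python. This is an illustration under assumptions, not Scrapy's own code: read_limit and is_enabled are hypothetical helpers, the local NotConfigured class stands in for scrapy.exceptions.NotConfigured, and a plain dict stands in for the settings object (so Scrapy's built-in default is not modeled). The point it shows is that a limit of 0 makes the middleware raise NotConfigured, which turns the length check off.

```python
class NotConfigured(Exception):
    """Stand-in for scrapy.exceptions.NotConfigured."""


def read_limit(settings: dict) -> int:
    """Mimic from_settings: a missing or zero limit means 'not configured'."""
    maxlength = int(settings.get("URLLENGTH_LIMIT", 0))
    if not maxlength:
        raise NotConfigured
    return maxlength


def is_enabled(settings: dict) -> bool:
    """Return True when the limit is usable, False when the check is disabled."""
    try:
        read_limit(settings)
        return True
    except NotConfigured:
        return False


print(is_enabled({"URLLENGTH_LIMIT": 60}))  # limit of 60 characters is active
print(is_enabled({"URLLENGTH_LIMIT": 0}))   # zero disables the check
```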

    If a URL is longer than the configured URLLENGTH_LIMIT, Scrapy ignores the request and logs a debug message at run time:

    logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
                                 {'maxlength': self.maxlength, 'url': request.url},
                                 extra={'spider': spider})
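
    To see what that message looks like once the placeholders are filled in, the same format string can be rendered with the standard logging module. This is a small sketch with made-up values: the logger name urllength_demo and the example.com URL are assumptions for illustration only.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")
logger = logging.getLogger("urllength_demo")  # hypothetical logger name

# Same template as in the middleware; the values are illustrative.
template = "Ignoring link (url length > %(maxlength)d): %(url)s "
values = {"maxlength": 60, "url": "http://example.com/" + "a" * 50}

# Passing a single dict lets logging fill the named placeholders.
logger.debug(template, values)

# The same text, built directly with the % operator.
rendered = template % values
print(rendered)
```

    Because the message is emitted at DEBUG level, it only appears when the log level is set to DEBUG.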
  • Original article: https://www.cnblogs.com/php-linux/p/11828999.html