  • Scrapy spider middleware: UrlLength

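    Browsers limit the maximum length of a URL that can be entered.

    Safari allows the most, a little over 10,000 characters.

    IE allows the least, 2083 characters.

    Chrome and Firefox sit in between, a little over 8,000 characters.

    Source code of the urllength middleware: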

    """
    Url Length Spider Middleware
    
    See documentation in docs/topics/spider-middleware.rst
    """
    
    import logging
    
    from scrapy.http import Request
    from scrapy.exceptions import NotConfigured
    
    logger = logging.getLogger(__name__)
    
    
    class UrlLengthMiddleware(object):
    
        def __init__(self, maxlength):
            self.maxlength = maxlength
    
        @classmethod
        def from_settings(cls, settings):
            maxlength = settings.getint('URLLENGTH_LIMIT')
            if not maxlength:
                raise NotConfigured
            return cls(maxlength)
    
        def process_spider_output(self, response, result, spider):
            def _filter(request):
                if isinstance(request, Request) and len(request.url) > self.maxlength:
                    logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
                                 {'maxlength': self.maxlength, 'url': request.url},
                                 extra={'spider': spider})
                    return False
                else:
                    return True
    
            return (r for r in result or () if _filter(r))
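
    To see the filtering rule in isolation, here is a minimal sketch that runs without Scrapy. FakeRequest and filter_by_url_length are hypothetical names made up for this illustration; they only mimic the isinstance and length check above and are not part of Scrapy.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request; it only carries a URL."""
    url: str


def filter_by_url_length(results, maxlength):
    """Yield every item except FakeRequest objects whose URL exceeds maxlength."""
    for item in results or ():
        if isinstance(item, FakeRequest) and len(item.url) > maxlength:
            continue  # the real middleware logs a debug message at this point
        yield item


short = FakeRequest("http://a.io/1")
long_ = FakeRequest("http://a.io/" + "x" * 100)
scraped_item = {"title": "not a request, so it is never filtered"}

kept = list(filter_by_url_length([short, long_, scraped_item], maxlength=60))
print(kept)
```

    Only request objects are checked, so scraped items pass through untouched, and the `result or ()` guard means an empty or None result simply yields nothing.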
    

      

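    Scrapy sets a default length limit (URLLENGTH_LIMIT defaults to 2083).

    To use your own value, add it to the project's settings:

    URLLENGTH_LIMIT = 60

    If a URL is longer than this limit, the request for it is dropped

    and a debug-level log message is printed at run time: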

    logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
                                 {'maxlength': self.maxlength, 'url': request.url},
                                 extra={'spider': spider})
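
    The message format can be reproduced with the standard logging module alone. The sketch below is only an illustration: the logger name, the ListHandler class and the example URL are assumptions, not anything Scrapy provides.

```python
import logging

# Hypothetical logger name, used only for this demonstration.
logger = logging.getLogger("urllength_demo")
logger.setLevel(logging.DEBUG)   # the message is emitted at DEBUG level
logger.propagate = False


class ListHandler(logging.Handler):
    """Collect formatted messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)

maxlength = 60
url = "http://example.com/" + "a" * 50   # 69 characters, over the limit

if len(url) > maxlength:
    # Passing one dict as the only argument enables %(name)s style placeholders.
    logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
                 {"maxlength": maxlength, "url": url})

print(handler.messages[0])
```

    Because the message is logged at DEBUG level, it will not appear if the log level is set to INFO or higher.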
  • Original article: https://www.cnblogs.com/brady-wang/p/11828999.html