  • Scrapy spider middleware: UrlLength

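    The maximum length of a URL that a browser accepts is limited.

    Safari allows the most: a little over 10,000 characters.

    IE allows the least: 2083 characters.

    Chrome and Firefox sit in between: a little over 8,000 characters.

    Source code of the urllength middleware: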

    """
    Url Length Spider Middleware
    
    See documentation in docs/topics/spider-middleware.rst
    """
    
    import logging
    
    from scrapy.http import Request
    from scrapy.exceptions import NotConfigured
    
    logger = logging.getLogger(__name__)
    
    
    class UrlLengthMiddleware(object):
    
        def __init__(self, maxlength):
            self.maxlength = maxlength
    
        @classmethod
        def from_settings(cls, settings):
            maxlength = settings.getint('URLLENGTH_LIMIT')
            if not maxlength:
                raise NotConfigured
            return cls(maxlength)
    
        def process_spider_output(self, response, result, spider):
            def _filter(request):
                if isinstance(request, Request) and len(request.url) > self.maxlength:
                    logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
                                 {'maxlength': self.maxlength, 'url': request.url},
                                 extra={'spider': spider})
                    return False
                else:
                    return True
    
            return (r for r in result or () if _filter(r))
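
    The filtering step above can be tried without installing Scrapy. The following is a minimal sketch, not the library's code: `FakeRequest` and `filter_by_url_length` are hypothetical names introduced here, and `FakeRequest` models only the `url` attribute of `scrapy.http.Request`.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request; only `url` is modeled."""
    url: str


def filter_by_url_length(result, maxlength):
    """Yield everything in `result` except FakeRequests whose URL is too long."""
    for item in result or ():  # `or ()` tolerates a spider that returns None
        if isinstance(item, FakeRequest) and len(item.url) > maxlength:
            continue  # the real middleware also logs a debug message here
        yield item


spider_output = [
    FakeRequest("http://a.io/1"),            # 13 characters: kept
    FakeRequest("http://a.io/" + "x" * 50),  # 62 characters: dropped
    {"title": "an item, not a request"},     # non-requests always pass through
]
kept = list(filter_by_url_length(spider_output, maxlength=60))
print(kept)
```

    Only requests are checked; items yielded by the spider pass through untouched, which matches the `isinstance(request, Request)` test in the source.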
    

      

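    Scrapy ships with a default limit (URLLENGTH_LIMIT defaults to 2083).

    To use a different value, add the setting to your project's settings.py: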

    URLLENGTH_LIMIT = 60
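
    How the value is picked up follows from `from_settings` in the source: a limit of 0 raises `NotConfigured`, and Scrapy skips a middleware that raises it. The sketch below imitates that logic with plain dicts; `DEFAULTS`, `read_limit`, `is_enabled` and the local `NotConfigured` class are stand-ins assumed for illustration, not Scrapy objects.

```python
class NotConfigured(Exception):
    """Stand-in for scrapy.exceptions.NotConfigured."""


# Assumed stand-in for Scrapy's built-in default settings.
DEFAULTS = {"URLLENGTH_LIMIT": 2083}


def read_limit(project_settings):
    """Imitate from_settings: project values override defaults, 0 disables."""
    merged = {**DEFAULTS, **project_settings}
    maxlength = int(merged["URLLENGTH_LIMIT"])
    if not maxlength:
        raise NotConfigured
    return maxlength


def is_enabled(project_settings):
    """Return True when the middleware would be active for these settings."""
    try:
        read_limit(project_settings)
        return True
    except NotConfigured:
        return False


print(read_limit({}))                         # falls back to the default
print(read_limit({"URLLENGTH_LIMIT": 60}))    # project override
print(is_enabled({"URLLENGTH_LIMIT": 0}))     # 0 switches the check off
```

    A limit as low as 60 is useful for experimenting, since most real URLs will exceed it and show up in the log.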

    If a URL is longer than URLLENGTH_LIMIT, the middleware ignores the request

    and writes a debug message to the log at run time:

    logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
                                 {'maxlength': self.maxlength, 'url': request.url},
                                 extra={'spider': spider})
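
    To see what that message looks like once rendered, the same format string can be passed to the standard `logging` module. This is a sketch under assumptions: the logger name `urllength_demo`, the `ListHandler` class and the example URL are made up for illustration, and the `extra={'spider': spider}` argument is left out because no spider exists here.

```python
import logging

records = []


class ListHandler(logging.Handler):
    """Collect rendered log messages in a list instead of printing them."""

    def emit(self, record):
        records.append(record.getMessage())


logger = logging.getLogger("urllength_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)                # the message is emitted at DEBUG level
logger.addHandler(ListHandler())
logger.propagate = False                      # keep the demo output self-contained

# A single dict argument lets logging fill the %(name)s placeholders.
logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
             {'maxlength': 60, 'url': 'http://example.com/' + 'a' * 50})
print(records[0])
```

    Because the message is logged at DEBUG level, it only appears when the log level is set to DEBUG.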
  • Original article: https://www.cnblogs.com/php-linux/p/11828999.html