  • Scrapy spider middleware: UrlLength

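    Browsers limit the maximum length of a URL that can be entered.

    Safari allows the most, a little over 10,000 characters.

    IE allows the least, 2083 characters.

    Chrome and Firefox sit in between, at a little over 8,000 characters.

    Source code of the urllength middleware: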

    """
    Url Length Spider Middleware
    
    See documentation in docs/topics/spider-middleware.rst
    """
    
    import logging
    
    from scrapy.http import Request
    from scrapy.exceptions import NotConfigured
    
    logger = logging.getLogger(__name__)
    
    
    class UrlLengthMiddleware(object):
    
        def __init__(self, maxlength):
            self.maxlength = maxlength
    
        @classmethod
        def from_settings(cls, settings):
            maxlength = settings.getint('URLLENGTH_LIMIT')
            if not maxlength:
                raise NotConfigured
            return cls(maxlength)
    
        def process_spider_output(self, response, result, spider):
            def _filter(request):
                if isinstance(request, Request) and len(request.url) > self.maxlength:
                    logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
                                 {'maxlength': self.maxlength, 'url': request.url},
                                 extra={'spider': spider})
                    return False
                else:
                    return True
    
            return (r for r in result or () if _filter(r))
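
    The filtering step in process_spider_output can be tried out without installing Scrapy. The following is only a minimal sketch of the same idea: FakeRequest and filter_by_url_length are hypothetical names invented for this illustration, not part of Scrapy, and logging is left out.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request; only `url` is used."""
    url: str


def filter_by_url_length(result, maxlength):
    """Lazily yield items, dropping FakeRequest objects whose URL is too long."""
    def _keep(item):
        # Non-request items (for example scraped dicts) always pass through.
        return not (isinstance(item, FakeRequest) and len(item.url) > maxlength)

    # `result or ()` tolerates a callback that returned None.
    return (item for item in result or () if _keep(item))


short = FakeRequest("http://example.com/a")              # 20 characters
long_ = FakeRequest("http://example.com/" + "x" * 100)   # 119 characters
item = {"title": "not a request"}

kept = list(filter_by_url_length([short, long_, item], maxlength=60))
empty = list(filter_by_url_length(None, maxlength=60))
print(kept)
```

    Only requests are checked against the limit, so scraped items are never dropped by this step, and because a generator is returned nothing is filtered until the result is iterated.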
    

      

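    Scrapy ships with a default length limit (URLLENGTH_LIMIT defaults to 2083).

    To use your own limit, add the setting to settings.py:

    URLLENGTH_LIMIT = 60

    Setting it to 0 disables the middleware, because from_settings then raises NotConfigured.

    If a URL is longer than this limit, the request for it is ignored and a debug-level log line is printed at run time: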

    logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
                                 {'maxlength': self.maxlength, 'url': request.url},
                                 extra={'spider': spider})
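
    The log call passes a single dict as its argument, which the standard logging module uses to fill in the named %(...)s placeholders. The sketch below reproduces just that formatting with the standard library; the logger name urllength_demo and the ListHandler class are assumptions made for this illustration, and the extra={'spider': spider} argument is omitted because no spider exists here.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        # getMessage() applies the %-style arguments to the format string.
        self.messages.append(record.getMessage())


logger = logging.getLogger("urllength_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

maxlength = 60
url = "http://example.com/" + "x" * 100

# Same format string and dict argument as in the middleware.
logger.debug("Ignoring link (url length > %(maxlength)d): %(url)s ",
             {'maxlength': maxlength, 'url': url})

print(handler.messages[0])
```

    Because the message is logged at DEBUG level, it only shows up when the log level is set to DEBUG; at a higher level the request is still dropped, just silently.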
  • Original article: https://www.cnblogs.com/php-linux/p/11828999.html