zoukankan      html  css  js  c++  java
  • scrapy反爬虫

    反爬虫措施
    1)动态修改User-Agent
    2)动态修改ip
    3)延迟DOWNLOAD_DELAY = 0.5

    1)在middleware中新建一个类,从fake_useragent中导入UserAgent模块
    from fake_useragent import UserAgent
    class RandomUserAgentMiddleware(object):
    
        @classmethod
        def from_crawler(cls, crawler):
    
            return cls(crawler)
    
        def __init__(self,crawler):
            super(RandomUserAgentMiddleware,self).__init__()
            self.ua=UserAgent()
    
        def process_request(self, request, spider):
    
            request.headers.setdefault(b'User-Agent', self.ua.random)
    
        def spider_opened(self, spider):
            pass
    在settings设置DOWNLOADER_MIDDLEWARES
    先把系统自带的useragent禁用:None

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'JobboleSpider.middlewares.RandomUserAgentMiddleware': 543,
    
    }
    
    
    2)动态修改ip
    import random
    class RandomProxyIPMiddleware(object):
    
        @classmethod
        def from_crawler(cls, crawler):
    
            return cls(crawler)
    
        def __init__(self, crawler):
            self.ip_list = [
                "http://180.125.196.155:8888",
                 #ip代理
            ]
    
        def process_request(self, request, spider):
    
            request.meta['proxy']=random.choice(self.ip_list)
    
    
        def spider_opened(self, spider):
            pass
    
    

    3)在settings中设置延迟

    DOWNLOAD_DELAY = 0.5
     



  • 相关阅读:
    NC6开发配置流程
    触发器
    U8采购订单联查采购入库单
    sqlserver 游标
    windows服务 定时任务
    ORACLE 导入导出
    laravel 在模板中使select保存的值下单选中
    laravel 中request
    laravel中 url() route() URL::asset()
    laravel 中 后台管理的 路由设计
  • 原文地址:https://www.cnblogs.com/kidl/p/7392540.html
Copyright © 2011-2022 走看看