zoukankan      html  css  js  c++  java
  • scrapy反爬虫

    反爬虫措施
    1)动态修改User-Agent
    2)动态修改ip
    3)延迟DOWNLOAD_DELAY = 0.5

    1)在middleware中新建一个类,从fake_useragent中导入UserAgent模块
    from fake_useragent import UserAgent
    class RandomUserAgentMiddleware(object):
    
        @classmethod
        def from_crawler(cls, crawler):
    
            return cls(crawler)
    
        def __init__(self,crawler):
            super(RandomUserAgentMiddleware,self).__init__()
            self.ua=UserAgent()
    
        def process_request(self, request, spider):
    
            request.headers.setdefault(b'User-Agent', self.ua.random)
    
        def spider_opened(self, spider):
            pass
    在settings设置DOWNLOADER_MIDDLEWARES
    先把系统自带的useragent禁用:None

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'JobboleSpider.middlewares.RandomUserAgentMiddleware': 543,
    
    }
    
    
    2)动态修改ip
    import random
    class RandomProxyIPMiddleware(object):
    
        @classmethod
        def from_crawler(cls, crawler):
    
            return cls(crawler)
    
        def __init__(self, crawler):
            self.ip_list = [
                "http://180.125.196.155:8888",
                 #ip代理
            ]
    
        def process_request(self, request, spider):
    
            request.meta['proxy']=random.choice(self.ip_list)
    
    
        def spider_opened(self, spider):
            pass
    
    

    3)在settings中设置延迟

    DOWNLOAD_DELAY = 0.5
     



  • 相关阅读:
    [小知识]如何查看IIS6应用程序池所对应的进程ID
    继续向大家汇报服务器情况
    CNBlogs DotText 1.0 Beta 2 重要更新
    垃圾广告记录
    Firefox 11正式发布
    Firefox 10正式发布
    Firefox 6 正式发布
    Firefox 5 正式发布
    Firefox 9正式发布
    Firefox 8正式发布
  • 原文地址:https://www.cnblogs.com/kidl/p/7392540.html
Copyright © 2011-2022 走看看