  • Scrapy: Random User-Agent

    Scrapy's own source code shows how the built-in User-Agent is set:

    scrapy.downloadermiddlewares.useragent.UserAgentMiddleware

    """Set User-Agent header per spider or use a default value from settings"""
    
    from scrapy import signals
    
    class UserAgentMiddleware:
        """This middleware allows spiders to override the user_agent"""
    
        def __init__(self, user_agent='Scrapy'):
            self.user_agent = user_agent
    
        @classmethod
        def from_crawler(cls, crawler):
            o = cls(crawler.settings['USER_AGENT'])
            crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
            return o
    
        def spider_opened(self, spider):
            self.user_agent = getattr(spider, 'user_agent', self.user_agent)
    
        def process_request(self, request, spider):
            if self.user_agent:
                request.headers.setdefault(b'User-Agent', self.user_agent)
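
The precedence this middleware implies can be sketched without Scrapy installed. The function below is a hypothetical stand-in (resolve_user_agent is not part of Scrapy): it only mirrors the logic of spider_opened and process_request shown above, with a plain dict in place of the request headers.

```python
def resolve_user_agent(request_headers, spider_user_agent=None,
                       settings_user_agent="Scrapy"):
    """Stand-in for UserAgentMiddleware's precedence rules (not Scrapy code)."""
    # spider_opened(): a spider-level `user_agent` attribute replaces the
    # value taken from settings. None here means "the spider defines none".
    if spider_user_agent is not None:
        user_agent = spider_user_agent
    else:
        user_agent = settings_user_agent

    headers = dict(request_headers)
    # process_request(): setdefault() never overwrites a header that is
    # already present on the request.
    if user_agent:
        headers.setdefault("User-Agent", user_agent)
    return headers


default_case = resolve_user_agent({})
spider_case = resolve_user_agent({}, spider_user_agent="my-bot")
preset_case = resolve_user_agent({"User-Agent": "custom"}, spider_user_agent="my-bot")
```

In this sketch a header already set on the request wins, then the spider attribute, then the USER_AGENT setting, and finally the 'Scrapy' default from __init__.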
    

    The default configuration:

    DOWNLOADER_MIDDLEWARES_BASE = {
        ...
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
        ...
    }
    

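    First, disable the built-in UserAgentMiddleware and add our own pool of User-Agent strings to the project settings (note that USER_AGENT is a list here rather than the usual single string):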

    USER_AGENT = ['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36', 
                    'Mozilla/5.0 (X11; NetBSD) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.116 Safari/537.36', 
                    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36', 
                    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36', 
                    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36', 
                    'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36', 
                    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36', 
                    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/4E423F', 
                    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36']
    
    
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'crawler.middlewares.RandomUserAgentMiddleware': 500,
    }
    
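    # crawler/middlewares.py
    from random import choice


    class RandomUserAgentMiddleware:
        """Pick a random User-Agent from the USER_AGENT list for each request."""

        def __init__(self, user_agents):
            self.user_agents = user_agents

        @classmethod
        def from_crawler(cls, crawler):
            # USER_AGENT is a list in our settings, so read it with getlist()
            return cls(crawler.settings.getlist("USER_AGENT"))

        def process_request(self, request, spider):
            # setdefault() keeps any User-Agent already set on the request
            ua = choice(self.user_agents)
            request.headers.setdefault(b"User-Agent", ua)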
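
A middleware of this shape depends only on the standard library, so its behaviour can be checked without a running crawl. The sketch below is an illustration under assumptions: POOL is a hypothetical list of placeholder strings, and a SimpleNamespace holding a plain dict stands in for a Scrapy request and its headers.

```python
from random import choice
from types import SimpleNamespace


class RandomUserAgentMiddleware:
    """Same shape as the middleware above, repeated so the sketch runs alone."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Only fills in the header when the request does not carry one yet.
        request.headers.setdefault(b"User-Agent", choice(self.user_agents))


# Hypothetical pool; any list of User-Agent strings works.
POOL = ["agent-a", "agent-b", "agent-c"]
middleware = RandomUserAgentMiddleware(POOL)

# A request without a User-Agent receives one drawn from the pool.
fresh = SimpleNamespace(headers={})
middleware.process_request(fresh, spider=None)

# A request that already carries a User-Agent is left untouched.
preset = SimpleNamespace(headers={b"User-Agent": "custom"})
middleware.process_request(preset, spider=None)
```

Because the header is written with setdefault, a spider that sets its own User-Agent on a particular request keeps it; replacing setdefault with a plain assignment would force rotation on every request instead.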
    

    Alternatively, use a ready-made package:

    pip install scrapy-fake-useragent
    
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the built-in middleware
        'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 500,  # enable the random User-Agent middleware
    }
    
  • Original post: https://www.cnblogs.com/iFanLiwei/p/13853685.html