  • Setting a Random User-Agent in Scrapy

    Method 1: Set it in each Spider (applies to a single Spider)

    import scrapy
    from scrapy import Request


    class TencentSpider(scrapy.Spider):
        name = 'tencent'
        allowed_domains = ['hr.tencent.com']
        start_urls = ['https://hr.tencent.com/position.php']

        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        }

        def parse(self, response):
            # Find the number of the last page
            page_num = response.xpath('//div[@class="pagenav"]/a[last()-1]/text()').extract()[0]
            # Generate a request for each page, reusing the custom headers
            for i in range(1, int(page_num) + 1):
                url = "https://hr.tencent.com/position.php?&start=%s#a" % (i * 10)
                yield Request(url=url, headers=TencentSpider.headers, callback=self.parse)
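
    Besides passing headers on every Request, a single spider can also override project settings through the custom_settings class attribute. A minimal sketch, reusing the User-Agent string from the example above:

    class TencentSpider(scrapy.Spider):
        name = 'tencent'
        # Per-spider settings take precedence over the project-wide settings.py
        custom_settings = {
            'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        }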

    Method 2: Set it in a downloader middleware (global)

    Define a pool of User-Agent strings in the settings file:

    # Pool of User-Agent strings for request headers
    CUSTOM_USER_AGENT = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    ]

    Write the middleware logic:

    # Middleware that adds a random User-Agent to every request
    import random


    class RandomUserAgentMiddleware(object):

        def __init__(self, agents):
            self.agents = agents

        @classmethod
        def from_crawler(cls, crawler):
            # Read the User-Agent pool defined in settings.py
            return cls(
                agents=crawler.settings.get('CUSTOM_USER_AGENT')
            )

        def process_request(self, request, spider):
            # setdefault only fills the header if the request doesn't already carry one
            request.headers.setdefault('User-Agent', random.choice(self.agents))
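    Because setdefault only fills in a missing header, a request that already carries a User-Agent (e.g. one set per Request as in Method 1) keeps it. If the middleware should always win, assign the header unconditionally instead; a minimal variation:

        def process_request(self, request, spider):
            # Overwrite any existing User-Agent rather than only filling a missing one
            request.headers['User-Agent'] = random.choice(self.agents)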

    Enable the middleware and disable the default User-Agent middleware (mapping a middleware to None removes it):

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'day1.middlewares.RandomUserAgentMiddleware': 10,
    }

    Inspecting the request headers

    The response object wraps the request that produced it, so the headers that were actually sent can be inspected via response.request.

        def parse(self, response):
            print(response.request)
            print(response.request.headers['User-Agent'])
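
    Note that Scrapy stores header values as bytes, so the second print shows something like b'Mozilla/5.0 ...'. A minimal sketch for readable output:

        def parse(self, response):
            # Header values are bytes; decode to str for clean printing
            ua = response.request.headers.get('User-Agent', b'').decode('utf-8')
            print(ua)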

    Using the fake-useragent module to generate random User-Agents

    The User-Agents above were predefined in the settings file; alternatively, the Python module fake-useragent can generate them for us.

    Install:

    pip install fake-useragent

    Basic usage:

    from fake_useragent import UserAgent

    ua = UserAgent()
    # User-Agent for an IE browser
    print(ua.ie)
    # sample output: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; chromeframe/13.0.782.215)

    # Opera browser
    print(ua.opera)
    # Chrome browser
    print(ua.chrome)
    # Firefox browser
    print(ua.firefox)
    # Safari browser
    print(ua.safari)
    # Most practical for crawlers: headers must vary randomly, so use the
    # random property, which returns a randomly chosen User-Agent string
    print(ua.random)
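
    fake-useragent builds its User-Agent list from an online data source and caches it locally, so the first call can fail without network access. Depending on the installed version, a fallback string can be supplied for that case; a hedged sketch (the fallback keyword is version-dependent):

    from fake_useragent import UserAgent

    # Returned when the online lookup fails (availability depends on the version)
    ua = UserAgent(fallback='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36')
    print(ua.random)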

    Using it in a middleware

    from fake_useragent import UserAgent


    class RandomUserAgentMiddleware(object):

        def __init__(self):
            self.agent = UserAgent()

        @classmethod
        def from_crawler(cls, crawler):
            return cls()

        def process_request(self, request, spider):
            # random returns a freshly chosen User-Agent for every request
            request.headers.setdefault('User-Agent', self.agent.random)
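
    As before, enable it in settings.py and disable the built-in middleware (the 'day1.middlewares' path comes from the example project above; adjust it to your own project):

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'day1.middlewares.RandomUserAgentMiddleware': 10,
    }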
  • Original article: https://www.cnblogs.com/yuqiangli0616/p/9277263.html