  • Scrapy downloader middleware: randomly switching the User-Agent

    The downloader middlewares are listed below:

    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',

     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',

     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',

     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',

     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',

     'scrapy.downloadermiddlewares.retry.RetryMiddleware',

     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',

     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',

     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',

     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',

     'scrapy.downloadermiddlewares.stats.DownloaderStats']

     

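    The four methods of a downloader middleware

    from_crawler(cls, crawler): configuration hook, builds the middleware from the crawler

    process_request: handles each request

    process_response: handles each response

    process_exception: triggered when an exception occurs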
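    To make the four hooks concrete, here is a minimal sketch of a middleware that only records which hook was called. The class name SkeletonDownloaderMiddleware and the calls list are made up for illustration, and the stand-in objects in the demo replace real Scrapy requests and responses so the sketch runs without Scrapy.

```python
from types import SimpleNamespace


class SkeletonDownloaderMiddleware:
    """Illustrative only: records which hook was called, changes nothing."""

    def __init__(self):
        self.calls = []  # hypothetical attribute, used only for this demo

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the middleware; crawler.settings
        # is available here if configuration is needed.
        return cls()

    def process_request(self, request, spider):
        self.calls.append("process_request")
        return None  # None: keep processing this request normally

    def process_response(self, request, response, spider):
        self.calls.append("process_response")
        return response  # a response (or a new request) must be returned

    def process_exception(self, request, exception, spider):
        self.calls.append("process_exception")
        return None  # None: let other middlewares handle the exception


if __name__ == "__main__":
    # Stand-in objects instead of real Scrapy classes.
    mw = SkeletonDownloaderMiddleware.from_crawler(SimpleNamespace(settings={}))
    mw.process_request(SimpleNamespace(headers={}), spider=None)
    print(mw.calls)
```

    Only the hooks a middleware actually needs have to be defined; the sketch defines all four just to show their signatures and return values.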

    Randomly switching the User-Agent

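    # middlewares.py
    from faker import Faker

    class MySpiderMiddleware(object):
        # Despite the name, this class is used as a downloader middleware.
        def __init__(self):
            self.fake = Faker()

        def process_request(self, request, spider):
            # setdefault only fills the header if the request has none yet
            request.headers.setdefault('User-Agent', self.fake.user_agent())

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        #'middle.middlewares.MyCustomDownloaderMiddleware': 543,
        'middle.middlewares.MySpiderMiddleware': 100,
        # None disables the built-in User-Agent middleware
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    }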


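    Method 1: configure it in settings.py. I have not tested whether this keeps one randomly chosen value for the whole run or picks a new one for every request. Since a settings module is only evaluated when it is loaded, the random.choice call below should run once per crawl, which would mean a single User-Agent for the whole run.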

    import random

    USER_AGENT_LIST = [
        # note the comma after every entry; without it adjacent strings are joined into one
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
    USER_AGENT = random.choice(USER_AGENT_LIST)
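    The open question with Method 1 is whether the value changes between requests. The sketch below imitates the two situations in plain Python, without Scrapy: a module-level random.choice, as in the settings snippet above, and a choice made again on every call, as a middleware would do. The list entries and both function names are placeholders invented for this illustration.

```python
import random

# Placeholder values standing in for real User-Agent strings.
USER_AGENT_LIST = ["agent-a", "agent-b", "agent-c"]

# Evaluated once, when this module is loaded, like a line in settings.py.
USER_AGENT = random.choice(USER_AGENT_LIST)


def fixed_user_agent():
    """Imitates reading the USER_AGENT setting: the value never changes."""
    return USER_AGENT


def per_request_user_agent():
    """Imitates a middleware that chooses again for every request."""
    return random.choice(USER_AGENT_LIST)


if __name__ == "__main__":
    # The first set always has exactly one element.
    print({fixed_user_agent() for _ in range(20)})
    # The second set will usually contain several different values.
    print({per_request_user_agent() for _ in range(20)})
```

    If the goal is a different User-Agent per request, the choice has to happen inside process_request, which is what the next two methods do.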
    

      

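    Method 2: write your own random User-Agent middleware and enable it in settings.py. Either give it an order number that places it early, for example 100, or set the built-in UserAgentMiddleware to None so it is disabled.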
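    A minimal sketch of Method 2, assuming the User-Agent strings live in a custom setting. USER_AGENT_LIST is not a built-in Scrapy setting, and the class name RandomUserAgentMiddleware is invented here. The demo at the bottom uses stand-in objects instead of a real crawler so it runs without Scrapy.

```python
import random
from types import SimpleNamespace


class RandomUserAgentMiddleware:
    """Downloader middleware that picks a User-Agent for every request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a custom setting name (an assumption of this
        # sketch); getlist returns an empty list if it is not defined.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        if self.user_agents:
            # Overwrite the header, so the order relative to the built-in
            # middleware matters less than with setdefault.
            request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # continue normal processing


if __name__ == "__main__":
    # Stand-ins for the crawler and the request, for a quick check only.
    fake_settings = SimpleNamespace(
        getlist=lambda name, default=None: ["agent-a", "agent-b"]
    )
    mw = RandomUserAgentMiddleware.from_crawler(
        SimpleNamespace(settings=fake_settings)
    )
    request = SimpleNamespace(headers={})
    mw.process_request(request, spider=None)
    print(request.headers["User-Agent"])
```

    The class is then registered in DOWNLOADER_MIDDLEWARES in the same way as MySpiderMiddleware above.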

    Method 3: subclass the default UserAgentMiddleware directly and override its method.
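    Method 3 could look roughly like the sketch below. The names SubclassedUserAgentMiddleware and USER_AGENT_POOL are invented for illustration and the pool entries are placeholders. The except branch is only a stand-in base class so the sketch can be run on a machine without Scrapy; in a real project the import simply succeeds.

```python
import random
from types import SimpleNamespace

try:
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
except ImportError:
    # Stand-in base class, used only when Scrapy is not installed.
    class UserAgentMiddleware:
        def __init__(self, user_agent="Scrapy"):
            self.user_agent = user_agent


# Hypothetical pool; replace with real User-Agent strings.
USER_AGENT_POOL = ["example-agent/1.0", "example-agent/2.0"]


class SubclassedUserAgentMiddleware(UserAgentMiddleware):
    """Keeps the built-in setup and only changes how the header is chosen."""

    def process_request(self, request, spider):
        # Choose a value for every request; setdefault leaves a header
        # that was already set on the request untouched.
        request.headers.setdefault("User-Agent", random.choice(USER_AGENT_POOL))


if __name__ == "__main__":
    mw = SubclassedUserAgentMiddleware()
    request = SimpleNamespace(headers={})  # stand-in for a Scrapy request
    mw.process_request(request, spider=None)
    print(request.headers["User-Agent"])
```

    With this approach the built-in UserAgentMiddleware entry is set to None in DOWNLOADER_MIDDLEWARES and the subclass is registered in its place.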

    The middleware can be implemented with faker, or with a list of User-Agent strings you collect yourself.

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', self.fake.user_agent())
  • Original article: https://www.cnblogs.com/php-linux/p/11829432.html