  • UA pool and proxy (IP) pool

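    UA pool (each request uses a User-Agent picked at random from the pool)

    a) Import the base class in the middleware module (older Scrapy releases exposed it under scrapy.contrib.downloadermiddleware.useragent)

    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

    b) Define a class that subclasses UserAgentMiddleware and override its process_request method

      Example:

      middlewares.py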

    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
    import random

    # Pool of User-Agent values. Each entry is the header value only,
    # without a leading "User-Agent:" label.
    ua_list = ['Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
               'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
               'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
               'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
               'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
               'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
               'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
               'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
               'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11']
    # Proxy pools as "host:port" strings, one list per request scheme.
    ip_http_list = ['90.229.216.218:46796', '110.235.250.7:49341', '81.163.62.136:41258', '195.34.207.47:60878']
    ip_https_list = ['140.227.207.211:60088', '140.227.209.210:60088', '185.132.133.102:1080']


    class UserAgentRandom(UserAgentMiddleware):
        def process_request(self, request, spider):
            # Pick a fresh User-Agent for every outgoing request.
            ua = random.choice(ua_list)
            # Assign directly: setdefault() would leave in place the value that
            # the built-in UserAgentMiddleware (priority 500) has already set.
            request.headers['User-Agent'] = ua
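
    One detail worth checking in a middleware like the one above: setdefault only writes a header when it is missing, while plain assignment always overwrites it. Scrapy's built-in UserAgentMiddleware runs at priority 500 by default, ahead of a middleware registered at 542, so the request usually already carries a User-Agent by the time the custom class sees it. The sketch below illustrates the difference with plain dicts standing in for request.headers; the names UA_POOL, pick_user_agent, headers_a and headers_b are hypothetical, the pool is shortened, and the pre-set default value is an assumption.

```python
import random

# Shortened stand-in for the ua_list pool (values only, no "User-Agent:" label).
UA_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
]


def pick_user_agent(pool=UA_POOL):
    """Return one User-Agent string chosen at random from the pool."""
    return random.choice(pool)


# Assumption: an earlier middleware has already set a default User-Agent.
headers_a = {"User-Agent": "Scrapy/2.x (+https://scrapy.org)"}
headers_b = dict(headers_a)

# setdefault() leaves the existing value untouched.
headers_a.setdefault("User-Agent", pick_user_agent())

# Plain assignment replaces it with the randomly chosen value.
headers_b["User-Agent"] = pick_user_agent()

print(headers_a["User-Agent"])  # still the pre-set default
print(headers_b["User-Agent"])  # one of the entries in UA_POOL
```

    If the random value never shows up in the requests a spider sends, this is the first thing to rule out.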

    settings.py

    DOWNLOADER_MIDDLEWARES = {
       'handle5.middlewares.Handle5DownloaderMiddleware': 543,
       'handle5.middlewares.UserAgentRandom': 542,
       'handle5.middlewares.IpRandom': 541
    }
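
    Scrapy calls process_request on the enabled downloader middlewares in ascending priority order, and an entry set to None is disabled. As a rough illustration in plain Python (not Scrapy's own code), the sketch below merges the project settings with an assumed excerpt of the built-in defaults and prints the resulting order. BUILTIN and request_order are hypothetical names, the two built-in priorities should be checked against the installed Scrapy version, and the extra None entry is an optional addition rather than part of the original settings.

```python
# Assumed excerpt of Scrapy's built-in defaults (DOWNLOADER_MIDDLEWARES_BASE);
# check the values shipped with your Scrapy version.
BUILTIN = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": 500,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}

# Project settings from above, plus one optional extra line that switches
# the built-in User-Agent middleware off.
DOWNLOADER_MIDDLEWARES = {
    "handle5.middlewares.Handle5DownloaderMiddleware": 543,
    "handle5.middlewares.UserAgentRandom": 542,
    "handle5.middlewares.IpRandom": 541,
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
}


def request_order(builtin, project):
    """Hypothetical helper: names in the order process_request would run."""
    merged = {**builtin, **project}
    enabled = {name: prio for name, prio in merged.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


for name in request_order(BUILTIN, DOWNLOADER_MIDDLEWARES):
    print(name)
```

    IpRandom (541) runs before the built-in HttpProxyMiddleware, so request.meta["proxy"] is already set when that middleware reads it.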

    Proxy pool (the proxy IP for each request is picked at random from the IP pool)

    middlewares.py

    class IpRandom:
        def process_request(self, request, spider):
            # The scheme is the part of the URL before the first ":".
            scheme = request.url.split(":")[0]
            # Pick a proxy from the pool that matches the request's scheme.
            if scheme == "http":
                request.meta["proxy"] = "http://" + random.choice(ip_http_list)
            else:
                request.meta["proxy"] = "https://" + random.choice(ip_https_list)
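
    The scheme check in IpRandom can be factored into a small function and tried out without running a spider. This is a minimal sketch, assuming the standard-library urlsplit is used in place of splitting on ":"; choose_proxy is a hypothetical name, and the two pools are shortened copies of the lists defined earlier.

```python
import random
from urllib.parse import urlsplit

# Shortened copies of the proxy pools ("host:port" strings).
ip_http_list = ['90.229.216.218:46796', '110.235.250.7:49341']
ip_https_list = ['140.227.207.211:60088', '185.132.133.102:1080']


def choose_proxy(url):
    """Hypothetical helper: pick a proxy URL matching the request's scheme."""
    scheme = urlsplit(url).scheme.lower()
    if scheme == "http":
        return "http://" + random.choice(ip_http_list)
    # Everything else is treated as HTTPS, as in the middleware.
    return "https://" + random.choice(ip_https_list)


print(choose_proxy("http://example.com/page"))
print(choose_proxy("https://example.com/page"))
```

    Inside process_request this would reduce to request.meta["proxy"] = choose_proxy(request.url).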
  • Original article: https://www.cnblogs.com/cjj-zyj/p/10208770.html