  • Setting the User-Agent for a Python crawler

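    Most websites now use some form of anti-crawler protection. One of the simplest ways around it is to send a browser-like User-Agent, so I wrote a small User-Agent rotation middleware for Scrapy.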

    # -*- coding: utf-8 -*-
    #__author__ = "雨轩恋i"
    #__date__ = "2018-10-30"
    
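    # Import the random module
    import random
    # Import the UserAgentMiddleware class from Scrapy's useragent downloader middleware module
    from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
    
    # RotateUserAgentMiddleware extends the UserAgentMiddleware base class.
    # Purpose: keep a list of User-Agent strings, pick one at random for each request,
    #          and send it with the request so the crawler looks like a browser.
    
    # Anti-crawler background: many sites refuse requests that do not appear to come from a browser,
    #             so the crawler rotates through browser User-Agent strings to pass as one.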
    class RotateUserAgentMiddleware(UserAgentMiddleware):
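        def __init__(self, user_agent=''):
            # Let the parent class store the configured User-Agent
            super().__init__(user_agent)
    
        def process_request(self, request, spider):
            # Pick a User-Agent at random from the list for this request
            ua = random.choice(self.user_agent_list)
            if ua:
                # Print the User-Agent chosen for this request
                print(ua)
                # setdefault only sets the header if the request does not already have one
                request.headers.setdefault('User-Agent', ua)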
    
        # The default user_agent_list contains old Chrome User-Agent strings for Windows, Linux, macOS and Chrome OS.
        # More User-Agent strings (IE, Firefox, Opera, Netscape, ...) are listed at http://www.useragentstring.com/pages/useragentstring.php
        # List of User-Agent header values to rotate through
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
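
    The selection step in process_request can be tried outside Scrapy. The following is only a minimal sketch: it uses a plain dict as a stand-in for request.headers (an assumption; Scrapy's own headers object is a richer mapping), and the helper name apply_random_user_agent and the two-entry USER_AGENTS pool are hypothetical, introduced just for illustration.

```python
import random

# Hypothetical two-entry pool; the middleware above uses the longer user_agent_list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def apply_random_user_agent(headers, pool):
    """Pick one User-Agent from pool and set it only if none is present."""
    ua = random.choice(pool)
    # Same call as in process_request: an existing header is left untouched.
    headers.setdefault("User-Agent", ua)
    return headers


# A request without a User-Agent receives one from the pool.
fresh = apply_random_user_agent({}, USER_AGENTS)

# A request that already carries a User-Agent keeps it.
preset = apply_random_user_agent({"User-Agent": "custom-agent"}, USER_AGENTS)
```

    Because setdefault leaves an existing value alone, a User-Agent that is already set on a request takes precedence over the rotated one.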

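    You can add this middleware to your own crawler so that every request goes out with a randomly chosen User-Agent and looks like it came from a browser. This helps with sites that reject requests based on the User-Agent header alone; it does not help against other checks such as IP-based rate limiting.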
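
    For Scrapy to call the middleware, it has to be registered in the project's settings. A sketch of what that could look like is below; the module path myproject.middlewares is hypothetical (use wherever the class lives in your project), and the priority 400 is an arbitrary choice, not something the original post specifies.

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware so that it does not
    # set the header itself.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Hypothetical module path and arbitrary priority; adjust both to your project.
    "myproject.middlewares.RotateUserAgentMiddleware": 400,
}
```

    Disabling the built-in middleware matters because the custom class uses setdefault: if another middleware has already filled in the User-Agent header, the rotated value is silently ignored.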

  • Original post: https://www.cnblogs.com/yuxuanlian/p/9877550.html