  • Python3 --- Advanced Usage --- Proxy Crawlers

    Preface

    This article explains how to crawl through a proxy in Python 3, using urllib.request.HTTPHandler(), urllib.request.build_opener(http_handler), and urllib.request.ProxyHandler().

    Created: 20191223

    天象独行

      Before looking at urllib.request.HTTPHandler() and urllib.request.build_opener(http_handler), it helps to review urllib.request.urlopen(url, data, timeout), which sends a request to a URL. Under the hood, urlopen() is just a special opener: the default one that ships with Python 3. Its limitation is that it accepts no argument for choosing a proxy or any other custom handler. So if a request has to go through a proxy that we specify ourselves, calling urlopen() directly is not enough. Since urlopen() is itself only an opener, can we build our own? Yes, we can.

      Creating a plain opener:

    import urllib.request
    import urllib.parse
    
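    '''
    Steps for creating an opener object:
        1. Create the Handler(s) you need
        2. Create the opener object
    '''
    # Create the Handler (an instance; build_opener() also accepts the class itself)
    new_handler = urllib.request.HTTPHandler()
    # Create the custom opener object
    new_opener = urllib.request.build_opener(new_handler)
    '''
    Once created, new_opener can be used to fetch URLs much as urllib.request.urlopen() would.
    Example (headers is a dict of request headers that you define yourself):
        url = "https://www.cnblogs.com/aaron456-rgv/p/12051754.html"
        new_request = urllib.request.Request(url, headers=headers)
        result = new_opener.open(new_request)
    '''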
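
The relationship between a custom opener and urlopen() can be checked without touching the public internet. The sketch below is only an illustration: EchoUserAgent, demo() and the "demo-agent/1.0" string are made-up names, and the throwaway local server exists purely so the example has something to fetch. ProxyHandler({}) is passed so that proxy settings from the environment do not interfere with the localhost request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoUserAgent(BaseHTTPRequestHandler):
    """Hypothetical test endpoint: replies with the User-Agent it received."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the demo output quiet.
        pass


def demo():
    # Start a local server on a free port (assumption: localhost sockets are allowed).
    server = HTTPServer(("127.0.0.1", 0), EchoUserAgent)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port
    try:
        # An empty ProxyHandler mapping disables environment proxies for this opener.
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),
            urllib.request.HTTPHandler(),
        )
        request = urllib.request.Request(url, headers={"User-Agent": "demo-agent/1.0"})

        # 1. Use the opener directly.
        with opener.open(request, timeout=5) as response:
            via_opener = response.read().decode("utf-8")

        # 2. Install it globally, so that plain urlopen() goes through it too.
        urllib.request.install_opener(opener)
        with urllib.request.urlopen(request, timeout=5) as response:
            via_urlopen = response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
    return via_opener, via_urlopen


if __name__ == "__main__":
    print(demo())
```

Note that install_opener() changes process-wide state, so every later urlopen() call in the program uses the installed opener.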

      Creating a proxy opener:

    import urllib.request
    import urllib.parse
    import random
    
    '''
    Steps for creating an opener object:
        1. Create the Handler(s) you need
        2. Create the opener object
    '''
    # Define a list of proxy addresses
    proxy_ip = [
        {"http":"59.57.149.70:9999"},
        {"http":"222.190.222.238:9999"},
        {"http":"183.166.132.224:9999"},
        {"http":"114.239.145.56:808"},
        {"http":"183.166.86.243:9999"},
        {"http":"60.167.22.217:9999"},
        {"http":"117.95.55.151:9999"},
    ]
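    # Pick one proxy at random
    proxy = random.choice(proxy_ip)
    # ProxyHandler picks a proxy by URL scheme, so an "http"-only mapping would be
    # bypassed for the https:// URL below. Reuse the same address for "https"
    # (this assumes the proxy is able to tunnel HTTPS traffic).
    proxy = dict(proxy, https=proxy["http"])
    # Create the Handler
    new_Proxyhandler = urllib.request.ProxyHandler(proxy)
    # Create the custom opener object
    new_opener = urllib.request.build_opener(new_Proxyhandler)
    # Target URL
    url = "https://www.cnblogs.com/aaron456-rgv/p/12051754.html"
    # Pool of User-Agent strings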
    user_agent = [
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
        "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
        "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
        "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
        "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
        "UCWEB7.0.2.37/28/999",
        "NOKIA5700/ UCWEB7.0.2.37/28/999",
        "Openwave/ UCWEB7.0.2.37/28/999",
        "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
        # iPhone 6:
        "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
    
    ]
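    # Build the request headers with a randomly chosen User-Agent
    new_headers = {"User-Agent": random.choice(user_agent)}
    # Build the request object
    new_request = urllib.request.Request(url=url,headers=new_headers)
    # Send the request through the proxy opener
    new_result = new_opener.open(new_request)
    # Print the response body (decoded as UTF-8)
    print(new_result.read().decode())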
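
Free proxies tend to go offline without notice, so a single random pick will often fail. As a rough sketch (not part of the original script), the helper below tries the addresses in random order and moves on when one fails. The names build_proxy_opener and fetch_with_fallback are made up for this illustration, and routing https through the same address assumes the proxy can tunnel HTTPS.

```python
import http.client
import random
import urllib.error
import urllib.request


def build_proxy_opener(address):
    """Return an opener that sends http:// and https:// requests via address."""
    handler = urllib.request.ProxyHandler({"http": address, "https": address})
    return urllib.request.build_opener(handler)


def fetch_with_fallback(url, addresses, headers=None, timeout=5):
    """Try each proxy once, in random order.

    Returns the response body as bytes, or None if every proxy failed.
    """
    candidates = list(addresses)
    random.shuffle(candidates)
    for address in candidates:
        opener = build_proxy_opener(address)
        request = urllib.request.Request(url, headers=headers or {})
        try:
            with opener.open(request, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, http.client.HTTPException, OSError) as exc:
            # Dead or misbehaving proxy: report it and try the next one.
            print("proxy %s failed: %s" % (address, exc))
    return None
```

A call would look like fetch_with_fallback(url, ["59.57.149.70:9999", "222.190.222.238:9999"], headers=new_headers); a None result means no proxy in the list worked.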

      Test result: [screenshot not available]

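      0X0X; Supplementary notes

      A; A Handler can be thought of as the component that takes care of one aspect of sending a request and receiving its response.

        1. HTTPHandler(): plain HTTP, no special features
        2. ProxyHandler(proxies): ordinary proxy
        Proxy format: {"protocol": "IP address:port"}
        3. ProxyBasicAuthHandler(password manager object): private proxy that requires a username and password
        4. HTTPBasicAuthHandler(password manager object): web client authentication (HTTP Basic auth)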
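
For item 3, a minimal sketch of wiring a password manager into a proxy opener might look as follows. The proxy address proxy.example.com:8080 and the credentials are placeholders, not real values; nothing is sent over the network until opener.open() is called.

```python
import urllib.request

# Placeholder values for illustration only.
PROXY_ADDRESS = "proxy.example.com:8080"
PROXY_USER = "user"
PROXY_PASSWORD = "secret"

# The password manager stores credentials keyed by realm and URI.
# With HTTPPasswordMgrWithDefaultRealm, realm=None acts as a catch-all.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, PROXY_ADDRESS, PROXY_USER, PROXY_PASSWORD)

# ProxyHandler routes the request; ProxyBasicAuthHandler answers the
# proxy's 407 challenge using the stored credentials.
proxy_handler = urllib.request.ProxyHandler({"http": "http://" + PROXY_ADDRESS})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler(password_mgr)

opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
```

HTTPBasicAuthHandler (item 4) is set up the same way, except that the URI registered with add_password() is the target site rather than the proxy.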

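      B; A common proxy source: Xici free proxy IPs (http://www.xicidaili.com/)

        Free short-lived proxy sites classify their proxies as high-anonymity or transparent.

        [High anonymity]: the target server cannot trace your original IP.

        [Transparent]: the target server can see both the proxy IP and your original IP.

        Type: the protocol the proxy supports, HTTP or HTTPS.

        [Survival time]: the period during which the proxy is usable.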
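
The fields above map fairly directly onto the {"protocol": "IP address:port"} dictionaries that ProxyHandler expects. The sketch below assumes the listing has already been collected into dictionaries; the field names (ip, port, type, anonymity) and the sample addresses are hypothetical and do not come from any real proxy site.

```python
def to_proxy_mappings(records, wanted_anonymity="high"):
    """Keep proxies of the wanted anonymity level and convert each one
    into a ProxyHandler-style mapping such as {"http": "1.2.3.4:8080"}."""
    mappings = []
    for record in records:
        if record["anonymity"] != wanted_anonymity:
            continue
        scheme = record["type"].lower()  # "HTTP" -> "http", "HTTPS" -> "https"
        mappings.append({scheme: "%s:%s" % (record["ip"], record["port"])})
    return mappings


# Made-up sample data using documentation-only IP addresses.
sample_records = [
    {"ip": "192.0.2.10", "port": 8080, "type": "HTTP", "anonymity": "high"},
    {"ip": "192.0.2.11", "port": 3128, "type": "HTTPS", "anonymity": "transparent"},
    {"ip": "192.0.2.12", "port": 9999, "type": "HTTPS", "anonymity": "high"},
]

proxy_ip = to_proxy_mappings(sample_records)
```

The resulting proxy_ip list has the same shape as the one used in the proxy opener script, so random.choice(proxy_ip) can be passed straight to ProxyHandler.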

  • Original article: https://www.cnblogs.com/aaron456-rgv/p/12082658.html