  • An Analysis of the urllib2 Module in Python

    Name

    urllib2 - An extensible library for opening URLs using a variety of protocols

     

    1. Description

    The simplest way to use this module is to call the urlopen function, which accepts either a string containing a URL or a Request object. It opens the URL and returns the result as a file-like object.
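
    As a minimal sketch of the description above, the snippet below passes urlopen first a URL string and then a Request object, and reads the result like a file. To stay runnable without network access it opens a temporary local file through a file: URL. The temporary file, the fetch helper, and the Python 3 fallback import are assumptions added for illustration.

```python
import contextlib
import os
import tempfile

try:
    import urllib2                          # Python 2
    from urllib import pathname2url
except ImportError:                         # Python 3 fallback (assumption)
    import urllib.request as urllib2
    from urllib.request import pathname2url

# Create a small local file so the example needs no network access.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"hello urllib2\n")
url = "file:" + pathname2url(path)


def fetch(target):
    """Open a URL string or a Request object and return the body as bytes."""
    # closing() calls response.close() for us on both Python 2 and 3.
    with contextlib.closing(urllib2.urlopen(target)) as response:
        return response.read()              # file-like: read() returns the body


body_from_string = fetch(url)                     # urlopen(<str>)
body_from_request = fetch(urllib2.Request(url))   # urlopen(<Request>)
os.remove(path)
```

    Both calls return the same bytes, which is why the two access patterns in section 3 are interchangeable for simple requests.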

    2. Classes

        exceptions.IOError(exceptions.EnvironmentError)

            URLError

                HTTPError(URLError, urllib.addinfourl)

        AbstractBasicAuthHandler

            HTTPBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler)

            ProxyBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler)

        AbstractDigestAuthHandler

        BaseHandler

            AbstractHTTPHandler

                HTTPHandler

                HTTPSHandler

            FTPHandler

                CacheFTPHandler

            FileHandler

            HTTPCookieProcessor

            HTTPDefaultErrorHandler

            HTTPDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler)

            HTTPErrorProcessor

            HTTPRedirectHandler

            ProxyDigestAuthHandler(BaseHandler, AbstractDigestAuthHandler)

            ProxyHandler

            UnknownHandler

        HTTPPasswordMgr

            HTTPPasswordMgrWithDefaultRealm

        OpenerDirector

        Request
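
    The hierarchy above matters in practice mainly for error handling: because HTTPError is a subclass of URLError, it has to be caught first. The sketch below checks a few of the listed relationships and shows that ordering. The classify_failure helper and its return strings are hypothetical, and the Python 3 fallback import is an assumption.

```python
try:
    import urllib2                          # Python 2
except ImportError:                         # Python 3 fallback (assumption)
    import urllib.request as urllib2

# A few relationships taken from the class tree.
assert issubclass(urllib2.URLError, IOError)
assert issubclass(urllib2.HTTPError, urllib2.URLError)
assert issubclass(urllib2.HTTPHandler, urllib2.BaseHandler)
assert issubclass(urllib2.CacheFTPHandler, urllib2.FTPHandler)
assert issubclass(urllib2.HTTPPasswordMgrWithDefaultRealm,
                  urllib2.HTTPPasswordMgr)


def classify_failure(url, timeout=5):
    """Hypothetical helper: describe how opening `url` went."""
    try:
        urllib2.urlopen(url, timeout=timeout).close()
    except urllib2.HTTPError as err:        # subclass, so it must come first
        return "http error %d" % err.code
    except urllib2.URLError as err:         # no connection, bad host, missing file
        return "url error: %s" % (err.reason,)
    return "ok"
```

    Swapping the two except clauses would make the HTTPError branch unreachable, since URLError would match first.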

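    3. Two ways to fetch a web page

    Approach 1

      # Import the module
      import urllib2
      # Build the request. Full signature:
      # urllib2.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False)
      url = 'http://www.baidu.com'
      request = urllib2.Request(url)
      # Open the Request object; this returns the server's response object
      response = urllib2.urlopen(request)
      # Print the page source
      print response.read()

      Building a Request object first and then receiving the server's response to it keeps the logic clear and explicit.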
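
    One benefit of building the Request object first is that it can be inspected before anything is sent. The sketch below builds two requests and checks which HTTP method each would use. No network access is involved. The example.com URLs and form data are placeholders, and the Python 3 fallback import is an assumption.

```python
try:
    import urllib2                          # Python 2
except ImportError:                         # Python 3 fallback (assumption)
    import urllib.request as urllib2

# Without data the request is a GET.
get_request = urllib2.Request('http://www.example.com/search?q=python')

# Supplying data turns the request into a POST.
post_request = urllib2.Request('http://www.example.com/login',
                               data=b'username=demo')

methods = (get_request.get_method(), post_request.get_method())
full_url = get_request.get_full_url()
```

    Here methods is ('GET', 'POST'), which shows that the data argument alone decides the request method.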

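    Approach 2

      # Import the module
      import urllib2
      # Open the URL directly; this returns the server's response object. Full signature:
      # urllib2.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, cafile=None, capath=None, cadefault=False, context=None)
      url = 'http://www.baidu.com'
      response = urllib2.urlopen(url)
      # Print the page source
      print response.read()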

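    4. Setting headers

    Many servers and proxies inspect HTTP headers to manage traffic, balance load, and block abnormal clients. Setting the headers yourself is therefore often necessary for a request to succeed.
    The code is as follows: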
      import urllib
      import urllib2
      url = 'http://www.server.com/login'
      # Identify as a browser by sending a browser User-Agent string
      user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
      values = {'username' : 'cqc', 'password' : 'XXXX' }
      headers = { 'User-Agent' : user_agent }
      # URL-encode the form fields; passing data makes this a POST request
      data = urllib.urlencode(values)
      request = urllib2.Request(url, data, headers)
      response = urllib2.urlopen(request)
      page = response.read()


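    Here we build a headers dictionary and pass it in when constructing the Request, so the headers are sent along with the request. If the server recognizes the request as coming from a browser, it returns a response.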
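
    To check which headers a Request will carry before sending it, the object can be queried directly. The sketch below reuses the URL and User-Agent from the code above and adds one more header with add_header. The Referer value is hypothetical, and the Python 3 fallback import is an assumption. Note that Request stores header names in capitalized form, so 'User-Agent' is looked up as 'User-agent'.

```python
try:
    import urllib2                          # Python 2
    from urllib import urlencode
except ImportError:                         # Python 3 fallback (assumption)
    import urllib.request as urllib2
    from urllib.parse import urlencode

url = 'http://www.server.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
data = urlencode({'username': 'cqc'}).encode('ascii')

# Headers can be passed to the constructor ...
request = urllib2.Request(url, data, {'User-Agent': user_agent})
# ... or added afterwards (hypothetical Referer value).
request.add_header('Referer', 'http://www.server.com/')

# Names are normalized with str.capitalize(), hence 'User-agent'.
has_agent = request.has_header('User-agent')
sent_agent = request.get_header('User-agent')
header_names = sorted(name for name, _ in request.header_items())
```

    Nothing is sent here; the request is only constructed and inspected.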

    Common User-Agent strings

    1. Android

    Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19
    Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
    Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1

    2. Firefox

    Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0
    Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0

    3. Google Chrome

    Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36
    Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19

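    5. Setting a proxy server

    Routing requests through a proxy server keeps the target server from blocking your IP address. You can pick proxy IPs from a public proxy list page and switch to a different proxy at regular intervals. Proxy list URL: http://www.xicidaili.com/
    The code is as follows: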
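      import urllib2
      enable_proxy = True
      # Route HTTP traffic through the given proxy
      proxy_handler = urllib2.ProxyHandler({"http":"61.135.217.7:80"})
      # An empty mapping disables proxies entirely
      null_proxy_handler = urllib2.ProxyHandler({})
      if enable_proxy:
        opener = urllib2.build_opener(proxy_handler)
      else:
        opener = urllib2.build_opener(null_proxy_handler)
      # Install the opener globally so later urllib2.urlopen calls use it
      urllib2.install_opener(opener)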
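
    Switching proxies at intervals, as described above, can be sketched by cycling through a list and building a fresh opener for each proxy instead of installing one globally. The proxy addresses below are placeholders rather than working proxies, the next_opener and proxies_of helpers are hypothetical, and the Python 3 fallback import is an assumption.

```python
import itertools

try:
    import urllib2                          # Python 2
except ImportError:                         # Python 3 fallback (assumption)
    import urllib.request as urllib2

# Placeholder addresses; substitute proxies you are allowed to use.
PROXIES = ["127.0.0.1:8080", "127.0.0.1:8081"]
_proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return (proxy, opener) for the next proxy in the rotation."""
    proxy = next(_proxy_cycle)
    handler = urllib2.ProxyHandler({"http": proxy})
    return proxy, urllib2.build_opener(handler)


def proxies_of(opener):
    """Return the proxy mapping an opener was built with, or {} if none."""
    for handler in opener.handlers:
        if isinstance(handler, urllib2.ProxyHandler):
            return handler.proxies
    return {}

# Typical use: proxy, opener = next_opener(); opener.open(url, timeout=10)
```

    Calling opener.open directly leaves the global urlopen behavior untouched, which makes it easier to rotate proxies per request.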

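    6. Setting a timeout

    The third parameter of urlopen is timeout, which sets how many seconds to wait before giving up. It limits the impact of sites that respond too slowly.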
      import urllib2
      response = urllib2.urlopen('http://www.baidu.com', timeout=10)
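
    When the timeout expires, the failure can surface either as socket.timeout directly or as a URLError whose reason is a socket.timeout, depending on where the delay occurred. The sketch below treats both cases the same way. The is_timeout and fetch_with_timeout helpers are hypothetical names, and the Python 3 fallback import is an assumption.

```python
import socket

try:
    import urllib2                          # Python 2
except ImportError:                         # Python 3 fallback (assumption)
    import urllib.request as urllib2


def is_timeout(exc):
    """True if `exc` is a timeout, raised directly or wrapped in URLError."""
    if isinstance(exc, socket.timeout):
        return True
    return (isinstance(exc, urllib2.URLError)
            and isinstance(exc.reason, socket.timeout))


def fetch_with_timeout(url, timeout=10):
    """Return the page body, or None if the server was too slow."""
    try:
        return urllib2.urlopen(url, timeout=timeout).read()
    except (urllib2.URLError, socket.timeout) as exc:
        if is_timeout(exc):
            return None
        raise                               # other errors are not swallowed
```

    A call such as fetch_with_timeout('http://www.baidu.com', timeout=10) then returns None on a slow response instead of raising.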

  • Original article: https://www.cnblogs.com/windyrainy/p/10592594.html