【python】urllib2

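urllib2.urlopen(url[, data][, timeout])
Opens url and returns the response. The url argument can be either a string or a Request object.

Without data, the request is a GET; when data is supplied, the request is a POST. data must be in application/x-www-form-urlencoded format. If the parameters are held in a dict, pass the dict through urllib.urlencode() to produce the encoded string.

timeout sets how long, in seconds, a blocking operation such as the connection attempt may take. If it is omitted, the global default timeout is used. This parameter only takes effect for HTTP, HTTPS, and FTP connections.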

This function returns a file-like object with three additional methods:
- geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
- info() — return the meta-information of the page, such as headers, in the form of an mimetools.Message instance (see Quick Reference to HTTP Headers)
- getcode() — return the HTTP status code of the response
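
To make the return value concrete, here is a minimal sketch that opens a temporary local file through a file: URL, so it runs without network access. The temporary file and the Python 3 import fallback are assumptions added for illustration; they are not part of the original article.

```python
from __future__ import print_function
import os
import tempfile

try:
    import urllib2                          # Python 2, as used in this article
    from urllib import pathname2url
except ImportError:                         # assumption: fallback so the sketch runs on Python 3
    import urllib.request as urllib2
    from urllib.request import pathname2url

# Create a small local file to act as the resource being fetched.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"hello urllib2")

file_url = "file:" + pathname2url(path)

response = urllib2.urlopen(file_url)
body = response.read()                      # file-like: read() returns the raw bytes
final_url = response.geturl()               # URL of the resource actually retrieved
length = response.info().get("Content-Length")  # headers object returned by info()
print(body)
print(final_url)
print(length)
print(response.getcode())                   # no HTTP status exists for file: URLs
response.close()
os.remove(path)
```

For an HTTP URL the same three methods apply, and getcode() then reports the HTTP status of the response.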
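class OpenerDirector
Manages a chain of handlers. Each handler either knows how to open URLs for a particular protocol or implements one piece of request or response processing. Many of these handlers are described below.
urllib2.build_opener([handler, ...])
Returns an OpenerDirector instance that chains the given handlers. Each handler can be an instance of BaseHandler or a subclass of BaseHandler (in which case it must be possible to call its constructor without arguments). Python ships with many built-in handlers, and you can replace them or add your own.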

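The built-in handlers that build_opener() installs by default are:

ProxyHandler: sends requests through a proxy

UnknownHandler: raises URLError for URLs with an unknown protocol

HTTPHandler: handles HTTP GET and POST requests

HTTPDefaultErrorHandler: the default handler for HTTP error responses; every error response is turned into an HTTPError exception

HTTPRedirectHandler: follows HTTP redirects. 301, 302, and 303 responses are followed for GET, HEAD, and POST requests; 307 responses are followed only for GET and HEAD requests

FTPHandler: opens FTP URLs

FileHandler: opens local files

HTTPErrorProcessor: processes HTTP responses whose status is not in the 2xx range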

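Besides the handlers above, urllib2 provides other handlers whose purpose is mostly clear from their names. They include, but are not limited to:

HTTPCookieProcessor: handles HTTP cookies
HTTPBasicAuthHandler: handles HTTP Basic authentication with the remote host
ProxyBasicAuthHandler: handles Basic authentication with a proxy
HTTPDigestAuthHandler: handles HTTP Digest authentication with the remote host
ProxyDigestAuthHandler: handles Digest authentication with a proxy
HTTPSHandler: handles HTTPS requests
CacheFTPHandler: like FTPHandler, but caches open FTP connections to reduce delays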
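
As a rough illustration of the handler chain, the sketch below builds a default opener, lists the handler classes it ended up with, and triggers UnknownHandler with a made-up URL scheme. Nothing is sent over the network. The scheme foo and the host example.invalid are hypothetical, and the Python 3 import fallback is an assumption added so the block runs on either version.

```python
from __future__ import print_function

try:
    import urllib2                          # Python 2, as used in this article
except ImportError:                         # assumption: fallback so the sketch runs on Python 3
    import urllib.request as urllib2

# build_opener() with no arguments installs only the default handlers.
opener = urllib2.build_opener()
handler_names = sorted(type(h).__name__ for h in opener.handlers)
print(handler_names)

# No handler knows the hypothetical 'foo' scheme, so UnknownHandler raises URLError.
try:
    opener.open("foo://example.invalid/")
    error_message = None
except urllib2.URLError as exc:
    error_message = str(exc.reason)
print(error_message)
```

ProxyHandler may be missing from the printed list when no proxy settings are detected in the environment.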

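Using an opener with urllib2:
urllib2.install_opener(opener)
Installs an OpenerDirector instance as the default global opener. After this call, urllib2.urlopen() uses that opener, and therefore your own handlers, for every subsequent request. To use an opener without installing it globally, call its open() method directly.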
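
One way to see the effect of install_opener() without touching the network is to install an opener that contains a custom handler. The sketch below is only an illustration: EchoHandler and the echo scheme are hypothetical names invented here, and the Python 3 import fallback is an assumption.

```python
from __future__ import print_function
import io

try:
    import urllib2                          # Python 2, as used in this article
    from urllib import addinfourl
except ImportError:                         # assumption: fallback so the sketch runs on Python 3
    import urllib.request as urllib2
    from urllib.response import addinfourl


class EchoHandler(urllib2.BaseHandler):
    """Hypothetical handler for a made-up 'echo' URL scheme."""

    def echo_open(self, req):
        # OpenerDirector picks '<scheme>_open' methods based on the URL scheme.
        url = req.get_full_url()
        payload = ("echo: " + url).encode("ascii")
        return addinfourl(io.BytesIO(payload), {}, url)


opener = urllib2.build_opener(EchoHandler())
urllib2.install_opener(opener)              # from now on urlopen() uses this opener

response = urllib2.urlopen("echo://hello/")
body = response.read()
print(body)
response.close()
```

Because the opener is installed globally, the plain urllib2.urlopen() call reaches EchoHandler even though it never mentions the opener.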
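class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
url and data have the same meaning as for urlopen(). headers is a dict of HTTP headers to send with the request; for an overview of the available headers, see http://isilic.iteye.com/blog/1801072

origin_req_host is the request-host of the origin transaction as defined by RFC 2965; it defaults to the host of url. unverifiable indicates whether the request is unverifiable, also as defined by RFC 2965 (for example, a request for an image embedded in a page that the user had no chance to approve); it defaults to False.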
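
The rule that data decides between GET and POST can be checked on a Request object before anything is sent. This is a minimal sketch; the example.com URLs and the demo-agent/0.1 header value are placeholders, and the Python 3 import fallback is an assumption.

```python
from __future__ import print_function

try:
    import urllib2                          # Python 2, as used in this article
    from urllib import urlencode
except ImportError:                         # assumption: fallback so the sketch runs on Python 3
    import urllib.request as urllib2
    from urllib.parse import urlencode

# Encode the form fields as application/x-www-form-urlencoded bytes.
data = urlencode({"name": "Michael"}).encode("ascii")

get_req = urllib2.Request("http://www.example.com/")
post_req = urllib2.Request(
    "http://www.example.com/register",
    data,
    {"User-Agent": "demo-agent/0.1"},
)

print(get_req.get_method())                 # no data supplied
print(post_req.get_method())                # data supplied
print(post_req.get_header("User-agent"))    # header names are stored capitalized
```

Neither request is opened here; passing either object to urllib2.urlopen() would send it.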

Basic usage:

1) A simple GET request:

    import urllib2

    f = urllib2.urlopen('http://www.python.org/')
    print f.read(100)  # print the first 100 bytes of the response

2) A POST request built from a Request object:

    import urllib2

    # Passing data turns the request into a POST.
    req = urllib2.Request(url='https://localhost/cgi-bin/test.cgi',
                          data='Committed Data')
    f = urllib2.urlopen(req)
    print f.read()

3) A POST request with form data and a custom User-Agent header:

    import urllib
    import urllib2

    url = 'http://www.server.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name': 'Michael', 'language': 'Python'}
    headers = {'User-Agent': user_agent}

    data = urllib.urlencode(values)  # encode the dict as form data
    req = urllib2.Request(url, data, headers)
    f = urllib2.urlopen(req)
    print f.read()
Proxies are widely used. A crawler that always runs from a single address is easily banned, and routing requests through a proxy lowers that risk, so anyone who needs this should look closely at how urllib2 handles proxies:

    import urllib2

    # Route HTTP requests through the proxy listening on local port 80.
    proxy_handler = urllib2.ProxyHandler({'http': '127.0.0.1:80'})
    opener = urllib2.build_opener(proxy_handler)
    urllib2.install_opener(opener)
    f = urllib2.urlopen('http://www.google.com')
    print f.read()
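Note that install_opener() makes this proxy_handler the global ProxyHandler, which is not always what we want. If different requests need different proxies, keep the opener local and call its open() method instead:

    import urllib2

    url = 'http://www.google.com'
    proxy_handler = urllib2.ProxyHandler({'http': '127.0.0.1:80'})
    opener = urllib2.build_opener(proxy_handler)
    f = opener.open(url)  # only this opener uses the proxy
    print f.read()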
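
To show that openers built this way stay independent, the sketch below creates two openers with different proxies and inspects the ProxyHandler inside each one. No request is sent. The proxy addresses and the helper proxies_of() are hypothetical, and the Python 3 import fallback is an assumption.

```python
from __future__ import print_function

try:
    import urllib2                          # Python 2, as used in this article
except ImportError:                         # assumption: fallback so the sketch runs on Python 3
    import urllib.request as urllib2

# Two local openers, each with its own (hypothetical) proxy address.
opener_a = urllib2.build_opener(
    urllib2.ProxyHandler({"http": "http://127.0.0.1:8080"}))
opener_b = urllib2.build_opener(
    urllib2.ProxyHandler({"http": "http://127.0.0.1:8081"}))


def proxies_of(opener):
    """Return the proxy mapping of the ProxyHandler in an opener, if any."""
    for handler in opener.handlers:
        if isinstance(handler, urllib2.ProxyHandler):
            return handler.proxies
    return {}


proxies_a = proxies_of(opener_a)
proxies_b = proxies_of(opener_b)
print(proxies_a)
print(proxies_b)
```

Since neither opener is installed globally, a later urllib2.urlopen() call is not affected by either proxy setting.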
Using multiple proxies:

    import urllib2

    proxyList = (
        '211.167.112.14:80',
        '210.32.34.115:8080',
        '115.47.8.39:80',
        '211.151.181.41:80',
        '219.239.26.23:80',
    )

    for proxy in proxyList:
        # The dict key is the URL scheme the proxy applies to.
        proxies = {'http': proxy}
        proxy_handler = urllib2.ProxyHandler(proxies)
        opener = urllib2.build_opener(proxy_handler)
        f = opener.open('http://www.cc98.org')
        print f.read()
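Cookies are also handled automatically by a handler. Because HTTP is a stateless protocol, how does the server know whether the user behind the current request has already logged in? There are two approaches:
1. Pass a session ID explicitly in the URI.
2. Use a cookie. Roughly, after you log in to a site, a cookie is stored locally; as you continue browsing the site, the browser sends the cookie along with each request.

    import urllib2
    import cookielib

    cookies = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookies))
    response = opener.open('http://www.google.com')

    # Cookies set by the response are now stored in the jar.
    for cookie in cookies:
        if cookie.name == 'cookie_spec':
            print cookie.value

Cookie handling normally combines cookielib with HTTPCookieProcessor, where HTTPCookieProcessor is the handler.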

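The cookielib module defines classes for automatic handling of HTTP cookies, for accessing sites that require cookie data. It includes CookieJar, FileCookieJar, CookiePolicy, DefaultCookiePolicy, Cookie, and the FileCookieJar subclasses MozillaCookieJar and LWPCookieJar.

A CookieJar object manages HTTP cookies: it adds cookies to outgoing HTTP requests and extracts them from HTTP responses. A FileCookieJar object additionally loads cookies from, and saves them to, a file. MozillaCookieJar creates FileCookieJar instances compatible with the Mozilla browser's cookies.txt format, and LWPCookieJar creates instances compatible with the Set-Cookie3 file format of libwww-perl. Cookie files saved by LWPCookieJar are easy for humans to read.

FileCookieJar itself does not implement save() (calling it raises NotImplementedError), whereas MozillaCookieJar and LWPCookieJar do. Use one of those two when cookies need to be saved to disk.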
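
Saving and reloading cookies with LWPCookieJar can be sketched without contacting a server by placing a hand-made cookie in the jar. The cookie name, value, and domain below are invented for illustration, make_cookie() is a hypothetical helper, and the Python 3 import fallback is an assumption. In real use the jar would be filled by HTTPCookieProcessor rather than by hand.

```python
from __future__ import print_function
import os
import tempfile

try:
    import cookielib                        # Python 2, as used in this article
except ImportError:                         # assumption: fallback so the sketch runs on Python 3
    import http.cookiejar as cookielib


def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper for this sketch)."""
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


fd, cookie_file = tempfile.mkstemp(suffix=".lwp")
os.close(fd)

jar = cookielib.LWPCookieJar(cookie_file)
jar.set_cookie(make_cookie("session_id", "abc123", "www.example.com"))
# Session cookies are marked 'discard', so they are skipped unless asked for.
jar.save(ignore_discard=True)

restored = cookielib.LWPCookieJar(cookie_file)
restored.load(ignore_discard=True)
restored_values = dict((c.name, c.value) for c in restored)
print(restored_values)
os.remove(cookie_file)
```

Swapping LWPCookieJar for MozillaCookieJar should give the same round trip in the cookies.txt format.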

Using Basic HTTP Authentication:

    import urllib2

    # Register the credentials for a given realm and URI.
    auth_handler = urllib2.HTTPBasicAuthHandler()
    auth_handler.add_password(realm='PDQ Application',
                              uri='https://mahler:8092/site-updates.py',
                              user='klem',
                              passwd='kadidd!ehopper')
    opener = urllib2.build_opener(auth_handler)
    urllib2.install_opener(opener)
    f = urllib2.urlopen('http://www.server.com/login.html')
    print f.read()
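
When the realm is not known in advance, one option is HTTPPasswordMgrWithDefaultRealm, which accepts None as a catch-all realm. The sketch below only checks which credentials would be selected for a URL and sends nothing. The password, the second host name, and the realm string are placeholders, and the Python 3 import fallback is an assumption.

```python
from __future__ import print_function

try:
    import urllib2                          # Python 2, as used in this article
except ImportError:                         # assumption: fallback so the sketch runs on Python 3
    import urllib.request as urllib2

mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# realm=None registers the credentials for any realm under this URI prefix.
mgr.add_password(None, "https://mahler:8092/", "klem", "not-a-real-password")

auth_handler = urllib2.HTTPBasicAuthHandler(mgr)
opener = urllib2.build_opener(auth_handler)

# Look up which credentials would be used for a given realm and URL.
match = mgr.find_user_password("Any Realm", "https://mahler:8092/site-updates.py")
miss = mgr.find_user_password("Any Realm", "https://other-host/")
print(match)
print(miss)
```

The opener sends these credentials only after the server answers with a 401 challenge for a matching URI.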
References:

http://isilic.iteye.com/blog/1806403

http://www.devba.com/index.php/archives/4605.html

Original article: https://www.cnblogs.com/ljygoodgoodstudydaydayup/p/3861318.html