The urllib library is Python's built-in HTTP request library. It contains the following modules (a quick sketch combining them follows the list):
- urllib.request: the request module
- urllib.error: the exception-handling module
- urllib.parse: the URL-parsing module
- urllib.robotparser: the robots.txt-parsing module
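A minimal sketch that touches all four modules (httpbin.org is used here purely as a test endpoint; any reachable site works):

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.parse builds the query string, urllib.request sends the request,
# urllib.error catches failures, urllib.robotparser reads robots.txt
url = 'http://httpbin.org/get?' + urllib.parse.urlencode({'q': 'python'})
try:
    response = urllib.request.urlopen(url, timeout=5)
    print(response.status)
except urllib.error.URLError as e:
    print(e.reason)

rp = urllib.robotparser.RobotFileParser('http://httpbin.org/robots.txt')
rp.read()
print(rp.can_fetch('*', url))
```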
1. urllib.request
1.1 The urlopen function
1) Return type of urlopen
- urlopen returns an http.client.HTTPResponse object; calling its read() method yields bytes, which must be decoded before the content can be viewed.
```python
# The urlopen function
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# read() returns bytes, so it still needs to be decoded as utf-8
print(response.read().decode('utf-8'))
```
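The response object also supports the context-manager protocol, so it can be used in a `with` statement that closes the connection automatically; a small usage sketch:

```python
import urllib.request

# the with statement closes the underlying connection when the block exits
with urllib.request.urlopen('http://www.baidu.com') as response:
    html = response.read().decode('utf-8')
print(html[:100])  # first 100 characters
```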
2) Parameters of urlopen
- The first parameter of urlopen is the URL, the second is the data to send with the request, and the third is the timeout.
- If the second parameter is passed, the request is submitted as a POST, and the data must be passed as a bytes object.
```python
import urllib.parse
import urllib.request

# If the second parameter (data) is passed, the request is submitted as a POST;
# the data must be a bytes object
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# The third parameter is the timeout: if no response arrives within this time,
# an exception is raised
response = urllib.request.urlopen('http://httpbin.org/post', data=data, timeout=2)
print(response.read().decode('utf-8'))
```
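Conversely, when no data argument is given the request is sent as a GET; any parameters then belong in the URL's query string, which urllib.parse.urlencode can build. A sketch against httpbin's /get endpoint:

```python
import urllib.parse
import urllib.request

# without data, urlopen issues a GET; parameters are encoded into the URL itself
params = urllib.parse.urlencode({'word': 'hello'})
response = urllib.request.urlopen('http://httpbin.org/get?' + params, timeout=2)
print(response.read().decode('utf-8'))
```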
- Timeout handling:
```python
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # a timeout surfaces as a URLError whose reason is a socket.timeout
    if isinstance(e.reason, socket.timeout):
        print("timed out")
```
1.2 Responses
```python
# Inspecting the response
import urllib.request

response = urllib.request.urlopen('http://cnblogs.com/hgzero')
print(type(response))                       # http.client.HTTPResponse
print(response.status)                      # status code
print(response.getheaders())                # response headers, a list of tuples
print(response.getheader('Content-Type'))   # note: getheader here, without the trailing s
```
1.3 Requests
```python
# Building a request
import urllib.request

request = urllib.request.Request('http://cnblogs.com/hgzero')
response = urllib.request.urlopen(request)  # pass the Request object to urlopen
print(response.read().decode('utf-8'))
```
- Customizing HTTP headers and the data payload of a request:
```python
# Approach 1: build a Request object that carries custom headers and data,
# then pass the Request object to urlopen
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
params = {'name': 'Germey'}
data = bytes(parse.urlencode(params), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
```python
# Approach 2: call the Request object's add_header method to add HTTP headers
from urllib import request, parse

url = 'http://httpbin.org/post'
params = {'name': 'Germey'}
data = bytes(parse.urlencode(params), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')  # add one header
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
1.4 Proxies
```python
import urllib.request

# route traffic through a local proxy listening on port 25379
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:25379',
    # 'https': 'https://127.0.0.1:25379'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.youtube.com/')
print(response.read().decode('utf-8'))
```
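To apply the proxy to every subsequent plain urlopen call rather than just this one opener, the opener can be installed globally with install_opener; a sketch reusing the same local proxy address:

```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:25379'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # from now on, plain urlopen() also goes through the proxy
response = urllib.request.urlopen('http://httpbin.org/ip')
print(response.read().decode('utf-8'))
```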
1.5 Cookie
```python
import http.cookiejar
import urllib.request

# capture the cookies set by the server
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
```
- Saving and loading cookies:
```python
import http.cookiejar
import urllib.request

# Save cookies in the MozillaCookieJar format
filename = "cookie_first.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# Save cookies in the LWPCookieJar format; either format is fine
filename = "cookie_second.txt"
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# Load the saved cookies for the next request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie_second.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
2. urllib.error
```python
from urllib import request, error

try:
    response = request.urlopen('http://hgzerowzhpray.com')
except error.HTTPError as e:
    # HTTPError is the narrower error (and a subclass of URLError), so catch it first
    print(e.reason, e.code, e.headers, sep=' ')
except error.URLError as e:
    # URLError is the broader error
    print(e.reason)
else:
    print('Request Successfully')
```
3. urllib.parse
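- The most frequently used helpers are urlparse (split a URL into components), urlunparse (reassemble one), urljoin (resolve a link against a base URL), and urlencode (turn a dict into a query string). A minimal sketch:

```python
from urllib.parse import urlparse, urlunparse, urljoin, urlencode

# urlparse splits a URL into scheme, netloc, path, params, query, fragment
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path)

# urlunparse reassembles the six components into a URL
print(urlunparse(['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']))

# urljoin resolves a (possibly relative) link against a base URL
print(urljoin('http://www.baidu.com', 'FAQ.html'))

# urlencode turns a dict into a query string
print(urlencode({'name': 'Germey', 'age': 22}))
```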
4. urllib.robotparser
- Rarely used in practice, so it can safely be skipped (a minimal sketch follows for completeness).
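A minimal usage sketch (baidu.com as an arbitrary example site):

```python
import urllib.robotparser

# parse a site's robots.txt and check whether a URL may be fetched
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.baidu.com/'))
```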