Preface
The following covers urllib. For a deeper understanding, the official documentation is strongly recommended.
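Urllib
- urllib is Python's built-in HTTP request library. It includes the following modules:
- urllib.request: the request module
- urllib.error: the exception handling module
- urllib.parse: the URL parsing module
- urllib.robotparser: the robots.txt parsing module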
urlopen
urlopen is suitable for simple requests that do not need custom header information.
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- The parameters set most often are url, data, and timeout. Here is the code:
import socket
import urllib.error
import urllib.parse
import urllib.request

# 1. url: the address to open; http://httpbin.org is used here for testing.
# 2. data: required for a POST request. It must be bytes, so the form dict is
#    URL-encoded with urllib.parse.urlencode() and then converted with bytes().
# 3. timeout: a timeout in seconds; an exception is raised when it expires.
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
try:
    response = urllib.request.urlopen(url='http://httpbin.org/post', data=data, timeout=5)
    print(response.read())
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('Timed out...')
    else:
        # Report other URL errors instead of silently ignoring them.
        print(e.reason)
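To see what actually ends up in the data parameter, the encoding step can be run on its own without sending a request. This is a minimal sketch; the extra form field `page` is made up for illustration and is not part of the example above.

```python
import urllib.parse

# Hypothetical form fields; 'page' is an assumption added for illustration.
form = {'word': 'hello', 'page': 2}

# urlencode() turns the dict into a URL-encoded query string (str) ...
query = urllib.parse.urlencode(form)
# ... and encode() turns that string into the bytes object urlopen() expects.
data = query.encode('utf8')

print(query)
print(data)
```

Passing a plain str as data would raise a TypeError inside urlopen, which is why the conversion to bytes is needed.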
- The response returned by urllib.request.urlopen
urlopen returns an http.client.HTTPResponse object, for example <http.client.HTTPResponse object at 0x0331E430>. response.read() returns the content of the response body.
import urllib.request
response = urllib.request.urlopen(url='https://www.baidu.com/')
print(type(response))
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
"""
<class 'http.client.HTTPResponse'>
200
[('Accept-Ranges', 'bytes'), ('Cache-Control', 'no-cache'), ......]
BWS/1.1
"""
Request
To set header information on a request, use Request. The following mainly shows how to add request headers:
from urllib import request, parse

url = 'http://httpbin.org/post'
# Option 1: build a headers dict and pass it to Request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Host': 'httpbin.org'
}
# Option 2: req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36')
# Named form_data rather than dict to avoid shadowing the built-in dict type.
form_data = {
    'word': 'hello'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
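The second way of adding a header, add_header, can be tried without sending anything, because a Request object can be inspected before it is passed to urlopen. This is a small sketch; the shortened User-Agent string is an assumption used only to keep the example readable.

```python
from urllib import request, parse

url = 'http://httpbin.org/post'
data = parse.urlencode({'word': 'hello'}).encode('utf8')

# Build the request without headers, then attach one with add_header().
req = request.Request(url=url, data=data)
req.add_header('User-Agent', 'Mozilla/5.0')  # shortened UA string, for illustration only

# Request stores header names capitalized, so the key becomes 'User-agent'.
print(req.get_header('User-agent'))
# With data present and no explicit method, the request defaults to POST.
print(req.get_method())
print(req.full_url)
```

Inspecting the request this way is a convenient check that headers and method are what you expect before any network traffic happens.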
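Advanced usage
- Proxy: ProxyHandler
Some websites limit the number or frequency of requests per IP address, so the crawler needs to be able to switch IPs at any time to avoid failing and stopping.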
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
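Switching IPs usually means choosing a different proxy for each opener. The sketch below shows one possible way to do that. The proxy pool and the helper name `build_proxy_opener` are hypothetical, and the addresses are placeholders, not working proxies.

```python
import random
import urllib.request

# Hypothetical proxy pool; replace with proxies you actually control.
PROXY_POOL = ['http://127.0.0.1:9743', 'http://127.0.0.1:9744']

def build_proxy_opener(proxy_url):
    """Return an opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler)

# Pick a proxy at random and build an opener for it.
proxy_url = random.choice(PROXY_POOL)
opener = build_proxy_opener(proxy_url)
# opener.open('http://httpbin.org/get') would now be sent through proxy_url.
```

Building a fresh opener per proxy keeps the choice explicit. urllib.request.install_opener(opener) could be used instead if plain urlopen calls should also go through the proxy.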
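- Cookies: HTTPCookieProcessor
Cookies hold login information; storing them in an http.cookiejar object makes them easy to reuse. More often, for websites whose cookies are hard to obtain, we use selenium to get the cookies and then crawl by some other, more efficient means.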
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)
"""
BAIDUID=68AF7F00874AE2D8206AC4B524B49EAB:FG=1
BIDUPSID=68AF7F00874AE2D8206AC4B524B49EAB
H_PS_PSSID=1451_21090_18559_29064_28519_29098_28836_28584_26350
PSTM=1558969682
delPer=0
BDSVRTM=0
BD_HOME=0
"""
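Exception handling
Handling exceptions keeps errors such as 404 or 500 from stopping the crawler.
There are two exception classes, URLError and HTTPError; HTTPError is a subclass of URLError.
- URLError: reason
- HTTPError: code, reason, headers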
from urllib import request, error

try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    # Catch the more specific HTTPError before its parent class URLError.
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("request successfully")
"""
Not Found
404
Date: Mon, 27 May 2019 15:12:43 GMT
Server: Apache
Vary: Accept-Encoding
Content-Length: 207
Connection: close
Content-Type: text/html; charset=iso-8859-1
"""
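Utility module: urlparse
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
The scheme argument is only a default: it is used when the URL string itself has no scheme, which is why the output below still shows http.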
from urllib.parse import urlparse
o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html', scheme='https')
print(o)
print(o.scheme, o.port, o.geturl())
"""
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
http 80 http://www.cwi.nl:80/%7Eguido/Python.html
"""
Personal blog: Loak 正 - a personal blog focused on artificial intelligence and the internet
Article: Python Crawler (2) - Python 3 Built-in Module urllib