The urllib library is Python's built-in HTTP request library and needs no extra installation. It contains 4 modules, of which the first three are the most commonly used:
- request: the HTTP request module; it simulates sending a request, needing only a URL plus optional extra parameters to carry out the whole request process
- error: the exception handling module
- parse: utilities for encoding, parsing, and joining URLs and their parameters
- robotparser: parses the robots protocol (crawler protocol / robots protocol / Robots Exclusion Protocol).
robots.txt is usually placed in the site's root directory and tells crawlers and search engines which pages may be crawled and which may not; a robotparser usage sketch follows the format example below.
```
# rough robots.txt format
User-agent: *
Disallow: /
Allow: /public/
```
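urllib.robotparser can read this file and answer whether a given User-agent is allowed to fetch a given path. A minimal sketch (the python.org URLs are just an illustrative target):

```python
from urllib.robotparser import RobotFileParser

# point the parser at the site's robots.txt and download it
rp = RobotFileParser()
rp.set_url("https://www.python.org/robots.txt")
rp.read()

# True/False: may this User-agent fetch this URL?
print(rp.can_fetch("*", "https://www.python.org/about/"))
```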
Request
I. Two common ways of making a request with urllib.request.urlopen
1. Call urllib.request.urlopen(url, data, timeout, ...) directly; data must be bytes
```python
import urllib.parse
import urllib.request

# no extra arguments
response = urllib.request.urlopen("https://www.python.org")

# POST a byte stream
data = bytes(urllib.parse.urlencode({"word": "test"}), encoding="utf8")
response2 = urllib.request.urlopen("http://httpbin.org/post", data=data, timeout=3)
```
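urlopen() returns an http.client.HTTPResponse object. A short sketch of inspecting it, reusing the python.org request above:

```python
import urllib.request

response = urllib.request.urlopen("https://www.python.org")
print(type(response))                          # <class 'http.client.HTTPResponse'>
print(response.status)                         # status code, e.g. 200
print(response.getheaders())                   # list of all response headers
print(response.getheader('Server'))            # a single header value
print(response.read().decode('utf-8')[:100])   # first 100 characters of the body
```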
2. Build the request with the urllib.request.Request(url, data, headers, method) class; data is a byte stream, headers is a dict
```python
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE5.5; Windows NT)',
    'Host': 'httpbin.org'
}
params = {'name': 'tester'}
data = bytes(urllib.parse.urlencode(params), encoding="utf8")
request = urllib.request.Request(url=url, data=data, headers=headers, method="POST")
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))
```
II. Handlers: logins, Cookies, proxies, etc.
The urllib.request.BaseHandler class is the parent class of all Handlers. Commonly used subclasses:

| Handler | Purpose |
| --- | --- |
| HTTPDefaultErrorHandler | handles the HTTPError exceptions raised for HTTP error responses |
| HTTPRedirectHandler | handles redirects |
| HTTPCookieProcessor | handles Cookies |
| ProxyHandler | sets a proxy; the default proxy is empty |
| HTTPPasswordMgr | manages passwords; it maintains a table of usernames and passwords |
| HTTPBasicAuthHandler | manages authentication; use it when opening a connection requires credentials |
The process of completing a request through, e.g., a proxy: use the appropriate Handlers from urllib.request to build an Opener, then call Opener.open(url) to make the request.
```python
# construction steps:
# 1. create a handler
# 2. opener = urllib.request.build_opener(handler)
# 3. opener.open(url)
```
Opener refers to the OpenerDirector class; urlopen() is itself just a simple Opener provided by urllib, and building your own Opener unlocks deeper, more advanced configuration.
```python
# Example: the test page requires a username/password, so HTTPBasicAuthHandler is needed
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = "tester"
password = "testerpw"
url = "http://km******.test.mararun.com/"

pwMsg = HTTPPasswordMgrWithDefaultRealm()
pwMsg.add_password(None, url, username, password)
handlerAuth = HTTPBasicAuthHandler(pwMsg)
opener = build_opener(handlerAuth)

try:
    response = opener.open(url)
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
```
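ProxyHandler from the table above follows the same pattern. A minimal sketch, assuming a local proxy listening on 127.0.0.1:9743 (a placeholder address; substitute a real proxy):

```python
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# placeholder proxy address, assumed to be running locally
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
```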
```python
'''
Cookie handling: declare an http.cookiejar.CookieJar object, build a Handler from it with
HTTPCookieProcessor, build an Opener with build_opener(), then call open() to make the request
'''
import http.cookiejar, urllib.request

# loop over the cookie jar and print each key-value pair
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)

# save the cookie data to a text file
filename = "cookies.txt"
# two file formats
# cookie = http.cookiejar.MozillaCookieJar(filename)
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_discard=True, ignore_expires=True)

# load and reuse the cookies, using the LWPCookieJar format as an example
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
print(response.status)
```
Error
URLError inherits from OSError and is the base class of the error module; every exception raised by the request module can be caught with it, and its reason attribute returns the cause of the error.
HTTPError is a subclass of URLError used specifically for HTTP request errors. It has three attributes: code (the status code), reason (the cause of the error), and headers (the response headers).
```python
# Example 1: reason is a string
from urllib import request, error

try:
    response = request.urlopen("http://testerror.com/index.html")
except error.HTTPError as e:
    print(e.code, e.reason, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("Request Successfully")

# Example 2: reason is an object; for instance on a timeout it is a socket.timeout
# instance, so isinstance() can be used to check its type
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("https://www.baidu.com", timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print("Time Out!")
```
Parse: urllib.parse
1. urlparse(): splits a URL into 6 parts and returns a <class 'urllib.parse.ParseResult'>: ParseResult(scheme, netloc, path, params, query, fragment)
2. urlunparse(): assembles a URL; accepts only an iterable of length 6
3. urlsplit(): splits a URL into 5 parts and returns a <class 'urllib.parse.SplitResult'>: SplitResult(scheme, netloc, path, query, fragment); params is not parsed separately but kept as part of path
4. urlunsplit(): assembles a URL; accepts only an iterable of length 5
5. urljoin(): joins two links; it parses the scheme, netloc, and path of the base URL and uses them to fill in whatever is missing from the second link
6. urlencode(): serializes parameters, dict -> URL query string
7. parse_qs(): deserializes, URL query string -> dict
8. parse_qsl(): deserializes, URL query string -> list of tuples
9. quote(): converts content to URL-encoded form, preventing garbled output when a URL contains Chinese characters
10. unquote(): performs URL decoding
```python
# 1. urlparse()
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
'''
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html',
            params='user', query='id=5', fragment='comment')
'''

# scheme supplies the default protocol; with allow_fragments=False the fragment is
# parsed as part of the nearest of path, params, or query
result = urlparse('www.baidu.com/index.html;user?id=5#comment',
                  scheme='https', allow_fragments=False)
'''
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html',
            params='user', query='id=5#comment', fragment='')
'''
```
```python
# 2. urlunparse()
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
'''
http://www.baidu.com/index.html;user?a=6#comment
'''
```
```python
# 3. urlsplit()
from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)
'''
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user',
            query='id=5', fragment='comment')
'''
```
```python
# 4. urlunsplit()
from urllib.parse import urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))
'''
http://www.baidu.com/index.html?a=6#comment
'''
```
```python
# 5. urljoin()
from urllib.parse import urljoin

print(urljoin("http://www.baidu.com/about.html?wd=abc", "http://test/index.php"))
'''
http://test/index.php
'''
```
```python
# 6. urlencode()
from urllib.parse import urlencode

params = {
    "name": "test",
    "age": 30
}
base_url = "http://www.baidu.com?"
url = base_url + urlencode(params)
print(url)
'''
http://www.baidu.com?name=test&age=30
'''
```
```python
# 7. parse_qs()
from urllib.parse import parse_qs

query = "name=test&age=22"
print(parse_qs(query))
'''
{'name': ['test'], 'age': ['22']}
'''
```
```python
# 8. parse_qsl()
from urllib.parse import parse_qsl

query = "name=test&age=22"
print(parse_qsl(query))
'''
[('name', 'test'), ('age', '22')]
'''
```
```python
# 9. quote()
from urllib.parse import quote

keyword = "测试"
url = "https://www.baidu.com/s?wd=" + quote(keyword)
print(url)
'''
https://www.baidu.com/s?wd=%E6%B5%8B%E8%AF%95
'''
```
```python
# 10. unquote()
from urllib.parse import unquote

url = "https://www.baidu.com/s?wd=%E6%B5%8B%E8%AF%95"
print(unquote(url))
'''
https://www.baidu.com/s?wd=测试
'''
```
Reference: 静觅 » Python3网络爬虫开发实战 (Python 3 Web Crawler Development in Practice)