[Python3 Web Scraping] 1 - Using the urllib Library
Introduction to the built-in modules
urllib is Python's built-in HTTP request library. It consists of four modules:
- error: the exception handling module. If a request fails, we can catch the exception and then retry or take other action so the program does not terminate unexpectedly.
- parse: a utility module that provides many URL handling methods, such as splitting, parsing, and joining; see the short sketch after this list.
- request: the most basic HTTP request module, used to simulate sending a request. Just as you type a URL into a browser and hit Enter, you pass a URL (plus any extra parameters) to the library's functions to simulate that process.
- robotparser: mainly used to read a site's robots.txt file and determine which pages may be crawled and which may not. In practice it is rarely used.
(urllib also ships a response module, but it is used internally by request and is not normally imported directly.)
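A quick sketch of the parse and robotparser modules (baidu.com is used purely as an example target):

from urllib.parse import urlparse, urljoin
from urllib.robotparser import RobotFileParser

# Split a URL into scheme, host, path, query, fragment, ...
result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(result.scheme, result.netloc, result.path)

# Resolve a relative link against a base URL
print(urljoin("http://www.baidu.com/about.html", "FAQ.html"))

# Fetch robots.txt and ask whether a page may be crawled
rp = RobotFileParser()
rp.set_url("http://www.baidu.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://www.baidu.com/s?wd=python"))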
Setting request headers
from urllib.request import urlopen
from urllib.request import Request

url = "http://www.baidu.com"
# Without a custom User-Agent, urllib's default UA is easily flagged as a bot
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/78.0.3904.97 Mobile Safari/537.36'
}
request = Request(url, headers=headers)
response = urlopen(request)
info = response.read()
# Note: the header name must be written exactly as 'User-agent' here
print(request.get_header('User-agent'))
print(info)
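Why only 'User-agent' works: Request normalizes header names with str.capitalize() when storing them (first letter upper-case, everything else lower-case), and get_header() looks the name up verbatim, so get_header('User-Agent') or get_header('user-agent') would return None.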
Request headers: fetching a User-Agent with fake_useragent
from fake_useragent import UserAgent
ua = UserAgent()
print(ua.chrome)
print(ua.opera)
print(ua.firefox)
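In practice ua.random is the most common choice, since it picks a different User-Agent on each access and makes the traffic harder to fingerprint. Note that depending on the installed fake_useragent version, some browser attributes (such as opera) may no longer be available.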
Proxy settings
Free proxy sites:
https://www.kuaidaili.com/free/
https://www.xicidaili.com/nt/
from urllib.request import Request
from urllib.request import build_opener
from urllib.request import ProxyHandler
from fake_useragent import UserAgent

url = "http://httpbin.org/get"
headers = {
    'User-Agent': UserAgent().chrome
}
request = Request(url, headers=headers)
# Route http:// traffic through the proxy (IP:port taken from a free proxy list)
handler = ProxyHandler({
    "http": "112.95.23.90:8888"
})
opener = build_opener(handler)
response = opener.open(request)
print(response.read().decode())
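Two caveats: free proxies such as the address above go stale quickly, so expect this exact IP to have died; and because the ProxyHandler dict only contains an "http" key, https:// URLs will bypass the proxy entirely. Add an "https" entry pointing at the same proxy to cover both schemes.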
Setting cookies
from urllib.request import HTTPCookieProcessor
from urllib.request import build_opener
from urllib.request import Request
from http.cookiejar import MozillaCookieJar
from fake_useragent import UserAgent

# Save the cookies returned by the server to a file
def get_cookie():
    url = "http://baidu.com"
    headers = {
        'User-Agent': UserAgent().chrome
    }
    request = Request(url, headers=headers)
    cookie_jar = MozillaCookieJar()
    handler = HTTPCookieProcessor(cookie_jar)
    opener = build_opener(handler)
    opener.open(request)  # cookies set in the response are captured by the jar
    cookie_jar.save("cookie.txt", ignore_expires=True, ignore_discard=True)

if __name__ == '__main__':
    get_cookie()
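The saved file can later be loaded back into a jar so that subsequent requests carry the same cookies. A minimal sketch (use_cookie is a hypothetical helper; the file name matches the one saved above):

def use_cookie():
    url = "http://baidu.com"
    headers = {
        'User-Agent': UserAgent().chrome
    }
    request = Request(url, headers=headers)
    cookie_jar = MozillaCookieJar()
    # Reload the cookies persisted by get_cookie()
    cookie_jar.load("cookie.txt", ignore_expires=True, ignore_discard=True)
    handler = HTTPCookieProcessor(cookie_jar)
    opener = build_opener(handler)
    response = opener.open(request)
    print(response.read().decode())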
Exception handling: URLError
from urllib.request import Request, urlopen
from urllib.error import URLError
from fake_useragent import UserAgent

url = "https://missj.top/adda"
headers = {
    'User-Agent': UserAgent().chrome
}
try:
    req = Request(url, headers=headers)
    resp = urlopen(req)
    print(resp.read().decode())
except URLError as e:
    if e.args == ():
        # HTTPError (a URLError subclass) has empty args and carries the status code
        print(e.code)
    else:
        # A plain URLError wraps the underlying socket error in args[0]
        print(e.args[0].errno)
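The args-based check above works because HTTPError is constructed with empty args, but a more readable idiom is to catch HTTPError before URLError, since HTTPError is a subclass. Reusing the url and headers defined above:

from urllib.error import HTTPError, URLError

try:
    resp = urlopen(Request(url, headers=headers))
    print(resp.read().decode())
except HTTPError as e:
    # The server responded, but with an error status (404, 500, ...)
    print("HTTP status:", e.code)
except URLError as e:
    # The request never reached a server (DNS failure, refused connection, ...)
    print("Failed to reach the server:", e.reason)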