In Python 3, urllib and urllib2 were reorganized into urllib.request; a page fetched with it comes back as bytes, so you must call decode() on the data after reading it.
urllib.quote became urllib.parse.quote() for percent-encoding (for example, Chinese characters), and urllib.parse.unquote() decodes it back; both work on a single string.
To encode a whole dict of query parameters into a query string, use urllib.parse.urlencode().
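A minimal sketch of the difference, using only the standard library (the sample text is just an illustration):

import urllib.parse

word = "你好"
encoded = urllib.parse.quote(word)             # '%E4%BD%A0%E5%A5%BD' - a str, not bytes
print(encoded)
print(urllib.parse.unquote(encoded))           # back to '你好'

params = urllib.parse.urlencode({"wd": word})  # 'wd=%E4%BD%A0%E5%A5%BD'
print(params)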
To get past basic server checks, it is common to change parts of the request headers so the request looks like it comes from a browser.
Setting User-Agent to a real browser string is the most important step; also avoid advertising gzip in Accept-Encoding, otherwise the response body arrives compressed and has to be decompressed before it can be decoded.
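A minimal sketch of why accepting gzip complicates things (using the same URL as the example below, and assuming the page is UTF-8; urllib does not decompress the body for you):

import gzip
import urllib.request

request = urllib.request.Request(
    "http://www.hao123.com/",
    headers={"Accept-Encoding": "gzip"}
)
response = urllib.request.urlopen(request)
data = response.read()
# if the server actually compressed the body, we must decompress it by hand
if response.headers.get("Content-Encoding") == "gzip":
    data = gzip.decompress(data)
print(data.decode("utf-8"))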
The response object returned by urlopen() offers a few useful inspection methods:
print(response.getcode())  # status code
print(response.geturl())   # the URL that was actually fetched
print(response.info())     # the server's response headers
import urllib.request

url = "http://www.hao123.com/"
ua_headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}

# build a Request object for the target URL, carrying the spoofed headers
request = urllib.request.Request(url=url, headers=ua_headers)
# send the request and get back a response object
response = urllib.request.urlopen(request)

# read() returns the page as bytes
html = response.read()

print(html.decode("utf-8"))

print(response.getcode(), response.geturl())
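If the request succeeds, the last line prints the status code and the URL that was actually fetched after any redirects, e.g. 200 http://www.hao123.com/.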
******************************************************************************************
About the User-Agent string:
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
Mozilla/5.0: general browser token
Windows NT 10.0; Win64; x64: operating system and architecture
AppleWebKit/537.36 (KHTML, like Gecko): rendering engine
Chrome/72.0.3626.109 Safari/537.36: the actual browser name and version
Commonly imitated browsers include Firefox, Opera and Chrome, for example:
ua_list = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
]
Note that the list holds only the header values; the "User-Agent:" name is supplied separately when the header is set.
Set the request header before contacting the server:
import random
import urllib.request

url = "http://www.baidu.com/"
ua_list = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
]

# pick a random User-Agent from the list
user_agent = random.choice(ua_list)
# build a Request object
request = urllib.request.Request(url)
# set (or add) an HTTP request header
request.add_header("User-Agent", user_agent)

head = request.get_header("User-agent")
print(head)
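One detail worth knowing (this is CPython's urllib.request behaviour): add_header() stores the header name with str.capitalize(), so the stored key is "User-agent". That is why get_header("User-agent") is spelled this way; get_header("User-Agent") would return None:

print(request.get_header("User-agent"))  # the chosen UA string
print(request.get_header("User-Agent"))  # None - the stored key is "User-agent"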
Implement a simple keyword search by building the query URL by hand:
import urllib.request
import urllib.parse
import random

# target address
url = "http://www.baidu.com/s"
# spoofed client User-Agent strings
ua_list = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"
]
# pick one at random for this request
user_agent = random.choice(ua_list)
# read the search keyword from the user
select = input("Enter the keyword to search for: ")

# encode the query parameter
wd = {"wd": select}
wd = urllib.parse.urlencode(wd)
# build the full URL
url = url + "?" + wd

# create a Request object
request = urllib.request.Request(url)
# set the User-Agent header
request.add_header("User-Agent", user_agent)
# contact the target server
response = urllib.request.urlopen(request)
# read the body and decode it as UTF-8
html = response.read().decode("utf-8")

# print(html)
print(">>>>>>" + url)