>>> from urllib.parse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
>>> from urllib.parse import urlparse
>>> urlparse('//www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html', params='', query='', fragment='')
>>> urlparse('www.cwi.nl/%7Eguido/Python.html')
ParseResult(scheme='', netloc='', path='www.cwi.nl/%7Eguido/Python.html', params='', query='', fragment='')
>>> urlparse('help/Python.html')
ParseResult(scheme='', netloc='', path='help/Python.html', params='', query='', fragment='')
Attribute | Index | Value | Value if not present |
---|---|---|---|
scheme | 0 | URL scheme specifier | scheme parameter |
netloc | 1 | Network location part | empty string |
path | 2 | Hierarchical path | empty string |
params | 3 | Parameters for last path element | empty string |
query | 4 | Query component | empty string |
fragment | 5 | Fragment identifier | empty string |
username | | User name | None |
password | | Password | None |
hostname | | Host name (lower case) | None |
port | | Port number as integer, if present | None |
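The attributes in the table can be read either by name or, for the first six, by index. Here is a small sketch of that; the URL with credentials is a made-up example, not one from the reference documentation.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only so that every attribute has a value
parts = urlparse('http://user:secret@WWW.Example.com:8080/path;type=a?x=1#top')

# The first six attributes are also reachable by tuple index
print(parts.scheme, parts[0])    # both give the scheme
print(parts.netloc, parts[1])    # both give the network location

# The remaining four are derived from netloc and have no index
print(parts.username)
print(parts.password)
print(parts.hostname)            # always lower-cased
print(parts.port)                # an int, not a string

# When a part is missing, these attributes are None rather than ''
bare = urlparse('http://www.cwi.nl/%7Eguido/Python.html')
print(bare.username, bare.port)
```

Note that `hostname` is lower-cased, while `netloc` keeps the original spelling.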
>>> from urllib.parse import urljoin
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
'http://www.cwi.nl/%7Eguido/FAQ.html'
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'
>>> import urllib.parse
>>> urllib.parse.quote('http://www.baidu.com')
'http%3A//www.baidu.com'
>>> urllib.parse.unquote('http%3A//www.baidu.com')
'http://www.baidu.com'
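In the output above, `:` is escaped but `/` is not. That is because `quote` treats `/` as safe by default. The sketch below shows the effect of the `safe` parameter and a round trip with non-ASCII text; the sample strings are only for illustration.

```python
from urllib.parse import quote, unquote

url = 'http://www.baidu.com'

# safe defaults to '/', so slashes are left alone
print(quote(url))

# With safe='' the slashes are percent-encoded as well
print(quote(url, safe=''))

# Non-ASCII text is encoded as UTF-8 bytes before escaping
encoded = quote('中文')
print(encoded)

# unquote reverses the escaping and decodes as UTF-8 by default
print(unquote(encoded))
```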
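A simple demo
The approach is as follows:
- Fetch a web page and read its content into a variable.
- Open a local file in write mode, giving it a web-page extension such as *.html.
- Write the variable from the first step into that file.
- Close the file.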
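import urllib.request

url = 'http://www.baidu.com'
# Send a browser-like User-Agent header with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
request = urllib.request.Request(url, headers=headers)
# Step 1: fetch the page and read the raw bytes into a variable
response = urllib.request.urlopen(request).read()
# Step 2: open a local file for writing; "wb" because read() returns bytes
h = open("./1.html", "wb")
# Step 3: write the page content to the file
h.write(response)
# Step 4: close the file
h.close()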
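The same four steps can also be wrapped in a function that uses `with` blocks, so that the response and the file are closed even if something fails along the way. This is only a sketch: `save_page`, `DEFAULT_HEADERS`, and the 10-second timeout are names and values made up for illustration.

```python
import urllib.request

# Assumed default header for illustration; same User-Agent as the demo
DEFAULT_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}


def save_page(url, path, timeout=10):
    """Fetch url and write the raw response bytes to path (hypothetical helper)."""
    request = urllib.request.Request(url, headers=DEFAULT_HEADERS)
    # The with block closes the response when reading is done
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    # "wb" because body is bytes; the with block closes the file
    with open(path, 'wb') as f:
        f.write(body)
    # Return the number of bytes written
    return len(body)
```

A call such as `save_page('http://www.baidu.com', './1.html')` would then do the same as the script above.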
Reference: https://docs.python.org/3/library/urllib.parse.html?highlight=urlparse#urllib.parse.urlparse