  • python urllib, urlparse, urllib2, cookielib

    1. The urllib module

    1.urllib.urlopen(url[,data[,proxies]])

    Opens a URL and returns a file-like object that supports the usual file operations. This example tries to open Google:

    import urllib
    
    f = urllib.urlopen('http://www.google.com.hk/')
    firstLine = f.readline()   # read the first line of the HTML page
    

    The object returned by urlopen provides these methods:

    -         read([bytes]): read all remaining bytes, or up to bytes bytes

    -         readline(): read one line

    -         readlines(): read all lines

    -         fileno(): return the file descriptor

    -         close(): close the URL connection

    -         info(): return an httplib.HTTPMessage object containing the headers sent by the remote server

    -         getcode(): return the HTTP status code; for an http request, 200 means the request completed successfully, 404 means the URL was not found

    -         geturl(): return the URL of the request

    2.urllib.urlretrieve(url[,filename[,reporthook[,data]]])

    The urlretrieve method downloads the document located at url to your local disk. If filename is not given, the data is saved as a temporary file.

    urlretrieve() returns a 2-tuple (filename, mime_hdrs)

    Saving to a temporary file:

    filename = urllib.urlretrieve('http://www.google.com.hk/')
    
    type(filename)
    <type 'tuple'>
    
    print filename[0]
    print filename[1]
    

    Output:

    '/tmp/tmp8eVLjq'
    
    <httplib.HTTPMessage instance at 0xb6a363ec>
    

    Saving to a named local file:

    filename = urllib.urlretrieve('http://www.baidu.com/',filename='/home/dzhwen/python文件/Homework/urllib/google.html')
    
    print type(filename)
    print filename[0]
    print filename[1]
    

    Output:

    <type 'tuple'>
    '/home/dzhwen/python\xe6\x96\x87\xe4\xbb\xb6/Homework/urllib/google.html'
    <httplib.HTTPMessage instance at 0xb6e2c38c>
    

    The reporthook parameter is used as follows:

    def process(blk,blk_size,total_size):
    	print('%d/%d - %.02f%%' %(blk*blk_size,total_size,(float)(blk * blk_size) / total_size * 100))
    
    def download():
    	filename,fileinfo = urllib.urlretrieve('http://cnblogs.com','index.html',reporthook=process)
    

    Output:

    0/46164 - 0.00%
    8192/46164 - 17.75%
    16384/46164 - 35.49%
    24576/46164 - 53.24%
    32768/46164 - 70.98%
    40960/46164 - 88.73%
    49152/46164 - 106.47%
    

    blk * blk_size can exceed total_size, so the function above can be rewritten as:

    def process(blk,blk_size,total_size):
    	if total_size == -1:
    		print "can't determine the file size, now retrived", blk * blk_size
    	else:
    		percentage = int((blk * blk_size * 100.0) / total_size)
    		if percentage >= 100:
    			print('%d/%d - %d%%' % (total_size, total_size, 100))
    		else:
    			print('%d/%d - %d%%' % (blk * blk_size, total_size, percentage))
    

    Output after running:

    0/46238 - 0%
    8192/46238 - 17%
    16384/46238 - 35%
    24576/46238 - 53%
    32768/46238 - 70%
    40960/46238 - 88%
    46238/46238 - 100%
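The same clamping logic can be pulled out into a standalone helper that is easy to test without any download. This is a sketch (the function name format_progress is made up for illustration, and it uses Python 3 syntax):

```python
def format_progress(blocks, block_size, total_size):
    """Format one progress line for urlretrieve's reporthook.

    blocks * block_size can overshoot total_size on the final callback,
    so the byte count is clamped before the percentage is computed.
    """
    if total_size == -1:
        # the server sent no Content-Length header
        return "can't determine the file size, now retrieved %d" % (blocks * block_size)
    done = min(blocks * block_size, total_size)
    return '%d/%d - %d%%' % (done, total_size, done * 100 // total_size)

print(format_progress(6, 8192, 46238))  # 46238/46238 - 100%
```

The clamp with min() is what keeps the last line at exactly 100% instead of 106%.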
    

    3.urllib.urlcleanup()

    Clears the cache left behind by urllib.urlretrieve()

    4. urllib.quote(url) and urllib.quote_plus(url)

    Percent-encodes a string so that it can safely be embedded in a URL, printed, and accepted by a web server.

    urllib.quote('http://www.baidu.com')
    

    Result:

    'http%3A//www.baidu.com'
    urllib.quote_plus('http://www.baidu.com')
    

    Result:

    'http%3A%2F%2Fwww.baidu.com'

    5. urllib.unquote(url) and urllib.unquote_plus(url)

    The inverses of the functions in 4.
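In Python 3 these four functions moved to urllib.parse. A quick sketch (assuming Python 3) of the quote/quote_plus difference and of the round trip through the inverse functions:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

url = 'http://www.baidu.com'

# quote() leaves '/' unescaped by default (its safe parameter defaults to '/')
print(quote(url))       # http%3A//www.baidu.com

# quote_plus() escapes '/' as well, and encodes spaces as '+'
print(quote_plus(url))  # http%3A%2F%2Fwww.baidu.com

# unquote/unquote_plus undo the encoding exactly
assert unquote(quote(url)) == url
assert unquote_plus(quote_plus(url)) == url
```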

    6.urllib.urlencode(query)

    Encodes a dict of key/value pairs into a query string joined by &

    Combined with urlopen, this can implement both the GET and the POST method:

    GET:

    import urllib
    
    params=urllib.urlencode({'spam':1,'eggs':2,'bacon':0})
    f=urllib.urlopen("http://python.org/query?%s" % params)
    print f.read()
    

    POST:

    import urllib
    
    params = urllib.urlencode({'spam':1,'eggs':2,'bacon':0})
    f=urllib.urlopen("http://python.org/query",params)
    f.read()
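The urlencode half of both examples runs offline. A sketch of what it produces (in Python 3 the function lives in urllib.parse):

```python
from urllib.parse import urlencode

params = urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
print(params)  # spam=1&eggs=2&bacon=0  (dicts keep insertion order in Python 3.7+)

# GET: the query string is appended to the URL after '?'
get_url = "http://python.org/query?%s" % params

# POST: the same string is passed as the request body instead of in the URL
post_body = params.encode('ascii')
```

The only difference between the two methods is where the encoded string ends up: in the URL for GET, in the body for POST.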
    

    2. The urlparse module

    1.urlparse

    Purpose: split a URL into its components

    def parse_html():
    	url = 'https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980'
    	result = urlparse.urlparse(url)
    	# params = urlparse.parse_qs(result.query)
    	print result
    	# print params
    

    Output:

    ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980', fragment='')
    

    As shown above, this returns a ParseResult object containing the scheme, host address, path, params, and query string.

    2.parse_qs

    import urllib
    import urlparse
    
    def parse_html():
    	url = 'https://www.baidu.com/s?wd=python&rsv_spt=1&rsv_iqid=0xad2dc5550032146a&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=7&rsv_sug1=5&rsv_sug7=100&rsv_sug2=0&inputT=22&rsv_sug4=4980'
    	result = urlparse.urlparse(url)
    	params = urlparse.parse_qs(result.query)
    	# print result
    	print params
    
    if __name__ == '__main__':
    	# demo()
    	# demo2()
    	parse_html()
    

    Output:

    {'wd': ['python'], 'rsv_spt': ['1'], 'rsv_iqid': ['0xad2dc5550032146a'], 'inputT': ['22'], 'f': ['8'], 'rsv_enter': ['1'], 'rsv_bp': ['0'], 'rsv_idx': ['2'], 'tn': ['baiduhome_pg'], 'rsv_sug4': ['4980'], 'rsv_sug7': ['100'], 'rsv_sug1': ['5'], 'issp': ['1'], 'rsv_sug3': ['7'], 'rsv_sug2': ['0'], 'ie': ['utf-8']}
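The same parse can be reproduced offline with a shorter query string. A sketch using Python 3's urllib.parse, which absorbed the old urlparse module:

```python
from urllib.parse import urlparse, parse_qs

url = 'https://www.baidu.com/s?wd=python&ie=utf-8'
result = urlparse(url)
print(result.scheme, result.netloc, result.path)  # https www.baidu.com /s

# parse_qs turns the query string into a dict of lists
# (lists, because a key may repeat in a query string)
params = parse_qs(result.query)
print(params)  # {'wd': ['python'], 'ie': ['utf-8']}
```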
    

    3. The urllib2 module

    urllib2 provides more powerful features, such as cookie management, but it cannot completely replace urllib, because urllib2 has no equivalent of urllib.urlencode.

    3.1 urllib2.urlopen()

    Purpose: open a URL

    Parameters:

    • url
    • data = None
    • timeout = socket._GLOBAL_DEFAULT_TIMEOUT
    import urllib
    import urllib2
    
    def demo():
    	url = 'http://www.cnblogs.com/hester/sllsl'
    	try:
    		s = urllib2.urlopen(url,timeout = 3)
    	except urllib2.HTTPError,e:
    		print e
    	else:
    		print s.read(100)
    
    if __name__ == '__main__':
    	demo()
     

    Output:

    <!DOCTYPE html>
    <html lang="zh-cn">
    <head>
    <meta charset="utf-8"/>
    <title>”温故而知新“ 
    

    If url is changed to a nonexistent address:

    url = 'http://www.cnblogs.com/hester/asdfas'
    

    Output:

    HTTP Error 404: Not Found
    

    3.2 urllib2.Request()

    Purpose: add or modify HTTP headers

    Parameters:

    • url
    • data
    • headers
    import urllib
    import urllib2
    
    def demo():
    	url = 'http://www.cnblogs.com/hester'
    	headers = {'User-Agent':'Mozilla/5.0','x-my-hester':'my value'}
    	req = urllib2.Request(url,headers=headers)
    	s = urllib2.urlopen(req)
    	print s.read(100)
    	print req.headers
    	s.close()
    
    if __name__ == '__main__':
    	demo()
    

    Output:

    <!DOCTYPE html>
    <html lang="zh-cn">
    <head>
    <meta charset="utf-8"/>
    <title>”温故而知新“ 
    {'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}
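The header normalization visible in that last line (each key stored with only its first letter capitalized) can be checked offline. In Python 3 the class is urllib.request.Request; this sketch builds the request but never sends it:

```python
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0', 'x-my-hester': 'my value'}
req = urllib.request.Request('http://www.cnblogs.com/hester', headers=headers)

# Request stores each key as key.capitalize(): 'User-Agent' -> 'User-agent'
print(req.headers)  # {'User-agent': 'Mozilla/5.0', 'X-my-hester': 'my value'}

# look headers up with the same capitalized form
assert req.get_header('User-agent') == 'Mozilla/5.0'
```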
    

     3.3 urllib2.build_opener()

    Purpose: create an opener

    Parameters:

    • a list of Handlers, e.g.:
    1. ProxyHandler
    2. UnknownHandler
    3. HTTPHandler
    4. HTTPDefaultErrorHandler
    5. HTTPRedirectHandler
    6. FTPHandler
    7. FileHandler
    8. HTTPErrorProcessor
    9. HTTPSHandler

    Returns:

    • OpenerDirector
    import urllib
    import urllib2
    
    def request_post_debug():
    	data = {'username':'hester_ge','password':'xxxxxxx'}
    	headers = {'User-Agent':'Mozilla/5.0','x-my-hester':'my value'}
    	req = urllib2.Request('http://www.cnblogs.com/hester',data = urllib.urlencode(data),headers=headers)
    	opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
    	s = opener.open(req)
    	print s.read(100)
    	s.close()
    
    if __name__ == '__main__':
    	request_post_debug()
    

    Output:

    send: 'POST /hester HTTP/1.1
    Accept-Encoding: identity
    Content-Length: 35
    Host: www.cnblogs.com
    X-My-Hester: my value
    User-Agent: Mozilla/5.0
    Connection: close
    Content-Type: application/x-www-form-urlencoded
    
    username=hester_ge&password=xxxxxxx'
    reply: 'HTTP/1.1 200 OK
    '
    header: Date: Sun, 03 Jul 2016 08:28:37 GMT
    header: Content-Type: text/html; charset=utf-8
    header: Content-Length: 14096
    header: Connection: close
    header: Vary: Accept-Encoding
    header: Cache-Control: private, max-age=10
    header: Expires: Sun, 03 Jul 2016 08:28:45 GMT
    header: Last-Modified: Sun, 03 Jul 2016 08:28:35 GMT
    header: X-UA-Compatible: IE=10
    <!DOCTYPE html>
    <html lang="zh-cn">
    <head>
    <meta charset="utf-8"/>
    <title>”温故而知新“ 
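build_opener itself needs no network access. In Python 3 it lives in urllib.request, and the OpenerDirector it returns exposes its handler chain, so the setup can be verified offline (a sketch):

```python
import urllib.request

# debuglevel=1 makes the handler print the raw request/response exchange,
# which is exactly what produced the send:/reply:/header: lines above
opener = urllib.request.build_opener(urllib.request.HTTPHandler(debuglevel=1))

# build_opener returns an OpenerDirector holding the default handlers
# plus the one we passed in
assert isinstance(opener, urllib.request.OpenerDirector)
assert any(isinstance(h, urllib.request.HTTPHandler) for h in opener.handlers)
```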
    

     3.4 urllib2.install_opener

    Purpose: install the created opener as the process-wide default, so that plain urllib2.urlopen() uses it

    import urllib
    import urllib2
    
    def demo():
    	url = 'http://www.cnblogs.com/hester'
    	headers = {'User-Agent':'Mozilla/5.0','x-my-hester':'my value'}
    	req = urllib2.Request(url,headers=headers)
    	s = urllib2.urlopen(req)
    	print s.read(100)
    	print req.headers
    	s.close()
    
    # def request_post_debug():
    # 	data = {'username':'hester_ge','password':'xxxxxxx'}
    # 	headers = {'User-Agent':'Mozilla/5.0','x-my-hester':'my value'}
    # 	req = urllib2.Request('http://www.cnblogs.com/hester',data = urllib.urlencode(data),headers=headers)
    # 	opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
    # 	s = opener.open(req)
    # 	print s.read(100)
    # 	s.close()
    
    def install_opener():
    	opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1),
    								  urllib2.HTTPSHandler(debuglevel=1))
    	urllib2.install_opener(opener)
    
    if __name__ == '__main__':
    	# request_post_debug()
    	demo()
    

    Output:

    <!DOCTYPE html>
    <html lang="zh-cn">
    <head>
    <meta charset="utf-8"/>
    <title>”温故而知新“ 
    {'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}
    

    If the code above is changed to:

    if __name__ == '__main__':
    	# request_post_debug()
    	install_opener()
    	demo()
    

    Output:

    send: 'GET /hester HTTP/1.1
    Accept-Encoding: identity
    Host: www.cnblogs.com
    Connection: close
    X-My-Hester: my value
    User-Agent: Mozilla/5.0
    
    '
    reply: 'HTTP/1.1 200 OK
    '
    header: Date: Sun, 03 Jul 2016 08:39:31 GMT
    header: Content-Type: text/html; charset=utf-8
    header: Content-Length: 14096
    header: Connection: close
    header: Vary: Accept-Encoding
    header: Cache-Control: private, max-age=10
    header: Expires: Sun, 03 Jul 2016 08:39:41 GMT
    header: Last-Modified: Sun, 03 Jul 2016 08:39:31 GMT
    header: X-UA-Compatible: IE=10
    <!DOCTYPE html>
    <html lang="zh-cn">
    <head>
    <meta charset="utf-8"/>
    <title>”温故而知新“ 
    {'X-my-hester': 'my value', 'User-agent': 'Mozilla/5.0'}
    

    4. The cookielib module

    Because HTTP is stateless, the server cannot tell whether requests come from the same machine, so cookies are used as an identifier.

    The client browser first sends a request to the server; the server parses it and sends back a response, whose Set-Cookie headers are stored by the browser.

    Two classes are used here:

    cookielib.CookieJar provides an interface for parsing and storing cookies

    urllib2.HTTPCookieProcessor automatically handles cookies

    #encoding=utf8
    import urllib2
    import cookielib
    
    def handler_cookie():
    	cookiejar = cookielib.CookieJar()
    	handler = urllib2.HTTPCookieProcessor(cookiejar=cookiejar)
    	opener = urllib2.build_opener(handler,urllib2.HTTPHandler(debuglevel=1))
    	s = opener.open('http://www.douban.com/')
    	print s.read(100)
    	s.close()
    
    	print '=' * 20
    	print cookiejar._cookies
    	print '=' * 20
    
    	# the cookie is attached automatically on the second request
    	s2 = opener.open('http://www.douban.com/')
    	print s2.read(100)
    	s2.close()
    
    if __name__ == '__main__':
    	handler_cookie()
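Building the cookie-aware opener works offline. A Python 3 sketch (cookielib became http.cookiejar, and HTTPCookieProcessor moved to urllib.request; no request is actually sent here):

```python
import http.cookiejar
import urllib.request

cookiejar = http.cookiejar.CookieJar()  # parses and stores cookies
handler = urllib.request.HTTPCookieProcessor(cookiejar)
opener = urllib.request.build_opener(handler)

# the processor holds the jar it will fill from Set-Cookie response headers
assert handler.cookiejar is cookiejar
# the jar stays empty until a response has been processed
assert list(cookiejar) == []
```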
    

    Output:

    /usr/bin/python2.7 /home/hester/PycharmProjects/untitled/demo4.py
    send: 'GET / HTTP/1.1
    Accept-Encoding: identity
    Host: www.douban.com
    Connection: close
    User-Agent: Python-urllib/2.7
    
    '
    reply: 'HTTP/1.1 301 Moved Permanently
    '
    header: Date: Sun, 03 Jul 2016 10:01:41 GMT
    header: Content-Type: text/html
    header: Content-Length: 178
    header: Connection: close
    header: Location: https://www.douban.com/
    header: Server: dae
    <!DOCTYPE HTML>
    <html lang="zh-cms-Hans" class="">
    <head>
    <meta charset="UTF-8">
    <meta name="descrip
    ====================
    {'.douban.com': {'/': {'ll': Cookie(version=0, name='ll', value='"118163"', port=None, port_specified=False, domain='.douban.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1499076101, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False), 'bid': Cookie(version=0, name='bid', value='dDz4rCqWvcQ', port=None, port_specified=False, domain='.douban.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1499076101, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)}}}
    ====================
    send: 'GET / HTTP/1.1
    Accept-Encoding: identity
    Host: www.douban.com
    Cookie: ll="118163"; bid=dDz4rCqWvcQ
    Connection: close
    User-Agent: Python-urllib/2.7
    
    '
    reply: 'HTTP/1.1 301 Moved Permanently
    '
    header: Date: Sun, 03 Jul 2016 10:01:42 GMT
    header: Content-Type: text/html
    header: Content-Length: 178
    header: Connection: close
    header: Location: https://www.douban.com/
    header: Server: dae
    <!DOCTYPE HTML>
    <html lang="zh-cms-Hans" class="">
    <head>
    <meta charset="UTF-8">
    <meta name="descrip
    
    Process finished with exit code 0
    
  • Original source: https://www.cnblogs.com/hester/p/5420696.html