Python Web Scraping Notes (1): Using urllib

1. urllib.urlopen

Opens a URL and returns a file-like object, which can then be used much like an ordinary file object.

In [1]: import urllib

In [2]: file = urllib.urlopen("http://www.baidu.com")

In [3]: file.readline()
Out[3]: '<!DOCTYPE html><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="\xe7\x99\xbe\xe5\xba\xa6\xe6\x90\x9c\xe7\xb4\xa2" /><link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu.svg"><link rel="dns-prefetch" href="//s1.bdstatic.com"/><link rel="dns-prefetch" href="//t1.baidu.com"/><link rel="dns-prefetch" href="//t2.baidu.com"/><link rel="dns-prefetch" href="//t3.baidu.com"/><link rel="dns-prefetch" href="//t10.baidu.com"/><link rel="dns-prefetch" href="//t11.baidu.com"/><link rel="dns-prefetch" href="//t12.baidu.com"/><link rel="dns-prefetch" href="//b1.bdstatic.com"/><title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title> '
In [4]: file.getcode()
Out[4]: 200

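The object returned by urlopen provides the following methods:

- read(), readline(), readlines(), fileno(), close(): these work exactly as they do on a file object

- info(): returns an httplib.HTTPMessage object holding the headers sent back by the remote server

- getcode(): returns the HTTP status code. For an HTTP request, 200 means the request completed successfully and 404 means the URL was not found

- geturl(): returns the URL that was actually retrieved, which can differ from the requested URL if a redirect was followed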
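The methods listed above can be tried without network access. The following is a minimal sketch, assuming Python 3, where urlopen lives in urllib.request rather than at the top level of urllib as in the Python 2 session shown here. The temporary file, its contents, and the variable names are made up for illustration, and a file:// URL stands in for a real site. getcode() is left out because it is only meaningful for HTTP responses.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen  # assumption: Python 3 module layout

# Hypothetical sample page written to a temporary directory.
page = b"<html>\n<body>hello</body>\n</html>\n"
tmp_dir = tempfile.mkdtemp()
page_path = os.path.join(tmp_dir, "page.html")
with open(page_path, "wb") as fh:
    fh.write(page)

# A file:// URL lets the example run offline.
url = Path(page_path).as_uri()

response = urlopen(url)
try:
    first_line = response.readline()  # one line, as with a file object
    rest = response.read()            # everything that is left
    headers = response.info()         # header object describing the response
    content_length = headers.get("Content-Length")
    final_url = response.geturl()     # URL of the resource that was retrieved
finally:
    response.close()

print(first_line)
print(rest)
print(content_length, final_url)
```

Replacing url with an http:// address would exercise the same methods against a real server, where getcode() then reports the status code.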

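2. urllib.urlretrieve

urlretrieve downloads the resource that the URL points to (an HTML file, for example) to your local disk. If filename is not given, the data is saved to a temporary file.

urlretrieve() returns a 2-tuple (filename, mime_hdrs): the name of the local file and the headers of the response.

Saving to a local file: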
In [12]: file = urllib.urlretrieve("http://www.baidu.com","/tmp/baidu.html")

In [13]: ls /tmp/baidu.html
/tmp/baidu.html
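The return value described above can be inspected offline. This is a minimal sketch, assuming Python 3, where the function is urllib.request.urlretrieve. The source file, the target path, and the variable names are hypothetical, and a file:// URL replaces the real site so that no network access is needed.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve  # assumption: Python 3 module layout

# Hypothetical source document that plays the role of the remote page.
tmp_dir = tempfile.mkdtemp()
source_path = os.path.join(tmp_dir, "source.html")
with open(source_path, "wb") as fh:
    fh.write(b"<html><body>saved copy</body></html>\n")

# Explicit target file name, so no temporary file is created.
target_path = os.path.join(tmp_dir, "baidu.html")
saved_name, mime_hdrs = urlretrieve(Path(source_path).as_uri(), target_path)

# The first tuple element is the local file name, the second the headers.
with open(saved_name, "rb") as fh:
    saved_bytes = fh.read()

print(saved_name)
print(len(saved_bytes))
```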
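4. urllib.quote(url) and urllib.unquote(url), urllib.quote_plus(url) and urllib.unquote_plus(url)

urllib.quote(url): of the reserved URL characters, reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ",", every one except "/" is percent-encoded

urllib.unquote(url): restores a URL that was encoded by quote

urllib.quote_plus(url): every reserved character in the URL, including "/", is percent-encoded, and spaces become "+"
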
In [18]: urllib.quote("http://neeao.com/index.php?id=1")
Out[18]: 'http%3A//neeao.com/index.php%3Fid%3D1'

In [19]: urllib.unquote("http%3A//neeao.com/index.php%3Fid%3D1")
Out[19]: 'http://neeao.com/index.php?id=1'

In [20]: urllib.quote_plus("http://neeao.com/index.php?id=1")
Out[20]: 'http%3A%2F%2Fneeao.com%2Findex.php%3Fid%3D1'

In [21]: urllib.unquote_plus("http%3A%2F%2Fneeao.com%2Findex.php%3Fid%3D1")
Out[21]: 'http://neeao.com/index.php?id=1'
unquote and unquote_plus reverse quote and quote_plus, respectively.
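The difference between the two pairs shows most clearly on a string that contains both a space and a slash. The following is a small sketch, assuming Python 3, where these functions live in urllib.parse. The sample string "a b/c" is made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

url = "http://neeao.com/index.php?id=1"
quoted = quote(url)            # "/" is left alone by default
quoted_plus = quote_plus(url)  # "/" is encoded as well

text = "a b/c"                    # hypothetical sample with a space and a slash
text_quoted = quote(text)         # 'a%20b/c'
text_plus = quote_plus(text)      # 'a+b%2Fc'

# unquote leaves "+" untouched, while unquote_plus turns it back into a space.
plain_decoded = unquote(text_plus)       # 'a+b/c'
plus_decoded = unquote_plus(text_plus)   # 'a b/c'

print(quoted)
print(quoted_plus)
print(text_quoted, text_plus, plain_decoded, plus_decoded)
```

Strings produced by quote_plus should therefore be decoded with unquote_plus, otherwise "+" is not turned back into a space.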

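5. urllib.urlencode(query)

Encodes the key/value pairs of a query as key=value items joined by the & separator.

It can be combined with urlopen to issue GET and POST requests:

GET method: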
>>> import urllib
>>> params = urllib.urlencode({'spam':1,'eggs':2,'bacon':0})
>>> params
'eggs=2&bacon=0&spam=1'
>>> f = urllib.urlopen("http://python.org/query?%s" % params)
>>> print f.read()
POST method:
>>> import urllib
>>> params = urllib.urlencode({'spam':1,'eggs':2,'bacon':0})
>>> f = urllib.urlopen("http://python.org/query", params)
>>> f.read()
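How the same encoded query ends up in a GET or a POST request can be checked without sending anything. This is a minimal sketch, assuming Python 3, where urlencode lives in urllib.parse and requests are described with urllib.request.Request. The variable names are illustrative, and the requests are only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Dictionaries keep insertion order in Python 3.7+, so the order is stable.
params = urlencode({"spam": 1, "eggs": 2, "bacon": 0})

# GET: the encoded query is appended to the URL after "?".
get_request = Request("http://python.org/query?%s" % params)

# POST: the encoded query travels in the request body, which must be bytes.
post_request = Request("http://python.org/query", data=params.encode("ascii"))

print(params)
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

Passing either object to urllib.request.urlopen would send the request. The presence of data is what switches the method from GET to POST.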
Original article: https://www.cnblogs.com/linux-wangkun/p/5947085.html
