- Question: what is the simplest way to fetch the source of a web page?
- Solution: use the urllib library to download the page's source code.
------------------------------------------------------------------------------------
- Code example
#python3.4
import urllib.request
response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
print(response.read())
- Output
b' <!DOCTYPE html> <html> <head> <meta charset="utf-8"/> <title>\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b - \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</title> <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/> <meta content="\xe6\x8a\x80\xe6\x9c\xaf\xe6\x90\x9c\xe7\xb4\xa2,IT\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe6\x90\x9c\xe7\xb4\xa2,\xe4\xbb\xa3\xe7\xa0\x81\xe6\x90\x9c\xe7\xb4\xa2,\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e" name="keywords" /> <meta content="\xe9\x9d\xa2\xe5\x90\x91\xe7\xa8\x8b\xe5\xba\x8f\xe5\x91\x98\xe7\x9a\x84\xe4\xb8\x93\xe4\xb8\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e\xe3\x80\x82\xe9\x81\x87\xe5\x88\xb0\xe6\x8a\x80\xe6\x9c\xaf\xe9\x97\xae\xe9\xa2\x98\xe6\x80\x8e\xe4\xb9\x88\xe5\x8a\x9e\xef\xbc\x8c\xe5\x88\xb0\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8b..." name="description" /> <link type="text/css" href="/Content/Style.css" rel="stylesheet" /> <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script> <script src="/Scripts/Common.js" type="text/javascript"></script> <script src="/Scripts/Home.js" type="text/javascript"></script> </head> <body> <div class="top"> <div class="top_tabs"> <a href="http://www.cnblogs.com">\xc2\xab \xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad\xe9\xa6\x96\xe9\xa1\xb5 </a> </div> <div id="span_userinfo" class="top_links"> </div> </div> <div style="clear: both"> </div> <center> <div id="main"> <div class="logo_index"> <a href="http://zzk.cnblogs.com"> <img alt="\xe6\x89\xbe\xe6\x89\xbe\xe7\x9c\x8blogo" src="/images/logo.gif" /></a> </div> <div class="index_sozone"> <div class="index_tab"> <a href="/n" onclick="return channelSwitch('n');">\xe6\x96\xb0\xe9\x97\xbb</a> <a class="tab_selected" href="/b" onclick="return channelSwitch('b');">\xe5\x8d\x9a\xe5\xae\xa2</a> <a href="/k" onclick="return channelSwitch('k');">\xe7\x9f\xa5\xe8\xaf\x86\xe5\xba\x93</a> <a href="/q" onclick="return channelSwitch('q');">\xe5\x8d\x9a\xe9\x97\xae</a> </div> <div class="search_block"> <div class="index_btn"> <input type="button" class="btn_so_index" onclick="Search();" value=" \xe6\x89\xbe\xe4\xb8\x80\xe4\xb8\x8b " /> <span class="help_link"><a target="_blank" href="/help">\xe5\xb8\xae\xe5\x8a\xa9</a></span> </div> <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" /> </div> </div> </div> <div class="footer"> ©2004-2016 <a href="http://www.cnblogs.com">\xe5\x8d\x9a\xe5\xae\xa2\xe5\x9b\xad</a> </div> </center> </body> </html> '
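- For comparison, here is the Python 2.7 version:
#python2.7
import urllib2
response = urllib2.urlopen("http://zzk.cnblogs.com/b")
print response.read()
- As you can see, the Python 3.4 and Python 2.7 versions differ: Python 3 moved urlopen from urllib2 into urllib.request, and print is a function rather than a statement.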
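If one script has to run under both interpreters, a common pattern is to try the Python 3 import first and fall back to the Python 2 module. This is only a minimal sketch and not part of the original walkthrough; the helper name fetch_bytes is hypothetical.

```python
try:
    # Python 3: urlopen lives in urllib.request
    from urllib.request import urlopen
except ImportError:
    # Python 2: the same function is provided by urllib2
    from urllib2 import urlopen


def fetch_bytes(url):
    """Return the raw response body (hypothetical helper, not from the post)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        # Close explicitly; Python 2 responses are not context managers
        response.close()
```

Calling fetch_bytes("http://zzk.cnblogs.com/b") would then return the same undecoded bytes on either version.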
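----------@_@? A problem appears!----------------------------------------------------------------------
- Problem: in the output above, the Chinese text is not displayed properly. read() returns raw bytes, so every Chinese character is printed as escaped UTF-8 bytes.
- Fix: handle the Chinese encoding.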
--------------------------------------------------------------------------------------------------
- Handling Chinese text in the page source!!!
- Change the code as follows:
#python3.4
import urllib.request
response = urllib.request.urlopen("http://zzk.cnblogs.com/b")
print(response.read().decode('UTF-8'))
- Run it; the output is:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
<!DOCTYPE html> <html> <head> <meta charset="utf-8"/> <title>找找看 - 博客园</title> <link rel="shortcut icon" href="/Content/Images/favicon.ico" type="image/x-icon"/> <meta content="技术搜索,IT搜索,程序搜索,代码搜索,程序员搜索引擎" name="keywords" /> <meta content="面向程序员的专业搜索引擎。遇到技术问题怎么办,到博客园找找看..." name="description" /> <link type="text/css" href="/Content/Style.css" rel="stylesheet" /> <script src="http://common.cnblogs.com/script/jquery.js" type="text/javascript"></script> <script src="/Scripts/Common.js" type="text/javascript"></script> <script src="/Scripts/Home.js" type="text/javascript"></script> </head> <body> <div class="top"> <div class="top_tabs"> <a href="http://www.cnblogs.com">« 博客园首页 </a> </div> <div id="span_userinfo" class="top_links"> </div> </div> <div style="clear: both"> </div> <center> <div id="main"> <div class="logo_index"> <a href="http://zzk.cnblogs.com"> <img alt="找找看logo" src="/images/logo.gif" /></a> </div> <div class="index_sozone"> <div class="index_tab"> <a href="/n" onclick="return channelSwitch('n');">新闻</a> <a class="tab_selected" href="/b" onclick="return channelSwitch('b');">博客</a> <a href="/k" onclick="return channelSwitch('k');">知识库</a> <a href="/q" onclick="return channelSwitch('q');">博问</a> </div> <div class="search_block"> <div class="index_btn"> <input type="button" class="btn_so_index" onclick="Search();" value=" 找一下 " /> <span class="help_link"><a target="_blank" href="/help">帮助</a></span> </div> <input type="text" onkeydown="searchEnter(event);" class="input_index" name="w" id="w" /> </div> </div> </div> <div class="footer"> ©2004-2016 <a href="http://www.cnblogs.com">博客园</a> </div> </center> </body> </html>

Process finished with exit code 0
- Result: once the bytes are decoded, the Chinese text in the page source displays correctly.
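The example hard-codes UTF-8, which matches the charset this page declares. As a hedged variation, the charset can instead be read from the response's Content-Type header, falling back to UTF-8 when none is declared. The helper names decode_body and fetch_text are hypothetical, and the demonstration at the bottom uses local sample bytes rather than a live request.

```python
import urllib.request


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode response bytes, preferring the declared charset (hypothetical helper)."""
    return raw.decode(charset or fallback, errors="replace")


def fetch_text(url):
    """Fetch a page and decode it with the charset from Content-Type, if any."""
    with urllib.request.urlopen(url) as resp:
        # get_content_charset() returns None when the header names no charset
        charset = resp.headers.get_content_charset()
        return decode_body(resp.read(), charset)


# Offline demonstration with bytes like those in the first output
sample = "找找看 - 博客园".encode("utf-8")
print(sample)                        # bytes repr, full of \x escapes
text = decode_body(sample)
print(text == "找找看 - 博客园")      # True: decoding restores the original text
```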
-----------@_@! Another Chinese-encoding question ----------------------------------------------------------
Question: "What if the URL itself contains Chinese characters? How should that be handled?"
Example: url = "http://zzk.cnblogs.com/s?w=python爬虫&t=b"
-----------------------------------------------------------------------------------------------------
- Next, let's solve the problem of Chinese characters in the URL!!!
(1) Test 1: keep the URL as it is and request it directly, with no processing
- Code example:
#python3.4
import urllib.request
url="http://zzk.cnblogs.com/s?w=python爬虫&t=b"
resp = urllib.request.urlopen(url)
print(resp.read().decode('UTF-8'))
- Output:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Traceback (most recent call last):
  File "E:/pythone_workspace/mydemo/spider/demo.py", line 9, in <module>
    response = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 463, in open
    response = self._open(req, data)
  File "C:\Python34\lib\urllib\request.py", line 481, in _open
    '_open', req)
  File "C:\Python34\lib\urllib\request.py", line 441, in _call_chain
    result = func(*args)
  File "C:\Python34\lib\urllib\request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python34\lib\urllib\request.py", line 1182, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python34\lib\http\client.py", line 1088, in request
    self._send_request(method, url, body, headers)
  File "C:\Python34\lib\http\client.py", line 1116, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python34\lib\http\client.py", line 973, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-16: ordinal not in range(128)

Process finished with exit code 1
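As expected, it fails!!! The last frame of the traceback shows http.client encoding the request line as ASCII, and the two Chinese characters cannot be encoded that way.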
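The "position 15-16" in the error message can be reproduced without any network access. The sketch below assumes the request line has the shape "GET <path> HTTP/1.1" (an assumption inferred from the traceback, not shown in the post) and uses a hypothetical helper, non_ascii_span, to locate the characters that ASCII cannot encode.

```python
def non_ascii_span(text):
    """Return (start, end) of the first non-ASCII run, or None for pure ASCII text."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc.start, exc.end  # end is exclusive
    return None


# Assumed shape of the request line built for the URL in Test 1
request_line = "GET /s?w=python爬虫&t=b HTTP/1.1"
print(non_ascii_span(request_line))  # (15, 17), i.e. characters 15 and 16
```

The span covers exactly the two characters of "爬虫", which matches the positions reported in the UnicodeEncodeError.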
(2) Test 2: percent-encode only the Chinese part
- Code example:
import urllib.request
import urllib.parse
url = "http://zzk.cnblogs.com/s?w=python" + urllib.parse.quote("爬虫") + "&t=b"
resp = urllib.request.urlopen(url)
print(resp.read().decode('utf-8'))
- Output:
- Result: with the Chinese part of the URL percent-encoded on its own, the page behind the URL can be fetched normally.
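To see what quote() actually produces in Test 2, the encoded value can be printed without sending a request. The second half of this sketch shows urllib.parse.urlencode as an alternative way to build the same query string; that alternative is an addition here, not something the original test used.

```python
import urllib.parse

keyword = "python爬虫"

# quote() leaves ASCII letters alone and percent-encodes the UTF-8 bytes of the rest
encoded = urllib.parse.quote(keyword)
print(encoded)  # python%E7%88%AC%E8%99%AB

# urlencode() builds the whole query string from key/value pairs
query = urllib.parse.urlencode([("w", keyword), ("t", "b")])
url = "http://zzk.cnblogs.com/s?" + query
print(url)  # http://zzk.cnblogs.com/s?w=python%E7%88%AC%E8%99%AB&t=b
```

Both approaches yield a URL that contains only ASCII characters, which is what urlopen needs.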
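------@_@! Yet another question-----------------------------------------------------------
- Question: what if the whole URL, Chinese and ASCII parts together, is passed through quote()? Can the page still be fetched?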
----------------------------------------------------------------------------------------
(3) So here comes Test 3: encode the whole URL, Chinese and ASCII parts together
- Code example:
#python3.4
import urllib.request
import urllib.parse
url = urllib.parse.quote("http://zzk.cnblogs.com/s?w=python爬虫&t=b")
resp = urllib.request.urlopen(url)
print(resp.read().decode('utf-8'))
- Output:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Traceback (most recent call last):
  File "E:/pythone_workspace/mydemo/spider/demo.py", line 21, in <module>
    resp = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 448, in open
    req = Request(fullurl, data)
  File "C:\Python34\lib\urllib\request.py", line 266, in __init__
    self.full_url = url
  File "C:\Python34\lib\urllib\request.py", line 292, in full_url
    self._parse()
  File "C:\Python34\lib\urllib\request.py", line 321, in _parse
    raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'http%3A//zzk.cnblogs.com/s%3Fw%3Dpython%E7%88%AC%E8%99%AB%26t%3Db'

Process finished with exit code 1
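- Result: ValueError! The page cannot be fetched!
- Putting tests 1, 2, and 3 together gives the following conclusions:
(1) In Python 3.4, Chinese characters in a URL can be handled with urllib.parse.quote("爬虫").
(2) The Chinese part of the URL has to be encoded on its own. Passing the whole URL to quote() with its default arguments also escapes ":", "?", "=" and "&", so urlopen no longer recognizes the "http:" scheme.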
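The behavior in Test 3 comes from quote()'s safe parameter, which defaults to "/". As a hedged sketch of one workaround, listing the URL's structural characters in safe lets the whole URL go through quote() in a single call. This only suits URLs like the example, where those characters never appear inside a parameter value.

```python
import urllib.parse

url = "http://zzk.cnblogs.com/s?w=python爬虫&t=b"

# Default behavior: only "/" is treated as safe, so ":", "?", "=" and "&" are escaped
default_quoted = urllib.parse.quote(url)
print(default_quoted)
# http%3A//zzk.cnblogs.com/s%3Fw%3Dpython%E7%88%AC%E8%99%AB%26t%3Db

# Keep the structural characters; only the Chinese text gets percent-encoded
safe_url = urllib.parse.quote(url, safe=":/?&=")
print(safe_url)
# http://zzk.cnblogs.com/s?w=python%E7%88%AC%E8%99%AB&t=b
```

The first result is the same string that appears in the ValueError above; the second is identical to the URL built by hand in Test 2.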
- Tip: to find out which parameters a function accepts
#python3.4
import urllib.request
help(urllib.request.urlopen)
- Running the code above prints the following to the console:
C:\Python34\python.exe E:/pythone_workspace/mydemo/spider/demo.py
Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object at 0x00A50490>, *, cafile=None, capath=None, cadefault=False, context=None)

Process finished with exit code 0
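When the parameter list is needed inside a program rather than as console text, the standard inspect module offers a programmatic counterpart to help(). This is a small sketch, not from the original post, and the exact keyword-only parameters it lists depend on the Python version in use.

```python
import inspect
import urllib.request

signature = inspect.signature(urllib.request.urlopen)

# Print each parameter with its default, or mark it as required
for name, param in signature.parameters.items():
    if param.default is inspect.Parameter.empty:
        print(name, "(required)")
    else:
        print(name, "default =", repr(param.default))

param_names = list(signature.parameters)
```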