Web Scraping with Python: Writing Your First Web Crawler

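To scrape a website, we first need to download the web pages that contain the data of interest, a process generally known as crawling. There are many ways to crawl a website, and which one is most suitable depends on the structure of the target site. This chapter first looks at how to download a web page safely, then introduces the following three common approaches to crawling a website:
- Crawling a sitemap
- Iterating over the database ID of each web page
- Following web page links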
Downloading a web page

To crawl a web page, we first need to download it. The example script below uses Python's urllib2 module to download a URL.
    import urllib2

    def download(url):
        # fetch the page and return its raw HTML
        return urllib2.urlopen(url).read()
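When passed a URL, this function downloads the web page and returns its HTML. The snippet has a problem, however: while downloading the page we may hit errors beyond our control, for example the requested page may not exist. In that case urllib2 raises an exception and the script exits. To be safe, here is a more robust version that catches these exceptions.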
    import urllib2

    def download(url):
        print 'Download:', url
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            # report the failure and signal it to the caller with None
            print 'Download error:', e.reason
            html = None
        return html
Now, when a download error occurs, the function catches the exception and returns None.
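One detail behind the except clause is worth spelling out. The sketch below uses Python 3's urllib.error (the counterpart of the urllib2 exceptions used here) and builds the exceptions by hand, so no network access is needed. The helper name describe_error and the example URL are illustrative assumptions, not part of the original script. It shows that an HTTP status error such as 404 is a subclass of URLError, which is why a single except clause catches both, and that only HTTP errors carry a code attribute.

```python
import io
import urllib.error


def describe_error(error):
    """Summarise a download error the way the except clause sees it."""
    # Only HTTP-level errors (404, 503, ...) have a status code;
    # lower-level failures such as connection errors do not.
    code = getattr(error, "code", None)
    return {"reason": str(error.reason), "code": code}


# Hand-built exceptions stand in for real failed downloads (assumption).
http_error = urllib.error.HTTPError(
    "http://example.com/missing", 404, "Not Found", None, io.BytesIO(b""))
network_error = urllib.error.URLError("connection refused")

# HTTPError is a subclass of URLError, so "except URLError" catches it too.
print(isinstance(http_error, urllib.error.URLError))

http_summary = describe_error(http_error)
network_summary = describe_error(network_error)
print(http_summary)
print(network_summary)
```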

Retrying downloads

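Errors encountered while downloading are often temporary, such as the 503 Service Unavailable error returned when a server is overloaded. For errors like this we can retry the download, since the server problem may have been resolved by then. We do not want to retry on every error, though. If the server returns 404 Not Found, the page does not currently exist, and repeating the same request is unlikely to produce a different result.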

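The Internet Engineering Task Force defines the full list of HTTP errors; see http://tools.ietf.org/html/rfc7231#section-6 for details. From that document we learn that 4XX errors occur when there is something wrong with the request, while 5XX errors occur when there is something wrong on the server side. So we only need to make sure the download function retries when a 5XX error occurs. Here is a new version of the code that supports retrying downloads.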
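The retry rule described above can be sketched as a small predicate. This is only a minimal illustration; the function name is_retryable is hypothetical and does not appear in the original code.

```python
def is_retryable(status_code):
    """Return True for 5XX (server-side) codes, False for everything else."""
    # A missing status code (None) means the failure was not an HTTP error.
    return status_code is not None and 500 <= status_code < 600


for code in (404, 500, 503, None):
    print(code, is_retryable(code))
```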
    # Python 2
    import urllib2

    def download(url, num_retries=2):
        print 'Download:', url
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            print 'Download error:', e.reason
            html = None
            if num_retries > 0:
                # hasattr() checks whether an object has the given attribute;
                # only HTTP errors carry a status code
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # recursively retry 5xx HTTP errors
                    return download(url, num_retries-1)
        return html

    # the Python 2 version above was probably written by Richard Lawson

    # Python 3
    import os
    import urllib.request
    import urllib.error

    def download(url, num_retries=2):
        print("Download:", url)
        try:
            html = urllib.request.urlopen(url).read()
        except urllib.error.URLError as e:
            print("Download error:", e.reason)
            html = None
            # a plain URLError has no code attribute, so check before reading it
            if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, num_retries-1)
        return html

    print(download("http://httpstat.us/500"))
    os.system("pause")  # Windows only: keeps the console window open
    # made in China
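Now, when the download function encounters a 5XX error code, it calls itself recursively to retry. The function also takes an extra parameter that sets the number of retries, which defaults to two. We limit the number of download attempts because the server error may not be resolved for some time. To test the function, try downloading http://httpstat.us/500, a site that always returns a 500 error code.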

Text similar to the following will be displayed:

    Download: http://httpstat.us/500
    Download error: Internal Server Error
    Download: http://httpstat.us/500
    Download error: Internal Server Error
    Download: http://httpstat.us/500
    Download error: Internal Server Error
    None
    Press any key to continue . . .
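As the output above shows, the download function behaves as expected: it first tries to download the page, then, after receiving the 500 error, retries twice before giving up.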
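The same retry behaviour can also be written as a loop instead of recursion. The sketch below is one possible variant in Python 3, not the book's code: the fetch parameter is a hypothetical hook added so that the logic can be exercised without network access, and always_500 is a stand-in for a server that keeps failing.

```python
import io
import urllib.error
import urllib.request


def download(url, num_retries=2, fetch=None):
    """Download url, retrying up to num_retries times on 5XX errors.

    fetch is a hypothetical hook for testing; by default the page is
    fetched with urllib.request.urlopen.
    """
    if fetch is None:
        fetch = lambda u: urllib.request.urlopen(u).read()
    # One initial attempt plus num_retries retries.
    for _ in range(num_retries + 1):
        print("Download:", url)
        try:
            return fetch(url)
        except urllib.error.URLError as e:
            print("Download error:", e.reason)
            code = getattr(e, "code", None)
            if code is None or not 500 <= code < 600:
                break  # not a server-side error, so retrying will not help
    return None


attempts = []


def always_500(url):
    """Stand-in for a server that always answers 500 (assumption)."""
    attempts.append(url)
    raise urllib.error.HTTPError(
        url, 500, "Internal Server Error", None, io.BytesIO(b""))


result = download("http://example.com/", fetch=always_500)
print(result, len(attempts))
```

With the default of two retries, the stand-in server is contacted three times in total before the function gives up, matching the behaviour of the recursive version.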

Setting a user agent

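By default, urllib2 downloads web content with the user agent Python-urllib/2.7, where 2.7 is the Python version number. It is better to use an identifiable user agent, which helps our crawler avoid some problems. In addition, perhaps because they have previously suffered server overload caused by poorly made Python crawlers, some websites block this default user agent. For example, when http://meetup.com/ is accessed with Python's default user agent, it currently returns the following access-denied message.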
    Access denied
    The owner of this website (www.meetup.com) has banned your access based on your browser's signature (1754134676cf0ac4-ua48).
    Ray ID: 1754134676cf0ac4
    Timestamp: Mon, 06-Oct-14 18:55:48 GMT
    Your IP address:
    Requested URL: www.meetup.com/
    Error reference number: 1010
    Server ID: FL_33F7
    User-Agent: Python-urllib/2.7
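The default user agent can be inspected directly. The snippet below is a small Python 3 illustration (where urllib2 has become urllib.request, so the version suffix will differ from 2.7). It reads the headers that a freshly built opener would send and makes no network request.

```python
import urllib.request

# A new opener starts with a single default header: the user agent.
opener = urllib.request.build_opener()
default_agent = dict(opener.addheaders)["User-agent"]

# The value starts with "Python-urllib/" followed by the Python version.
print(default_agent)
```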
Therefore, to make downloads more reliable, we need control over the user agent. The code below modifies the download function to set a default user agent of "wswp" (short for Web Scraping with Python).
    import urllib2

    def download(url, user_agent='wswp', num_retries=2):
        print 'Download:', url
        headers = {'User-agent': user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print 'Download error:', e.reason
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # recursively retry 5xx HTTP errors, keeping the same user agent
                    return download(url, user_agent, num_retries-1)
        return html
We now have a flexible download function that can be reused in later examples. It catches exceptions, retries downloads, and sets the user agent.
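For readers on Python 3, the request-building step might look like the sketch below. It is an adaptation rather than the original code: build_request is a hypothetical helper introduced here so that the header can be checked without sending anything over the network, and http://example.com/ is a placeholder URL.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent="wswp"):
    """Create a Request object carrying a custom User-agent header."""
    return urllib.request.Request(url, headers={"User-agent": user_agent})


def download(url, user_agent="wswp", num_retries=2):
    """Python 3 adaptation of the download function with a user agent."""
    print("Download:", url)
    request = build_request(url, user_agent)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print("Download error:", e.reason)
        html = None
        if num_retries > 0 and hasattr(e, "code") and 500 <= e.code < 600:
            # Retry 5XX errors, passing the user agent along unchanged.
            return download(url, user_agent, num_retries - 1)
    return html


# Inspect the header without making a request.
request = build_request("http://example.com/")
print(request.get_header("User-agent"))
```

Calling download on a real URL would then send wswp as the user agent. Note that under Python 3 the returned page is a bytes object rather than a str.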

Original article: https://www.cnblogs.com/callmebg/p/9324303.html
