Web Scraping Notes (1)
Key points:
1. Background research on the target site
    Check robots.txt and the Sitemap file (a robotparser sketch follows below)
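    A minimal sketch for honoring robots.txt with the Python 2 standard-library robotparser module (the URL and agent name are made-up examples):
        import robotparser

        rp = robotparser.RobotFileParser()
        rp.set_url('http://example.webscraping.com/robots.txt')
        rp.read()
        # True if this user agent is allowed to fetch the given URL
        print(rp.can_fetch('GoodCrawler', 'http://example.webscraping.com/view/1'))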
    Check which technologies the site is built with -- the builtwith module
        >>> import builtwith
        >>> builtwith.parse('http://example.webscraping.com')
    --> returns the technologies the site is built with
    pip install python-whois
    The WHOIS protocol can be used to look up who the registrant of a domain is; for example, this module can run a WHOIS query on the appspot.com domain and return its record.
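    A minimal query sketch with python-whois (the returned fields vary by registrar, so no sample output is shown):
        import whois

        # query the WHOIS record for a domain and print whatever fields come back
        print(whois.whois('appspot.com'))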
Crawling:
    # download a web page
    import urllib2

    def download(url):
        return urllib2.urlopen(url).read()
--> the function downloads the page and returns its HTML
Improved version (catch download errors):
    import urllib2

    def download(url):
        print('Downloading: ' + url)
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            # the download failed; report why and return None
            print(e.reason)
            html = None
        return html
Download errors:
1. Retrying downloads:
    Errors encountered while downloading are often temporary; for example, an
    overloaded server returns 503 Service Unavailable, and the same request may
    succeed on retry. If the server returns 404 Not Found, the page does not
    currently exist, so retrying is pointless.
    --> Response codes: https://tools.ietf.org/html/rfc7231#section-6
    import urllib2

    def download(url, num_retries=2):
        print('Downloading: ' + url)
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            print(e.reason)
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # recursively retry 5xx HTTP errors
                    return download(url, num_retries - 1)
        return html
Setting a user agent:
    def download(url, user_agent='wsp', num_retries=2):
        print('Downloading: ' + url)
        headers = {'User-agent': user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            # pass the Request object so the custom header is actually sent
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print(e.reason)
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # recursively retry 5xx HTTP errors
                    return download(url, user_agent, num_retries - 1)
        return html
---> disguises the crawler as a regular browser
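    Usage sketch (the agent string below is just an example):
        html = download('http://example.webscraping.com',
                        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64)')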
Regex matching (sitemap crawler):
    import re

    def crawl_sitemap(url):
        # download the sitemap XML and follow every <loc> link in it
        sitemap = download(url)
        links = re.findall('<loc>(.*?)</loc>', sitemap)
        for link in links:
            html = download(link)
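    Usage sketch (/sitemap.xml is the conventional location, assumed here):
        crawl_sitemap('http://example.webscraping.com/sitemap.xml')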
Iterating pages by URL pattern:
    import itertools

    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-%d' % page
        html = download(url)
        if html is None:
            # assume we have reached the last page
            break
        else:
            # success - scrape the page here
            pass
Improved version (tolerate a few consecutive failures before giving up):
    # maximum number of consecutive download errors allowed
    max_errors = 5
    # current number of consecutive download errors
    num_errors = 0
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-%d' % page
        html = download(url)
        if html is None:
            # received an error trying to download this webpage
            num_errors += 1
            if num_errors == max_errors:
                # too many consecutive failures - stop crawling
                break
        else:
            # success - reset the consecutive error counter
            num_errors = 0
Throttling (delay between downloads to the same domain):
    import time
    import datetime
    import urlparse

    class Throttle:
        """Delay downloads so the same domain is not hit too frequently."""
        def __init__(self, delay):
            # minimum seconds to wait between downloads to one domain
            self.delay = delay
            # timestamp of the last access to each domain
            self.domains = {}

        def wait(self, url):
            domain = urlparse.urlparse(url).netloc
            last_accessed = self.domains.get(domain)
            if self.delay > 0 and last_accessed is not None:
                sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).seconds
                if sleep_secs > 0:
                    # this domain was accessed recently, so pause first
                    time.sleep(sleep_secs)
            self.domains[domain] = datetime.datetime.now()

    # usage inside the crawl loop (headers/proxy belong to a fuller download()):
    throttle = Throttle(delay)
    throttle.wait(url)
    result = download(url, headers, proxy=proxy, num_retries=num_retries)
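    A self-contained loop sketch using the simpler download() defined above (the 5-second delay and URL list are assumptions):
        throttle = Throttle(5)
        for url in ['http://example.webscraping.com/view/1',
                    'http://example.webscraping.com/view/2']:
            throttle.wait(url)   # sleeps if this domain was hit too recently
            html = download(url)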
Crawler traps:
A crawler follows every link it has not visited before, but some sites generate page content dynamically, which can produce an unbounded number of pages.
Solution:
1. Record how many links were followed to reach the current page (its depth); once the maximum depth is reached, stop adding that page's links to the queue. Sketch below, with a fuller runnable version after it.
    def link_crawler(..., max_depth=2):
        # depth at which each URL was first discovered (the seed starts at 0)
        seen = {seed_url: 0}
        ...
        depth = seen[url]
        if depth != max_depth:
            # still below the depth limit, so queue this page's new links
            for link in links:
                if link not in seen:
                    seen[link] = depth + 1
                    crawl_queue.append(link)
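    A fuller, runnable sketch under assumed simplifications: the name link_crawler_sketch, the naive href regex, and the absolute-link filter are illustration choices, not an exact implementation.
        import re

        def link_crawler_sketch(seed_url, max_depth=2):
            crawl_queue = [seed_url]
            seen = {seed_url: 0}          # URL -> depth at which it was found
            while crawl_queue:
                url = crawl_queue.pop()
                html = download(url)
                if html is None:
                    continue
                depth = seen[url]
                if depth != max_depth:
                    for link in re.findall(r'href=["\'](.*?)["\']', html):
                        # keep only absolute links; resolving relative URLs is
                        # out of scope for this sketch
                        if link.startswith('http') and link not in seen:
                            seen[link] = depth + 1
                            crawl_queue.append(link)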