  • Python web scraping notes (1)

    Web scraping summary (1)

    Scraping notes:

    1. Website background investigation
    	robots.txt and Sitemap files
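    A quick way to honour robots.txt programmatically is the standard-library robotparser
    module (urllib.robotparser on Python 3); a minimal sketch, assuming the example site
    used throughout these notes:

    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url('http://example.webscraping.com/robots.txt')
    rp.read()
    # False means robots.txt disallows this user agent for the given URL
    print(rp.can_fetch('wsp', 'http://example.webscraping.com/view/'))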
    		
    The builtwith module can be used to check which technologies a website is built with:
        >>> import builtwith
        >>> builtwith.parse('http://example.webscraping.com')
        --> returns the technologies the site is built with
    
    pip install python-whois
    The WHOIS protocol can be used to find out who registered a domain; the python-whois
    module wraps it. For example, running a WHOIS query for the appspot.com domain with
    this module returns the registration record for that domain.
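    A minimal sketch of such a query with the python-whois package (the exact fields in
    the returned record depend on the registry):

    import whois  # installed via pip install python-whois

    record = whois.whois('appspot.com')
    print(record)  # parsed WHOIS record: registrar, name servers, creation date, etc.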
    

    Crawling:

    # download a web page
    import urllib2

    def download(url):
        return urllib2.urlopen(url).read()
    
    --> the function downloads the page and returns its HTML
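    The notes use the Python 2 urllib2 module throughout; purely as an adaptation (not
    part of the original notes), the same one-liner on Python 3 would use urllib.request:

    from urllib.request import urlopen

    def download(url):
        # Python 3 version: download the page and return its HTML
        return urlopen(url).read()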
    

    Improved version:

    import urllib2
    
    def download(url):
        print('Downloading:', url)
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            # return None when the download fails
            print('Download error:', e.reason)
            html = None
        return html
    

    Download errors:

    
    1. Retrying downloads:
    	Download errors are often temporary; for example, an overloaded server returns a
    	503 Service Unavailable error, and retrying later may succeed.
    	If the server returns 404 Not Found, the page does not currently exist and
    	retrying will not help.
    	
    --> Response codes: https://tools.ietf.org/html/rfc7231#section-6
    
    import urllib2
    
    def download(url, num_retries=2):
        print('Downloading:', url)
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError as e:
            print('Download error:', e.reason)
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # recursively retry 5xx HTTP errors
                    return download(url, num_retries-1)
        return html
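    To see the retry behaviour in action, request a URL that deliberately returns a 5xx
    status code; the httpstat.us test service used below is only an assumed example and
    is not mentioned in the original notes:

    # each 5xx response triggers another attempt, up to num_retries extra downloads
    html = download('http://httpstat.us/500')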
    

    Setting the user agent:

    def download(url, user_agent='wsp', num_retries=2):
        print('Downloading:', url)
        headers = {'User-agent': user_agent}
        request = urllib2.Request(url, headers=headers)
        try:
            html = urllib2.urlopen(request).read()
        except urllib2.URLError as e:
            print('Download error:', e.reason)
            html = None
            if num_retries > 0:
                if hasattr(e, 'code') and 500 <= e.code < 600:
                    # recursively retry 5xx HTTP errors
                    return download(url, user_agent, num_retries-1)
        return html
                            ---> sets a custom user agent so the crawler looks like a regular browser
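    Some servers block the default Python-urllib user agent, so a browser-like string can
    be passed instead; the value below is only an illustrative example:

    html = download('http://example.webscraping.com',
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64)')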
    

    Regex matching (crawling the sitemap):

    import re
    
    def crawl_sitemap(url):
        # download the sitemap file
        sitemap = download(url)
        # extract the sitemap links
        links = re.findall('<loc>(.*?)</loc>', sitemap)
        # download each link
        for link in links:
            html = download(link)
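    Assuming the example site publishes its sitemap at the conventional path (an
    assumption, not stated in the notes), the crawler would be started with:

    crawl_sitemap('http://example.webscraping.com/sitemap.xml')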
    

    Page URL pattern matching (iterating over IDs):

    import itertools
    
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-%d' % page
        html = download(url)
        if html is None:
            # assume the end has been reached once a download fails
            break
        else:
            # scrape the page here
            pass
    

    Improved version:

    # maximum number of consecutive download errors allowed
    max_errors = 5
    # current number of consecutive download errors
    num_errors = 0
    for page in itertools.count(1):
        url = 'http://example.webscraping.com/view/-%d' % page
        html = download(url)
        if html is None:
            # received an error trying to download this webpage
            num_errors += 1
            if num_errors == max_errors:
                # reached the maximum number of consecutive errors, so exit
                break
        else:
            # success - reset the error counter
            num_errors = 0
    

    Download throttling (delay between requests):

    import time
    import datetime
    import urlparse
    
    class Throttle:
        """Add a delay between downloads to the same domain"""
        def __init__(self, delay):
            # amount of delay between downloads for each domain
            self.delay = delay
            # timestamp of when each domain was last accessed
            self.domains = {}
    
        def wait(self, url):
            domain = urlparse.urlparse(url).netloc
            last_accessed = self.domains.get(domain)
            if self.delay > 0 and last_accessed is not None:
                sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).seconds
                if sleep_secs > 0:
                    # domain was accessed recently, so wait before downloading again
                    time.sleep(sleep_secs)
            self.domains[domain] = datetime.datetime.now()
    
    throttle = Throttle(delay)
    throttle.wait(url)
    # this call assumes an extended download() that also accepts headers and a proxy
    result = download(url, headers, proxy=proxy, num_retries=num_retries)
    

    Spider traps:

    A crawler follows every link it has not visited before, but some sites generate page content dynamically, which produces an unbounded number of pages.
    
    Solution:
    	1. Record how many links were followed to reach the current page, i.e. its depth. Once the maximum depth is reached, the crawler stops adding that page's links to the queue.
        
    def link_crawl(..., max_depth=2):
        seen = {}
        ...
        depth = seen[url]
        if depth != max_depth:
            # only add this page's links if the maximum depth has not been reached
            for link in links:
                if link not in seen:
                    seen[link] = depth + 1
                    crawl_queue.append(link)
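    A fuller sketch of how the depth check fits into a complete crawl loop; download() is
    the function defined earlier, while get_links() and the regex-based link extraction
    are assumptions added here for illustration:

    import re
    import urlparse

    def get_links(html):
        # return a list of links found in the html
        webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
        return webpage_regex.findall(html)

    def link_crawl(seed_url, max_depth=2):
        crawl_queue = [seed_url]
        seen = {seed_url: 0}  # map of url -> depth at which it was discovered
        while crawl_queue:
            url = crawl_queue.pop()
            html = download(url)
            if html is None:
                continue
            depth = seen[url]
            if depth != max_depth:
                # only keep following links while below the maximum depth
                for link in get_links(html):
                    # resolve relative links against the current page
                    link = urlparse.urljoin(url, link)
                    if link not in seen:
                        seen[link] = depth + 1
                        crawl_queue.append(link)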
    
  • Original post: https://www.cnblogs.com/shaozheng/p/12782966.html