  • Fetching web pages with multithreaded Python

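    Fetching web pages with Python
    The Python page-fetching program below is fairly basic: it only downloads the pages linked from each starting URL's first page. As long as you preset enough starting URLs, though, the set of pages you fetch can grow without limit. Here is the code: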


    # coding: utf-8
    '''
        Endless web page fetching
        @author wangbingyu
        @date 2014-06-26
    '''
    import re
    import threading
    import time
    import urllib.request


    class download(threading.Thread):
        '''Download thread class.'''

        def __init__(self, url, threadName):
            threading.Thread.__init__(self, name=threadName)
            self.thread_stop = False
            self.url = url

        def run(self):
            # Keep fetching the start page and its links until stop() is called.
            while not self.thread_stop:
                self.list = self.getUrl(self.url)
                self.downloading(self.list)

        def stop(self):
            self.thread_stop = True

        def downloading(self, urls):
            # Save every collected link; one failed download does not abort the rest.
            for link in urls:
                try:
                    # Adjust this path to your own download directory.
                    urllib.request.urlretrieve(link, r'E:uploaddownload\%s.html' % time.time())
                except Exception as ex:
                    print('_upload:', ex)

        def getUrl(self, url):
            result = []
            # read() returns bytes in Python 3, so decode before applying the regex.
            s = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
            ss = s.replace(' ', '')
            urls = re.findall('<a.*?href=.*?</a>', ss, re.I)
            for i in urls:
                # The href value is taken to be the first double-quoted string in the tag.
                tmp = i.split('"')
                try:
                    if tmp[1]:
                        if re.match(r'http://.*', tmp[1]):
                            result.append(tmp[1])
                except Exception as ex:
                    print(':getUrl', ex)
            return result


    if __name__ == '__main__':
        start_urls = ['http://www.baidu.com', 'http://www.qq.com', 'http://www.taobao.com', 'http://www.sina.com.cn']
        # Start one download thread per starting URL.
        for i, url in enumerate(start_urls):
            download(url, 'thread%s' % i).start()
        # obj = download('http://www.baidu.com', 'threadName')
        # obj.start()

    # Wait for Enter on the console.
    input()
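
The getUrl method above finds links by stripping spaces and splitting each tag on double quotes, so relative links and single-quoted href values are missed. As a rough alternative, here is a minimal sketch that uses the standard library's html.parser and urllib.parse.urljoin instead. The names LinkCollector and extract_links are made up for this illustration and are not part of the original program.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href=...> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this hook.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Resolve relative links against the page they were found on.
                absolute = urljoin(self.base_url, value)
                if absolute.startswith(('http://', 'https://')):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the http(s) links found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A call such as extract_links(page_text, 'http://www.sina.com.cn') could stand in for the regex and split logic, assuming page_text is the already decoded HTML of that page.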

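The program above fetches each start page again and again inside run() and never follows the collected links any deeper. If the goal is to reach further levels, one possible approach is a breadth-first crawl with a visited set and a depth limit. The sketch below is an assumption about how that could look, not the author's method: crawl, fetch, and max_depth are hypothetical names, and fetch is passed in as a parameter so the traversal can be tried without network access.

```python
import re
from collections import deque


def crawl(start_urls, fetch, max_depth=2):
    """Breadth-first crawl. fetch(url) must return the page HTML as a str.

    Returns the URLs that were fetched successfully, in visiting order.
    """
    seen = set()
    queue = deque()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    visited = []
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception as ex:
            print('fetch failed:', url, ex)
            continue
        visited.append(url)
        if depth >= max_depth:
            continue  # Do not expand links beyond the depth limit.
        # Same restriction as the original: only double-quoted http:// links.
        for link in re.findall(r'href="(http://[^"]+)"', html, re.I):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

For real pages, fetch could be something like lambda u: urllib.request.urlopen(u, timeout=10).read().decode('utf-8', 'ignore'). The visited set is what keeps the crawl from looping when two pages link to each other.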

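The program starts one thread per starting URL, so the thread count grows with the URL list. A bounded pool from concurrent.futures is one way to cap it. This is only a sketch under that assumption: fetch_all, fetch, and max_workers are hypothetical names, and fetch stands for whatever per-URL download function you use.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL on a bounded thread pool.

    Returns a dict mapping each URL to its result, or to the exception it raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as ex:
                # Record the failure instead of letting it end the whole batch.
                results[url] = ex
    return results
```

Because failures are stored per URL, the caller can decide afterwards which ones to retry.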

  • Original article: https://www.cnblogs.com/ldxsuanfa/p/10951568.html