Crawling the diameizi site with Python and downloading the images
Python environment: 2.7.3
Code: https://gist.github.com/zjjott/5270366
Author's discussion thread: http://tieba.baidu.com/p/2239765168?fr=itb_feed_jing#30880553662l
Target gallery to scrape: http://diameizi.diandian.com/
```python
# coding=utf-8
import os
import urllib2
from bs4 import BeautifulSoup

# One very simple command to spider the whole site tree -- honestly, a lazy
# shortcut: wget visits every page and logs every URL it finds into log.txt
os.system("wget -r --spider http://diameizi.diandian.com 2>|log.txt")

filein = open('log.txt', 'r')
fileout = open('dst', 'w+')  # a throwaway file to hold the final list of post URLs
filelist = list(filein)

header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) '
                  'Gecko/20100101 Firefox/8.0.1'}

def getsite(url):
    # Boilerplate fetch: request with a browser User-Agent, return the body
    req = urllib2.Request(url, None, header)
    site = urllib2.urlopen(req)
    return site.read()

try:
    dst = set()
    for p in filelist:
        if p.find('http://diameizi.diandian.com/post') > -1:
            p = p[p.find('http'):].strip()  # strip the trailing newline from the log line
            dst.add(p)
    i = 0
    for p in dst:
        # if i < 191:
        #     i += 1
        #     continue  # crude resume support: uncomment to skip pages already fetched
        pagesoup = BeautifulSoup(getsite(p))
        pageimg = pagesoup.find_all('img')
        for href in pageimg:
            print i, href['src']
            # The naming scheme (fixed slices of the URL) is dodgy... but works well enough
            picpath = "pic/" + href['src'][-55:-13] + href['src'][-4:]
            pic = getsite(href['src'])
            picfile = open(picpath, 'wb')
            picfile.write(pic)
            picfile.close()
            i += 1
finally:
    for p in dst:
        fileout.write(p + '\n')  # one post URL per line
    fileout.close()
```
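As the comment above admits, naming saved files by slicing fixed character offsets off the image URL is fragile. A sturdier alternative, sketched here with a hypothetical helper that is not part of the original script (the example URL is made up), is to take the last path component of the image URL:

```python
import os
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2, the post's environment

def pic_filename(src):
    # Use the final path component of the image URL as the local filename,
    # instead of slicing fixed character offsets of the whole URL string.
    return os.path.basename(urlparse(src).path)
```

This keeps the original filename from the server and does not break when the URL length changes.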
The log.txt produced above looks roughly like the following.
```
Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:10--  http://diameizi.diandian.com/
Resolving diameizi.diandian.com (diameizi.diandian.com)... 113.31.29.120, 113.31.29.121
Connecting to diameizi.diandian.com (diameizi.diandian.com)|113.31.29.120|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30502 (30K) [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2013-03-29 23:00:11--  http://diameizi.diandian.com/
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `diameizi.diandian.com/index.html'

 0K .......... .......... .........                94.6K=0.3s

2013-03-29 23:00:12 (94.6 KB/s) - `diameizi.diandian.com/index.html' saved [30502]

Loading robots.txt; please ignore errors.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/robots.txt
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 209 [text/plain]
Saving to: `diameizi.diandian.com/robots.txt'

 0K                                                100% 20.8M=0s

2013-03-29 23:00:12 (20.8 MB/s) - `diameizi.diandian.com/robots.txt' saved [209/209]

Removing diameizi.diandian.com/robots.txt.
Removing diameizi.diandian.com/index.html.
Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/rss
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/xml]
Remote file exists but does not contain any link -- not retrieving.
Removing diameizi.diandian.com/rss.
unlink: No such file or directory
Spider mode enabled. Check if remote file exists.
--2013-03-29 23:00:12--  http://diameizi.diandian.com/archive
Reusing existing connection to diameizi.diandian.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 82303 (80K) [text/html]
Remote file exists and could contain links to other resources -- retrieving.
```
The script then picks the post URLs it needs out of this log file.
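Instead of calling find() on each line, a single regex pass over the whole log is a slightly more robust way to pull out the post URLs. A minimal sketch, using a made-up fragment of wget output as sample input:

```python
import re

# Illustrative snippet of wget spider output (made up for this example)
sample_log = (
    "--2013-03-29 23:00:12--  http://diameizi.diandian.com/post/2013/123\n"
    "Reusing existing connection to diameizi.diandian.com:80.\n"
    "--2013-03-29 23:00:13--  http://diameizi.diandian.com/archive\n"
)

# One pass over the text, deduplicated with a set, mirroring the script's dst logic
post_urls = set(re.findall(r'http://diameizi\.diandian\.com/post\S*', sample_log))
```

The regex also strips the trailing newline for free, since `\S*` stops at whitespace.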
The code above has not yet run successfully, probably because of the 2.7.3 environment: the example it was adapted from appears to target Python 3.x, so there are some discrepancies.
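Since the version mismatch is the suspected culprit: under Python 3, the urllib2 calls move to urllib.request. A hedged sketch of what the getsite helper might look like there (same idea, not tested against the live site):

```python
import urllib.request

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:8.0.1) '
                  'Gecko/20100101 Firefox/8.0.1'}

def getsite(url):
    # Same idea as the Python 2 version: spoof a browser User-Agent, return the body
    req = urllib.request.Request(url, None, HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

The print statements and the BeautifulSoup constructor call would need the corresponding Python 3 adjustments as well.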