1. Open Han Han's blog article list page
http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html
The goal is to collect the hyperlinks to all of the articles.
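2. Characteristics of the links in Han Han's article list
Each article link is an <a> tag that starts with <a title and whose href value ends in .html: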
<a title target... href=....html>
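A minimal sketch of how one such tag can be picked apart with str.find. The fragment in `sample` is made up for illustration (it imitates the pattern above and is not copied from the live page), and the article ID in it is hypothetical.

```python
# Hypothetical fragment imitating one entry of the article list.
sample = ('<a title="" target="_blank" '
          'href="http://blog.sina.com.cn/s/blog_4701280b0000abcd.html">Title</a>')

title = sample.find('<a title')       # where the tag starts
href = sample.find('href=', title)    # first 'href=' at or after the tag start
html = sample.find('.html', href)     # first '.html' at or after 'href='

# Skip the 6 characters of 'href="' and keep the 5 characters of '.html'.
link = sample[href + 6:html + 5]
print(link)
```

find returns -1 when the substring is missing, which is why the full script checks all three positions before slicing.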
3. Technical points
·The string method find
·List slicing, list[-x:-y]
·File reading and writing
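·while loops
# coding: utf-8
# Python 3
import os
import time
import urllib.request

urls = []   # article links collected from every list page
page = 1
while page <= 7:
    con = urllib.request.urlopen(
        'http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html'
    ).read().decode('utf-8', 'ignore')
    title = con.find('<a title')
    href = con.find('href=', title)
    html = con.find('.html', href)
    i = 0
    # Collect at most 80 links per list page.
    while title != -1 and href != -1 and html != -1 and i < 80:
        # Skip 'href="' (6 characters) and keep '.html' (5 characters).
        urls.append(con[href + 6:html + 5])
        print(len(urls), ' ', urls[-1])
        i = i + 1
        title = con.find('<a title', html)
        href = con.find('href=', title)
        html = con.find('.html', href)
    else:
        print(page, 'find end!')
    page = page + 1

os.makedirs('blog', exist_ok=True)   # output directory must exist
j = 0
while j < len(urls):
    content = urllib.request.urlopen(urls[j]).read()
    # The last 26 characters of the URL are the file name.
    with open('blog/' + urls[j][-26:], 'wb') as f:
        f.write(content)
    j = j + 1
    time.sleep(1)   # pause between downloads
else:
    print('download article finished!')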
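The slicing and file-handling points can be tried in isolation, without touching the network. This is only a sketch: the URL is a made-up example in the same shape as the article links, the placeholder text stands in for a downloaded page, and a temporary directory is used instead of the blog/ folder.

```python
import os
import tempfile

# Made-up article URL in the same shape as the real links.
url = 'http://blog.sina.com.cn/s/blog_4701280b0000abcd.html'

filename = url[-26:]     # last 26 characters: the file name
stem = url[-26:-5]       # list[-x:-y] style: the same name without '.html'

# Write the (placeholder) page content to a file, then read it back.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, filename)
with open(path, 'w', encoding='utf-8') as f:
    f.write('<html>placeholder</html>')
with open(path, encoding='utf-8') as f:
    saved = f.read()

print(filename, stem, saved)
```

Using with closes the file as soon as the block ends, so the data is flushed to disk before the next download starts.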
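4. Implementation steps
·Open the first page of Han Han's blog article list in a browser
·Extract all of the article links from that first page
·Extract the article links from every article list page
·Download the HTML file behind every link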
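The two extraction steps above could be sketched as a pair of small functions. The names list_page_urls and extract_links are hypothetical helpers introduced here for illustration, the count of 7 list pages is taken from the script's loop bound, and sample_page is made-up test input rather than real page content.

```python
def list_page_urls(pages=7):
    """Build the URL of every article list page (1..pages)."""
    base = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_'
    return [base + str(page) + '.html' for page in range(1, pages + 1)]


def extract_links(con):
    """Return every article link found in the page text `con`."""
    links = []
    title = con.find('<a title')
    while title != -1:
        href = con.find('href=', title)
        html = con.find('.html', href)
        if href == -1 or html == -1:
            break
        links.append(con[href + 6:html + 5])
        title = con.find('<a title', html)   # continue after this link
    return links


# Made-up page text with two entries, used instead of a real download.
sample_page = (
    '<a title="" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_4701280b0000abcd.html">A</a>'
    '<a title="" target="_blank" '
    'href="http://blog.sina.com.cn/s/blog_4701280b0000abce.html">B</a>'
)
print(list_page_urls()[0])
print(extract_links(sample_page))
```

Keeping extraction separate from downloading makes it possible to check the link parsing on saved or hand-written text before sending any requests.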