
Fetching the complete article list

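import os
import time
import urllib.request

# Collect the article URLs from each of the 11 list pages
urls = []
for page in range(1, 12):
    list_url = 'http://blog.sina.com.cn/s/articlelist_1193111400_0_' + str(page) + '.html'
    # Decode leniently so a stray invalid byte does not abort the scan
    url_con = urllib.request.urlopen(list_url).read().decode('utf-8', 'ignore')

    # Locate the first article link: <a title=... href="....html"
    title = url_con.find('<a title=')
    href = url_con.find('href=', title)
    html = url_con.find('.html', href)

    i = 0
    while title != -1 and href != -1 and html != -1 and i < 40:
        # Skip the 6 characters of 'href="' and keep everything through '.html'
        urls.append(url_con[href + 6:html + 5])
        print(page, urls[-1])
        # Continue searching after the link just recorded
        title = url_con.find('<a title=', html)
        href = url_con.find('href=', title)
        html = url_con.find('.html', href)
        i += 1
    print(page, 'find end')
print('all find end !')
print('url sum:', len(urls))

# Download every article and save it under blog/
os.makedirs('blog', exist_ok=True)
for article_url in urls:
    content = urllib.request.urlopen(article_url).read()
    # The last 26 characters of the URL serve as the file name
    filename = article_url[-26:]
    with open('blog/' + filename, 'wb') as f:
        f.write(content)
    time.sleep(5)  # pause between requests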


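The code above collects the links to every article from the blog's list pages, reads the content of each article, and saves it as an HTML file.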
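The link-extraction step can also be tried offline. The sketch below is written for Python 3 and swaps the chained find() calls for a regular expression. The pattern, the helper names extract_article_urls and filename_for, and the sample markup are all assumptions made for illustration, not taken from the real list pages.

```python
import posixpath
import re
from urllib.parse import urlparse

# Hypothetical pattern: assumes each article link is written as
# <a title="..." ... href="....html">, which is what the scan above looks for.
ARTICLE_LINK = re.compile(r'<a title=[^>]*?href="([^"]+?\.html)"')


def extract_article_urls(page_html, limit=40):
    """Return up to `limit` article URLs found in one list page."""
    return ARTICLE_LINK.findall(page_html)[:limit]


def filename_for(url):
    """Use the last path segment of the URL as the local file name."""
    return posixpath.basename(urlparse(url).path)


# Made-up markup standing in for a downloaded list page
sample = (
    '<a title="first" target="_blank" href="http://example.com/s/blog_0001.html">first</a>'
    '<a href="http://example.com/other.html">not an article</a>'
    '<a title="second" href="http://example.com/s/blog_0002.html">second</a>'
)

found = extract_article_urls(sample)
for article_url in found:
    print(article_url, '->', filename_for(article_url))
```

Taking the file name from the URL path rather than from a fixed 26-character slice keeps the helper working if the article identifiers ever change length.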


Original article: https://www.cnblogs.com/y15821933792/p/7797211.html