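The crawler scripts found online no longer work, though I still want to thank their authors first. I'm lazy and don't like rewriting tutorials that others have already written, so I'm only posting my changes; for usage, go read the tutorials yourself.
I modified a version of my own that crawls properly: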
#!/usr/bin/env python
#coding=utf-8
#
# Openwrt Package Grabber
#
# Copyright (C) 2016 sohobloo.me
#

import urllib2
import re
import os
import time

# the url of package list page, end with "/"
baseurl = 'https://downloads.openwrt.org/snapshots/trunk/ramips/mt7620/packages/'
# which directory to save all the packages, end with "/"
# (named "timestamp" so the time module is not shadowed)
timestamp = time.strftime("%Y%m%d%H%M%S", time.localtime())
savedir = './' + timestamp + '/'

# capture href values, skipping links that start with "?"
pattern = r'<a href="([^?].*?)">'
cnt = 0

def fetch(url, path = ''):
    # cnt is incremented below, so it has to be declared global
    global cnt
    if not os.path.exists(savedir + path):
        os.makedirs(savedir + path)

    print 'fetching package list from ' + url + path
    content = urllib2.urlopen(url + path, timeout=15).read()
    items = re.findall(pattern, content)
    for item in items:
        if item == '../':
            # skip the parent directory link
            continue
        elif item.endswith('/'):
            # sub-directory: recurse into it
            fetch(url, path + item)
        else:
            cnt += 1
            print 'downloading item %d: '%(cnt) + path + item
            if os.path.isfile(savedir + path + item):
                print 'file exists, ignored.'
            else:
                rfile = urllib2.urlopen(baseurl + path + item)
                with open(savedir + path + item, "wb") as code:
                    code.write(rfile.read())

fetch(baseurl)
print 'done!'
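Changes:
1. Added a top-level root directory named after the current time
2. Changed the regex to filter out invalid links (those starting with a question mark)
3. Switched to crawling the directory structure recursively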
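A rough sketch of how the regex filter and the directory/file split behave, in case it helps. The HTML snippet is made up for illustration (it assumes an Apache-style index page whose sort links start with `?`), so the real listing may look different, and the `classify` helper is a hypothetical name that is not part of the script.

```python
import re

# Same pattern as in the script: capture href values that do not start with "?"
pattern = r'<a href="([^?].*?)">'

# Hypothetical listing HTML, assumed to resemble an Apache-style index page
sample = (
    '<a href="?C=N;O=D">Name</a>\n'
    '<a href="../">Parent Directory</a>\n'
    '<a href="base/">base/</a>\n'
    '<a href="Packages.gz">Packages.gz</a>\n'
)

def classify(html):
    """Split links into sub-directories and files, mirroring the script's loop."""
    dirs, files = [], []
    for item in re.findall(pattern, html):
        if item == '../':
            continue            # skip the parent directory link
        elif item.endswith('/'):
            dirs.append(item)   # the script recurses into these
        else:
            files.append(item)  # the script downloads these
    return dirs, files

dirs, files = classify(sample)
print(dirs)   # ['base/']
print(files)  # ['Packages.gz']
```

The `?C=N;O=D` sort link never shows up because `[^?]` rejects any href whose first character is a question mark.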
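Also, I'm happy that my Python knowledge is finally good for something. Yay.
I tried to update the screenshot and it failed; cnblogs looks like it's about to die.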