[Security] Crawler Journey (2): Douban Group Image Crawler

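Second post of the day! Inspired by the "豆瓣妹子" (Douban Meizi) website (search Baidu if you don't know it), I felt I had a duty to write one of my own!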

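After all, a program you write yourself suits your own taste, and it is more flexible too: you can change the group URL to crawl at any time and download pictures of a different flavor ~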

Straight to the program, written for Python 3.3.2:

# -*- coding: utf-8 -*-
# -----------------------------------------------
#   Program:  Douban group image crawler
#   Version:  1.0
#   Language: Python 3.3.2
#   Author:   RAUL
#   Usage:    enter the group's discussion-board URL, the start page, and the end page
#   Function: download the images posted in the group's topics
#   Note:     the save path is the author's local path; change it to suit your machine
# -----------------------------------------------

import re
import time
import urllib.request

def get_html_response(url):
    """Fetch a page and return its body decoded as UTF-8."""
    return urllib.request.urlopen(url).read().decode('utf-8')

# ------------ begin ----------------------------
# Example input:
# http://www.douban.com/group/Xsz/discussion?start=
# 1
# 2

url = input('Enter the Douban group URL, without the number after start=:\n')
page_bgn = int(input('Enter the first page number:\n'))
page_end = int(input('Enter the last page number:\n'))
# The start= offset advances by 25 per page, so page N begins at (N-1)*25
num_end = (page_end-1)*25
num_now = (page_bgn-1)*25

while num_now <= num_end:
    # Fetch the topic listing page
    html_topic_list = get_html_response(url+str(num_now))

    # Extract the topic URLs
    re_topic_list = r'http://www\.douban\.com/group/topic/\d+'
    topic_list = re.findall(re_topic_list,html_topic_list)

    # Visit each topic and download the images in it
    for topic_url in topic_list:
        print('topic_url '+topic_url)
        html_topic = get_html_response(topic_url)

        # Collect the image URLs in the topic (there may be several);
        # the non-greedy .+? keeps each match to a single image
        re_img_list = r'http://imgd\.douban\.com/view/group_topic/large/public/.+?\.jpg'
        img_list = re.findall(re_img_list,html_topic)

        # Save every image to the target directory
        for img_url in img_list:
            print('img_url: '+img_url)
            # The file name is the "p" + 7 digits part of the URL
            match = re.search(r'p\d{7}',img_url)
            if match is None:
                continue
            img_name = match.group(0)
            # Author's local save directory; change it to match your machine
            urllib.request.urlretrieve(img_url,r'D:\MySoftware\Python33\Code\Spider\girls\%s.jpg'%img_name)
            time.sleep(0.5)
    num_now = num_now + 25
else:
    print('Crawl finished!')
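One regex detail matters when a topic contains several images: a greedy `.+` in front of the `.jpg` suffix can swallow everything between the first and the last image on the same line. The sketch below compares the greedy and non-greedy forms, then pulls the "p plus seven digits" name out of each URL. The sample markup is made up rather than real Douban HTML, and the `image_name` helper is a hypothetical name used only for illustration.

```python
import re

# Made-up sample markup (an assumption); real Douban pages may look different.
sample_html = (
    '<img src="http://imgd.douban.com/view/group_topic/large/public/p1234567.jpg">'
    '<img src="http://imgd.douban.com/view/group_topic/large/public/p7654321.jpg">'
)

prefix = r'http://imgd\.douban\.com/view/group_topic/large/public/'

# Greedy: .+ runs to the LAST ".jpg" on the line, merging both images.
greedy = re.findall(prefix + r'.+\.jpg', sample_html)
# Non-greedy: .+? stops at the FIRST ".jpg", giving one match per image.
lazy = re.findall(prefix + r'.+?\.jpg', sample_html)

print(len(greedy))  # 1
print(len(lazy))    # 2


def image_name(img_url):
    """Return the 'p' + 7-digit ID found in an image URL, or None if absent."""
    match = re.search(r'p\d{7}', img_url)
    return match.group(0) if match else None


names = [image_name(u) for u in lazy]
print(names)  # ['p1234567', 'p7654321']
```

Returning `None` when no ID is found lets the caller skip an unexpected URL instead of crashing partway through a crawl.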

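The idea goes like this:

1. The input is the URL of a Douban group's discussion board (with the number after start= removed), e.g. http://www.douban.com/group/Xsz/discussion?start=

2. The program collects the URL of every post found at that address into a list, which gives the topic list

3. It walks the topic list, opens each topic, and gathers the list of all image URLs in that topic

4. It walks the image URL list and saves each image to the local disk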
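The four steps above can be sketched as small functions that are easy to try without touching the network. This is only a sketch under stated assumptions: `page_offsets`, `collect_image_urls`, and the injected `fetch` callable are hypothetical names, the `example.invalid` listing URL and the fake pages are invented for the demo, and skipping repeated topic links is an addition that is not in the script above.

```python
import re

PAGE_SIZE = 25  # step between start= offsets, as used by the script

TOPIC_RE = re.compile(r'http://www\.douban\.com/group/topic/\d+')
IMAGE_RE = re.compile(
    r'http://imgd\.douban\.com/view/group_topic/large/public/.+?\.jpg')


def page_offsets(page_bgn, page_end):
    """Step 1: turn 1-based page numbers into start= offsets."""
    return [(page - 1) * PAGE_SIZE for page in range(page_bgn, page_end + 1)]


def collect_image_urls(base_url, page_bgn, page_end, fetch):
    """Steps 2-4, minus the download: gather image URLs page by page.

    fetch is any callable mapping a URL to its HTML text (an assumption
    made so the logic can be exercised offline).
    """
    image_urls = []
    for offset in page_offsets(page_bgn, page_end):
        listing_html = fetch(base_url + str(offset))
        seen = set()  # addition: skip topic links repeated on one page
        for topic_url in TOPIC_RE.findall(listing_html):
            if topic_url in seen:
                continue
            seen.add(topic_url)
            image_urls.extend(IMAGE_RE.findall(fetch(topic_url)))
    return image_urls


# Offline demo with invented pages instead of real HTTP requests.
fake_pages = {
    'http://example.invalid/discussion?start=0':
        '<a href="http://www.douban.com/group/topic/111/">a</a>'
        '<a href="http://www.douban.com/group/topic/111/">again</a>',
    'http://www.douban.com/group/topic/111':
        '<img src="http://imgd.douban.com/view/group_topic/large/public/p1234567.jpg">',
}

urls = collect_image_urls(
    'http://example.invalid/discussion?start=', 1, 1, fake_pages.__getitem__)
print(urls)
```

For a real run, `fetch` could be a function like the script's `get_html_response`, and each returned URL would then be handed to `urllib.request.urlretrieve`.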

Feels pretty rewarding .. I can't keep going tomorrow though .. time to get cracking on ASP.NET

Original post: https://www.cnblogs.com/raul-ac/p/3502597.html