  • py3+urllib+re: a crawler that downloads images from Pengfu (pengfu.com)

    For the underlying approach and reasoning, see my other posts on crawler practice:

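    py3+urllib+bs4+anti-scraping workarounds, scraping Douban Meizi images in 20-odd lines of code: http://www.cnblogs.com/UncleYong/p/6892688.html
    py3+requests+json+xlwt, scraping Lagou job listings: http://www.cnblogs.com/UncleYong/p/6960044.html
    py3+urllib+re, easily scraping the winning numbers of the last 100 Shuangseqiu (double-color ball) lottery draws: http://www.cnblogs.com/UncleYong/p/6958242.html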

    The code is as follows:

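    import os
    import re
    import urllib.request

    # Fetch the HTML source of one listing page
    def page(pg):
        url = 'https://www.pengfu.com/index_%s.html' % pg
        # The page declares charset=utf-8 in its <meta> tag,
        # so decode the raw bytes as UTF-8 to get a str
        html = urllib.request.urlopen(url).read().decode('utf8')
        # print(html)
        return html

    # Extract the post titles
    def title(html):
        # The r prefix makes this a raw string, so backslashes are not treated as escapes
        reg = re.compile(r'<h1 class="dp-b"><a href=".*?" target="_blank">(.*?)</a>')
        item = re.findall(reg, html)
        # print(item)
        return item

    # Extract the image URLs
    def content(html):
        reg = r'<img src="(.*?)" width='
        item = re.findall(reg, html)
        # print(item)
        return item

    # Save one image as image/<name>.jpg
    def download(url, name):
        # urlretrieve does not create missing directories, so make sure the folder exists
        os.makedirs('image', exist_ok=True)
        # In Python 3 the name is already a str, so no gbk re-encoding is needed on Windows
        path = os.path.join('image', '%s.jpg' % name)
        urllib.request.urlretrieve(url, path)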
    
    for i in range(5, 9):  # pages 5 through 8
        html = page(i)
        title_list = title(html)
        content_list = content(html)
        # Pair each title with its image URL, in page order
        for m, n in zip(title_list, content_list):
            print('Downloading >>>>>: ' + m, n)
            download(n, m)
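
    To see what the two regular expressions in title() and content() capture without touching the network, the sketch below runs them on a hand-written HTML fragment. The fragment, the example.com URLs, and the names SAMPLE_HTML, TITLE_RE and IMG_RE are made up for illustration; the real page markup may differ.

```python
import re

# Hypothetical fragment imitating the markup the patterns expect;
# the real pengfu.com page may look different.
SAMPLE_HTML = '''
<h1 class="dp-b"><a href="/content_1.html" target="_blank">First joke</a></h1>
<img src="https://example.com/a.jpg" width="200">
<h1 class="dp-b"><a href="/content_2.html" target="_blank">Second joke</a></h1>
<img src="https://example.com/b.gif" width="200">
'''

# Same patterns as in title() and content()
TITLE_RE = re.compile(r'<h1 class="dp-b"><a href=".*?" target="_blank">(.*?)</a>')
IMG_RE = re.compile(r'<img src="(.*?)" width=')

titles = TITLE_RE.findall(SAMPLE_HTML)   # text captured between <a ...> and </a>
images = IMG_RE.findall(SAMPLE_HTML)     # value captured from the src attribute

# zip pairs items by position and stops at the shorter list
pairs = list(zip(titles, images))
for t, u in pairs:
    print(t, '->', u)
```

    Because zip pairs purely by position, a post with a title but no matching image (or the reverse) would shift every later pair on that page, so it may be worth comparing len(titles) and len(images) before downloading.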
    
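    download() uses the post title directly as the file name and always appends .jpg. A title containing characters such as ? or / cannot be used as a file name on Windows, and some images may not be JPEGs. The following is a minimal sketch of one way to handle both cases; safe_filename and target_path are hypothetical helpers, not part of the original code.

```python
import os
import re
from urllib.parse import urlparse

def safe_filename(name, replacement='_'):
    """Replace characters Windows forbids in file names (hypothetical helper)."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, name).strip()
    # Fall back to a fixed name if nothing usable is left
    return cleaned or 'untitled'

def target_path(url, name, folder='image'):
    """Build the local path, keeping the extension found in the URL path."""
    ext = os.path.splitext(urlparse(url).path)[1] or '.jpg'  # assume .jpg if none
    return os.path.join(folder, safe_filename(name) + ext)

print(target_path('https://example.com/pic/a.gif?x=1', 'Why?'))
```

    Inside download(), the path line could then become path = target_path(url, name).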

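    page() calls urlopen with no timeout and no error handling, so a single failed request stops the whole run. Below is a hedged variant that sends a browser-like User-Agent, sets a timeout, and returns None on failure. Whether the site actually requires a User-Agent is an assumption; the header string, the timeout value, and the build_request helper are illustrative only.

```python
import urllib.request

# Assumed values, not taken from the original post
USER_AGENT = 'Mozilla/5.0 (compatible; pengfu-demo/0.1)'
TIMEOUT_SECONDS = 10

def build_request(pg):
    """Create the Request for listing page pg (hypothetical helper)."""
    url = 'https://www.pengfu.com/index_%s.html' % pg
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})

def page(pg):
    """Return the decoded HTML of page pg, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(pg), timeout=TIMEOUT_SECONDS) as resp:
            # errors='replace' avoids a crash on stray non-UTF-8 bytes
            return resp.read().decode('utf8', errors='replace')
    except OSError as exc:  # urllib.error.URLError and timeouts are OSError subclasses
        print('Failed to fetch page %s: %s' % (pg, exc))
        return None
```

    With this variant the main loop would need to skip a page when page(i) returns None.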
  • Original post: https://www.cnblogs.com/uncleyong/p/6973887.html