Scraping images with Python

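When scraping a web page with Python's requests library, you normally read text through the response's text attribute. To fetch an image and save it, use content instead, which holds the raw bytes of the response body.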
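
A small sketch of why the distinction matters, using only the standard library. Decoding image bytes into a string, which is roughly what text does (as far as I know, requests decodes the body and replaces invalid bytes), is lossy. The eight-byte PNG signature below stands in for a real image body, and UTF-8 is an assumed encoding chosen for illustration.

```python
# The first eight bytes of every PNG file (the PNG signature).
png_signature = b"\x89PNG\r\n\x1a\n"

# Roughly what reading `.text` amounts to: decode the raw bytes into a str,
# replacing anything that is not valid in the chosen encoding.
as_text = png_signature.decode("utf-8", errors="replace")

# Encoding the string again does not give back the original bytes,
# because the invalid byte 0x89 was replaced during decoding.
round_tripped = as_text.encode("utf-8")

print(round_tripped == png_signature)
```

Saving the bytes from content avoids this decode step entirely.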

For example, scraping the images from jandan.net:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import re

import requests

url = "http://jandan.net/ooxx"
s = requests.Session()
# Browser-like headers. Host is left out so requests sets it per request
# (the images may live on another host), and sdch is dropped from
# Accept-Encoding because requests cannot decode it.
header_jandan = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Referer': 'http://jandan.net/ooxx',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
}
resp = s.get(url, headers=header_jandan, timeout=10)
# A very short body is treated as a failed fetch, so request the page once more
if len(resp.text) < 1500:
    resp = s.get(url, headers=header_jandan, timeout=10)
text = resp.text
# Collect the src value of every <img src="..."> tag
img_url = re.findall(r'(?<=<img src=").*?(?=")', text)
d = os.getcwd()
for i in img_url:
    file = i.split("/")[-1]
    # Only protocol-relative URLs ("//host/path") are downloaded; give them a scheme
    if not i.startswith("http"):
        url_img = "http:" + i
        r_img = s.get(url_img, headers=header_jandan, timeout=10)
        # Write the raw bytes; the with-block closes the file afterwards
        with open(os.path.join(d, file), 'wb+') as f:
            f.write(r_img.content)
        print("write %s" % file)
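
The URL handling in the script can also be sketched with the standard library alone. The helpers below are hypothetical (image_urls, file_name and the sample HTML are not from the original post). They use urljoin to give protocol-relative links a scheme and urlsplit to drop any query string before taking the file name. The regular expression is a rough match for img tags, not a real HTML parser.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Rough pattern for <img ... src="...">; an assumption, not a full HTML parser.
IMG_SRC = re.compile(r'<img\s[^>]*?src="([^"]+)"')


def image_urls(html, page_url):
    """Return absolute URLs for every <img src="..."> found in `html`."""
    # urljoin resolves "//host/path" and relative paths against the page URL.
    return [urljoin(page_url, src) for src in IMG_SRC.findall(html)]


def file_name(url):
    """Return the last path component of `url`, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


# Made-up sample markup, for illustration only.
sample = ('<p><img src="//img.example.com/a/1.jpg">'
          '<img src="http://img.example.com/b/2.png?x=1"></p>')

for u in image_urls(sample, "http://example.com/page"):
    print(u, "->", file_name(u))
```

Unlike the script above, this version keeps links that are already absolute instead of skipping them.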

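A large image can take a while to download, so pass a timeout to keep a stalled request from hanging indefinitely. In requests the timeout limits the wait for the connection and for each chunk of data, not the total download time, so on its own it does not guarantee that the file is complete.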
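
A minimal sketch of a fetch helper along these lines. The function name fetch_image and its default values are my own choices, and session is assumed to be a requests.Session (anything with a compatible get method works). The timeout is passed as a (connect, read) tuple, and raise_for_status turns an HTTP error into an exception so that an error page is not saved as an image.

```python
def fetch_image(session, url, connect_timeout=5, read_timeout=30):
    """Fetch `url` and return the raw response body as bytes.

    `session` is assumed to behave like requests.Session. The timeout tuple
    is (connect, read): the read value limits the wait between chunks of
    data, not the duration of the whole download.
    """
    resp = session.get(url, timeout=(connect_timeout, read_timeout))
    # Raise on 4xx/5xx answers instead of returning an error page.
    resp.raise_for_status()
    return resp.content
```

With the script above this would be called as fetch_image(s, url_img). A timeout or HTTP error then surfaces as an exception that the caller can catch or let stop the run.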

The data written to the file is r_img.content.

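The file is opened in wb+ mode: binary, with any existing file truncated before writing. The + also allows reading, which is not needed here, so plain wb is enough.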
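
The write step can be sketched on its own as follows. save_bytes is a hypothetical helper name, and the demo writes to a temporary directory rather than the current working directory.

```python
import os
import tempfile


def save_bytes(data, directory, name):
    """Write `data` to directory/name in binary mode and return the path."""
    path = os.path.join(directory, name)
    # "wb" truncates an existing file; the with-block closes it on exit.
    with open(path, "wb") as fh:
        fh.write(data)
    return path


# Demo in a throwaway directory, using the PNG signature as stand-in data.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_bytes(b"\x89PNG\r\n\x1a\n", tmp, "demo.png")
    print(os.path.basename(saved), os.path.getsize(saved))
```

Because the mode truncates, downloading the same file name twice leaves only the latest copy on disk.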

Original article: https://www.cnblogs.com/taurusfy/p/7158801.html