urllib.request.urlretrieve() downloads a page directly to a local file
urllib.request.urlcleanup() clears the cache left behind by urlretrieve()
info() returns the header information of the current response
getcode() returns the status code: 200 means the crawl succeeded, 403 means access is forbidden
geturl() returns the URL that was actually fetched
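A minimal sketch tying these calls together (the save path "1.html" is just a placeholder):

import urllib.request

# urlopen() returns a response object exposing info()/getcode()/geturl()
file = urllib.request.urlopen("http://www.baidu.com")
print(file.info())     # response headers
print(file.getcode())  # 200 on a normal crawl
print(file.geturl())   # the URL that was actually fetched

# urlretrieve() saves the page straight to disk, then urlcleanup() clears its cache
urllib.request.urlretrieve("http://www.baidu.com", filename="1.html")
urllib.request.urlcleanup()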
timeout sets a request timeout, for example:
import urllib.request
file = urllib.request.urlopen("http://www.baidu.com", timeout=1)
Debugging when an exception occurs:
import urllib.request

for i in range(0, 100):
    try:
        file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
        data = file.read()
        print(len(data))
    except Exception as e:
        print("Exception occurred: " + str(e))
Building a search URL from a keyword (an ASCII keyword such as "python" needs no extra encoding):
import urllib.request

keywd = "python"
url = "http://www.baidu.com/s?wd=" + keywd
req = urllib.request.Request(url)
data = urllib.request.urlopen(req).read()
fh = open("C:/Users/***/Desktop/***/2.html", "wb")
fh.write(data)
fh.close()
Handling Chinese keywords: percent-encode them with quote() first.

import urllib.request

keywd = "爬虫"
keywd = urllib.request.quote(keywd)  # Chinese must be percent-encoded before it goes into a URL
url = "http://www.baidu.com/s?wd=" + keywd
req = urllib.request.Request(url)
data = urllib.request.urlopen(req).read()
fh = open("C:/Users/吕秋玉/Desktop/文献阅读/3.html", "wb")
fh.write(data)
fh.close()

Construct the URL, then turn the URL into a Request object.
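For reference, quote() simply percent-encodes the UTF-8 bytes of the keyword:

>>> import urllib.request
>>> urllib.request.quote("爬虫")
'%E7%88%AC%E8%99%AB'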
Handling crawler exceptions
URLError is raised when:
1. The server cannot be reached
2. The remote URL does not exist
3. The local machine has no network connection
4. A corresponding HTTPError subclass is triggered (HTTPError is a subclass of URLError)
Handling the exception:
import urllib.error
import urllib.request

try:
    urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
    # catching URLError also covers its HTTPError subclass;
    # hasattr() guards attributes that only HTTPError is guaranteed to carry
    if hasattr(e, "code"):
        print(e.code)
    if hasattr(e, "reason"):
        print(e.reason)
Browser disguise for crawlers (User-Agent spoofing):
import urllib.request

url = "http://news.sina.com.cn/"
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
# the file is opened in binary mode, so the raw bytes are written without decoding
fh = open("C:/Users/***/Desktop/***/5.html", "wb")
fh.write(data)
fh.close()
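An equivalent sketch using the standard Request() API, which takes the same header as a dict instead of patching the opener:

import urllib.request

url = "http://news.sina.com.cn/"
req = urllib.request.Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299"
})
data = urllib.request.urlopen(req).read()

Either form works; install_opener() (used in the later examples) is needed when urlretrieve() should also send the header.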
News crawler:
import urllib.request
import urllib.error
import re

data = urllib.request.urlopen("http://news.sina.com.cn/").read()
data2 = data.decode("utf-8", "ignore")
pat = 'href="(http://news.sina.com.cn/.*?)"'
allurl = re.compile(pat).findall(data2)
for i in range(0, len(allurl)):
    try:
        print("Crawl number " + str(i))
        thisurl = allurl[i]
        file = "C:/Users/***/Desktop/***/sinanews/" + str(i) + ".html"
        urllib.request.urlretrieve(thisurl, file)
        print("----success----")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
Exercise: crawl all of the data on the Sina news homepage (http://news.sina.com.cn/) to local files.
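A minimal sketch of one approach, assuming a local sina/ folder already exists (the link pattern is deliberately broad so more of the page's data is captured):

import urllib.request
import re

homepage = "http://news.sina.com.cn/"
# save the homepage itself, then every absolute link it contains
urllib.request.urlretrieve(homepage, filename="sina/index.html")
data = urllib.request.urlopen(homepage).read().decode("utf-8", "ignore")
links = re.compile('href="(http://.*?)"').findall(data)
for i, link in enumerate(links):
    try:
        urllib.request.urlretrieve(link, filename="sina/" + str(i) + ".html")
    except Exception as e:
        print("skipped " + link + ": " + str(e))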
import urllib.request
import re

url = "http://blog.csdn.net/"
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # so urlretrieve() below also sends the header
data = opener.open(url).read()
data = data.decode("utf-8", "ignore")
pat = '<a strategy="recommend" href="(.*?)"'
result = re.compile(pat).findall(data)
for i in range(0, len(result)):
    file = "C:/Users/***/Desktop/****/6" + str(i) + ".html"
    urllib.request.urlretrieve(result[i], filename=file)
    print("Crawl " + str(i + 1) + " succeeded")
Anti-blocking technique: proxy servers (free proxy lists are published on sites such as Xici/西刺):
import urllib.request

def use_proxy(url, proxy_addr):
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    return data

proxy_addr = "49.67.53.143:808"
url = "http://www.baidu.com"
data = use_proxy(url, proxy_addr)
print(len(data))
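Free proxies go offline often, so a hedged extension is to rotate through several candidates, reusing the use_proxy() defined above (the addresses below are placeholders, not known-good proxies):

proxy_list = ["49.67.53.143:808", "122.114.31.177:808"]  # placeholder addresses
for addr in proxy_list:
    try:
        data = use_proxy("http://www.baidu.com", addr)
        print(len(data))
        break  # stop at the first proxy that responds
    except Exception as e:
        print("proxy " + addr + " failed: " + str(e))

Adding a timeout to the urlopen() call inside use_proxy() makes dead proxies fail fast instead of hanging.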
An image crawler follows:
import urllib.request
import re

keywd = "连衣裙"
key = urllib.request.quote(keywd)
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)
for i in range(1, 50):
    # the quoted keyword belongs in the q= search parameter; s= is the result offset,
    # which must be concatenated in rather than left inside the string literal
    url = "https://s.taobao.com/list?q=" + key + "&cat=50008898&bcoffset=12&s=" + str(i * 60)
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    pat = 'pic_url":"//(.*?)"'
    imagelist = re.compile(pat).findall(data)
    for j in range(0, len(imagelist)):
        thisimage = imagelist[j]
        thisimageurl = "http://" + thisimage
        file = "C:/Users/吕秋玉/Desktop/文献阅读/image/" + str(i) + str(j) + ".jpg"
        urllib.request.urlretrieve(thisimageurl, filename=file)