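To fetch the pages we use requests. Most sites now load their content dynamically, so we opened the Firefox developer tools (F12) and looked for the URL behind the dynamic requests: http://www.meilishuo.com/aj/shop_list/goods?frame=1&page=0&shop_id=1001072849. Only the frame parameter changes from one request to the next, so requesting this URL with successive frame values is enough. The request headers also carry an nt parameter whose value changes; our guess is that it varies with time and expires after a while. The remaining problem is how to obtain the first nt value. It turns out to be embedded in the page returned by http://www.meilishuo.com/shop/1001072849, so that settles it.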
# -*- coding: utf-8 -*-
import re

import requests
import simplejson

if __name__ == "__main__":
    # Reuse one session so cookies set by the shop page are sent with the AJAX requests
    session = requests.Session()
    search_header = {
        'Host': 'www.meilishuo.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'X-Requested-With': 'XMLHttpRequest',  # marks the request as an AJAX call
        'Referer': 'http://www.meilishuo.com/shop/1001072849',
        'Connection': 'keep-alive',
    }
    response = session.get('http://www.meilishuo.com/shop/1001072849?frm=rate_to_shop')
    html = response.text
    info = re.search(r'"nt":"(.+?)",', html)
    search_header['nt'] = info.group(1)  # add the nt value to the request headers
    # Item info embedded in the static page
    info1 = re.search(
        r'<script>Meilishuo\.config\.poster0 = (.+?);fml\.vars\.notFluid = true;</script>',
        html)
    b = simplejson.loads(info1.group(1))
    totalNum = b['totalNum']  # total item count, used to derive the number of frames
    page = int(totalNum) // 20
    for i in range(page + 1):
        a = session.get(
            'http://www.meilishuo.com/aj/shop_list/goods?frame=' + str(i) + '&page=0&shop_id=1001072849',
            headers=search_header)
        print(a.headers)
        j_a = simplejson.loads(a.text)
        print(len(j_a['tInfo']))
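The script above derives the number of frames from totalNum by integer division and then requests one more. When totalNum is an exact multiple of 20, that last request asks for a frame with no items. A minimal sketch of an alternative using ceiling division follows. It assumes 20 items per frame, which is inferred from the division by 20 in the script and not confirmed by the site; frame_urls and ITEMS_PER_FRAME are hypothetical names introduced here for illustration.

```python
import math

# Assumption: 20 items per frame, taken from the "/20" in the script above.
ITEMS_PER_FRAME = 20
BASE_URL = 'http://www.meilishuo.com/aj/shop_list/goods'


def frame_urls(total_num, shop_id):
    """Return one URL per frame needed to cover total_num items."""
    # Ceiling division: 40 items -> 2 frames, 41 items -> 3 frames.
    frames = math.ceil(int(total_num) / ITEMS_PER_FRAME)
    return [
        '{}?frame={}&page=0&shop_id={}'.format(BASE_URL, i, shop_id)
        for i in range(frames)
    ]
```

Each returned URL can then be passed to the request call in the loop in place of the string concatenation.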
To be continued. The next step is to take each item's image URL and save the image to a local file:
for key in j_a['tInfo']:
    r = requests.get(key['goods_img'])
    with open(key['goods_title'] + ".jpg", "wb") as title:
        title.write(r.content)
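Using goods_title directly as a file name will fail if a title contains characters such as / or :, which are not allowed in file names on some systems. A minimal sketch of one way to clean the title first is shown below; safe_filename is a hypothetical helper, and the set of replaced characters and the length limit are assumptions, not anything the site specifies.

```python
import re


def safe_filename(title, ext='.jpg', max_len=100):
    """Turn an item title into a string that is usable as a file name."""
    # Replace characters that are invalid on Windows or act as path separators.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip()
    # Truncate long titles and fall back to a placeholder for empty ones.
    cleaned = cleaned[:max_len] or 'untitled'
    return cleaned + ext
```

In the loop above this would be used as open(safe_filename(key['goods_title']), "wb"). Two items with the same title would still overwrite each other, so appending an item id or a counter may be worth considering.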