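1. Analyze the page
Page URL: http://touch.qunar.com/
The goal is to scrape the free-travel (自由行) channel under the Vacations (度假) section.
For a given city, the browser's network panel shows that the data is fetched by an XHR request:
request.url: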
https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search
The URL is assembled from query parameters. Every sequence starting with % is a percent-encoded (URL-escaped) Chinese string.
Decoded URL:
https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=广州&query=厦门自由行&dappDealTrace=false&mobFunction=扩展自由行&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=厦门自由行&limit=0,24&includeAD=true&qsact=search
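The decoding step can be reproduced with the standard library. This is a minimal sketch: the URL below is shortened to three of the parameters from the request above purely for readability, and the variable names are illustrative.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Shortened version of the captured request URL (only dep, query and limit kept)
encoded_url = (
    "https://touch.dujia.qunar.com/list"
    "?dep=%E5%B9%BF%E5%B7%9E"
    "&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C"
    "&limit=0,24"
)

# unquote() turns the %XX sequences back into readable text
decoded_url = unquote(encoded_url)

# parse_qs() splits the query string and decodes each value
params = parse_qs(urlsplit(encoded_url).query)

print(decoded_url)
print(params["dep"][0], params["query"][0], params["limit"][0])
```

Seeing the parameters as a dictionary makes it easier to spot which ones change between requests.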
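Breaking the URL down:
dep: the departure city (I am in Guangzhou, so the site located me there)
query and originalquery: the destination
(So changing only the departure and destination parameters is enough to traverse all products; each departure/destination combination returns different data.)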
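Since only the departure and destination change, the URL can be built from a parameter dictionary instead of string concatenation. The sketch below is one possible way to do it; `build_list_url` and its arguments are hypothetical names, and the fixed parameter values are copied from the captured request. Commas are left unescaped here, whereas the captured URL encodes the ones in `modules` as %2C; the assumption is that the server treats both forms the same.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://touch.dujia.qunar.com/list"


def build_list_url(dep, destination, offset=0, page_size=24):
    """Build a list URL for one departure/destination pair (hypothetical helper)."""
    params = {
        "modules": "list,bookingInfo,activityDetail",
        "dep": dep,
        "query": destination,
        "dappDealTrace": "false",
        "mobFunction": "扩展自由行",
        "cfrom": "zyx",
        "it": "dujia_hy_destination",
        "date": "",
        "needNoResult": "true",
        "originalquery": destination,
        "limit": f"{offset},{page_size}",
        "includeAD": "true",
        "qsact": "search",
    }
    # quote_via=quote percent-encodes the Chinese text; safe="," keeps commas literal
    return BASE_URL + "?" + urlencode(params, quote_via=quote, safe=",")


list_url = build_list_url("广州", "厦门自由行")
print(list_url)
```

Letting `urlencode` handle the escaping avoids encoding each value by hand.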
Opening the URL directly in a browser shows the raw data it returns.
2. Get the departure (dep) parameter values
Request URL: https://touch.dujia.qunar.com/p/public/dep
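# Fetch the departure-city values
import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'
response = requests.get(url)
# print(response.text)
dep_data = response.json()  # renamed from `dict` to avoid shadowing the built-in

# Iterate over every group under 'data' and print each departure city in it
for group in dep_data['data']:
    for dep in dep_data['data'][group]:
        print(dep)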
Running it prints each departure city, one per line.
3. Get the destination values for each departure city

import requests

url = 'https://touch.dujia.qunar.com/depCities.qunar'
response = requests.get(url)
# print(response.text)
dep_data = response.json()

# Walk through every departure city
for group in dep_data['data']:
    for dep in dep_data['data'][group]:
        print(dep)
        link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(dep)
        dest_data = requests.get(link_url).json()
        destinations = []
        # Collect the destination values
        for module in dest_data['data']:
            for sub_module in module['subModules']:
                for item in sub_module['items']:
                    city = item['query']
                    # Skip duplicates
                    if city not in destinations:
                        destinations.append(city)
        print(destinations)
As the output shows, a single departure city maps to many destinations.
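The nested traversal and de-duplication can also be pulled into a small function and tried without any network access. This is a sketch under assumptions: `extract_destinations` is a hypothetical helper, and the sample payload only imitates the data / subModules / items / query nesting that the code above relies on; it is not a real response.

```python
def extract_destinations(payload):
    """Return the unique 'query' values in first-seen order (hypothetical helper)."""
    seen = {}  # dicts keep insertion order, so this de-duplicates without reordering
    for module in payload.get("data", []):
        for sub_module in module.get("subModules", []):
            for item in sub_module.get("items", []):
                query = item.get("query")
                if query:
                    seen.setdefault(query, None)
    return list(seen)


# Hand-written sample that mirrors the assumed nesting
sample = {
    "data": [
        {
            "subModules": [
                {"items": [{"query": "厦门"}, {"query": "三亚"}]},
                {"items": [{"query": "厦门"}]},
            ]
        }
    ]
}

print(extract_destinations(sample))
```

Using `.get()` with defaults means a module that lacks `subModules` or `items` is skipped instead of raising `KeyError`.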
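4. Get the product list
With the dep and query values in hand, the next step is to request the data that the page loads as JSON. Two things matter here: how the request URL changes, and the routeCount value used for paging: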
https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search
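The limit parameter changes with every request: its offset advances in multiples of 24. Reading routeCount from the response (the code takes it from data.limit.routeCount) determines which URLs to request for the remaining pages.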
import random
import time
from urllib.parse import quote

import requests

# List endpoint template: dep, query, originalquery and the limit offset are filled in per request
LIST_URL = 'https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep={}&query={}%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery={}&limit={},24&qsact=scroll'

headers = {
    'cookie': 'QN48=tc_e1b5f5bb4d76a018_16730073949_ad75; csrfToken=d27163582839d6b8cbcb53110ed67077; QN300=organic; QN1=ezu0pVvzuB9qeVd2w90fAg==; _RF1=119.129.117.7; _RSG=AZ4soQG2oI5YMrcq1P6et8; _RDG=283bf2bcd3461d22ef1d94f9276d7c9b85; _RGUID=54b20906-b2d8-48ca-8de8-1990749b55a2; QN205=organic; QN234=home_free_t; _pk_ref.1.8600=%5B%22%22%2C%22%22%2C1542699072%2C%22http%3A%2F%2Ftouch.qunar.com%2F%22%5D; _pk_ses.1.8600=*; QN57=15427010307400.44337198739421924; QN58=1542701030742%7C1542701078367%7C4; QN233=dujia_hy_destination; _pk_id.1.8600=5f2ca9d25160d431.1542699072.1.1542705039.1542699072.; QN243=165',
    'referer': 'https://touch.dujia.qunar.com/p/list?dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&et=&it=dujia_hy_destination',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
}

url = 'https://touch.dujia.qunar.com/depCities.qunar'
response = requests.get(url)
# print(response.text)
dep_data = response.json()

# Walk through every departure city
for group in dep_data['data']:
    for dep in dep_data['data'][group]:
        print(dep)
        link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(dep)
        # Sleep for a random interval between requests
        time.sleep(random.randint(1, 2))
        dest_data = requests.get(link_url).json()
        destinations = []
        # Collect the destination values, skipping duplicates
        for module in dest_data['data']:
            for sub_module in module['subModules']:
                for item in sub_module['items']:
                    city = item['query']
                    if city not in destinations:
                        destinations.append(city)
        # print(destinations)
        # Sleep for a random interval between requests
        time.sleep(random.randint(1, 2))
        # Request the product list for every destination
        for city in destinations:
            # Build the request URL for the first page (offset 0)
            url3 = LIST_URL.format(quote(dep), quote(city), quote(city), 0)
            html3 = requests.get(url=url3, headers=headers)
            print(url3)
            print(html3.json())
            # # Read routeCount from the response
            # num = int(html3.json()['data']['limit']['routeCount'])
            #
            # # Each page returns only 24 items
            # for n in range(0, num, 24):
            #     url4 = LIST_URL.format(quote(dep), quote(city), quote(city), n)
            #     # Sleep for a random interval between requests
            #     time.sleep(random.randint(1, 2))
            #     html4 = requests.get(url=url4, headers=headers)
            #     result = html4.json()
            #     print(result)
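The paging rule described in this step (offsets in multiples of 24, bounded by routeCount) can be checked on its own before sending any requests. This is a minimal sketch; `page_limits` is a hypothetical helper, and the value 50 is a made-up routeCount used only for illustration.

```python
def page_limits(route_count, page_size=24):
    """Return the limit values ("offset,size") needed to cover route_count items."""
    return [
        f"{offset},{page_size}"
        for offset in range(0, route_count, page_size)
    ]


# With an assumed routeCount of 50, three pages are needed
print(page_limits(50))
```

Each returned string can be dropped straight into the limit parameter of the list URL.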