  • 25. Scraping product data from Qunar (Part 1)


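    1. Analyze the page
    Page URL: http://touch.qunar.com/
    The target is the independent-travel (自由行) channel under the vacation section.
    Selecting a city triggers an XHR request that fetches the listing data.

    Request URL: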

    https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search

    The URL is assembled from query parameters. Every sequence that starts with % is a percent-encoded (URL-escaped) Chinese string.
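To see what the escaped values stand for, the standard library can decode them. This is a minimal sketch using urllib.parse; the two encoded strings are copied from the request URL above, and the variable names are only illustrative.

```python
from urllib.parse import quote, unquote

# Percent-encoded values copied from the request URL above
dep_encoded = '%E5%B9%BF%E5%B7%9E'
query_encoded = '%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C'

# unquote() turns the UTF-8 escape sequences back into text
dep = unquote(dep_encoded)
query = unquote(query_encoded)
print(dep, query)

# quote() is the reverse step, used when building a request URL
dep_round_trip = quote(dep)
```

Decoding first and re-encoding only when the request is built keeps the parameter values readable in the rest of the script.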

    Decoded URL:

    https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=广州&query=厦门自由行&dappDealTrace=false&mobFunction=扩展自由行&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=厦门自由行&limit=0,24&includeAD=true&qsact=search

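    Breaking down the URL:

    dep: the departure city (I am in Guangzhou, so it is set to 广州).

    query and originalquery: the destination.

    (So changing only the departure and destination parameters is enough to walk through all the products; each departure/destination combination returns different data.)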
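As a rough illustration of how the departure and destination slot into the request, the sketch below rebuilds the query string from a dictionary. build_list_url is a hypothetical helper, not part of the original post; the parameter names and fixed values are copied from the URL above. The encoded output may differ cosmetically from the browser's (for example, commas are escaped), but it decodes to the same values.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = 'https://touch.dujia.qunar.com/list'


def build_list_url(dep, query, offset=0, page_size=24):
    """Assemble the list URL for one departure/destination pair (hypothetical helper)."""
    params = {
        'modules': 'list,bookingInfo,activityDetail',
        'dep': dep,                      # departure city, e.g. '广州'
        'query': query,                  # destination query, e.g. '厦门自由行'
        'dappDealTrace': 'false',
        'mobFunction': '扩展自由行',
        'cfrom': 'zyx',
        'it': 'dujia_hy_destination',
        'date': '',
        'needNoResult': 'true',
        'originalquery': query,
        'limit': '{},{}'.format(offset, page_size),
        'includeAD': 'true',
        'qsact': 'search',
    }
    # urlencode() percent-encodes the Chinese values as UTF-8
    return BASE_URL + '?' + urlencode(params)


url = build_list_url('广州', '厦门自由行')
print(url)
```

Only dep and query vary between calls, which is what makes it possible to enumerate every departure/destination combination.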

    Opening the URL directly in a browser shows the raw JSON response.

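    2. Get the departure (dep) parameter values
    Request URL: https://touch.dujia.qunar.com/p/public/dep
    # Fetch the departure city parameters
    import requests

    url = 'https://touch.dujia.qunar.com/depCities.qunar'
    html = requests.get(url)
    # print(html.text)
    dep_data = html.json()
    # Each key under 'data' groups a set of departure cities
    for i in dep_data['data']:
        for j in dep_data['data'][i]:
            print(j)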

    Running it prints the departure cities, one per line.

    3. Get the destination parameters for each departure city
    
    import requests

    url = 'https://touch.dujia.qunar.com/depCities.qunar'
    html = requests.get(url)
    # print(html.text)
    dep_data = html.json()
    # Iterate over the departure cities
    for i in dep_data['data']:
        for j in dep_data['data'][i]:
            print(j)
            link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(j)
            html2 = requests.get(link_url)
            dict2 = html2.json()
            c_list = []
            # Collect the destination parameters
            for k in dict2['data']:
                for l in k['subModules']:
                    for m in l['items']:
                        city = m['query']
                        # Skip destinations that are already in the list
                        if city not in c_list:
                            c_list.append(city)
            print(c_list)
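The extraction and de-duplication step can be tried offline. The sketch below is a hypothetical helper that walks the same fields the nested loops above read (data, subModules, items, query); the sample payload is made up for illustration and is not real API output.

```python
def extract_destinations(payload):
    """Return the unique 'query' values in first-seen order (hypothetical helper)."""
    seen = set()
    destinations = []
    for module in payload.get('data', []):
        for sub_module in module.get('subModules', []):
            for item in sub_module.get('items', []):
                city = item.get('query')
                # Keep each destination once, preserving the original order
                if city and city not in seen:
                    seen.add(city)
                    destinations.append(city)
    return destinations


# Made-up sample with the same nesting as the fields read above
sample = {
    'data': [
        {'subModules': [
            {'items': [{'query': '厦门'}, {'query': '三亚'}]},
            {'items': [{'query': '厦门'}]},
        ]},
    ],
}
print(extract_destinations(sample))
```

Tracking seen values in a set keeps the membership check cheap when a departure city has many destinations, while the list preserves the order in which they appeared.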

    Each departure city maps to many destinations.

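    4. Get the product list

    With dep and query in hand, the next step is to request the JSON data, looking at how the URL changes between requests and at the routeCount value that controls paging: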
    https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search

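    The limit parameter changes between requests: its offset advances in multiples of 24. Read routeCount from the response, then request one URL per page of results.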
    import requests
    import urllib.request
    import random, time

    url = 'https://touch.dujia.qunar.com/depCities.qunar'
    html = requests.get(url)
    # print(html.text)
    dep_data = html.json()
    # Iterate over the departure cities
    for i in dep_data['data']:
        for j in dep_data['data'][i]:
            print(j)
            link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(j)

            # Sleep for a random interval between requests
            time.sleep(random.randint(1,2))

            html2 = requests.get(link_url)
            dict2 = html2.json()
            c_list = []
            # Collect the destination parameters, skipping duplicates
            for k in dict2['data']:
                for l in k['subModules']:
                    for m in l['items']:
                        city = m['query']
                        if city not in c_list:
                            c_list.append(city)
            # print(c_list)

            # Sleep for a random interval between requests
            time.sleep(random.randint(1,2))

            # Request the product data for each destination
            for c in c_list:
                # Build the request URL from the current destination c
                # (the original used city, i.e. only the last destination collected)
                url3 = 'https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep={}&query={}%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery={}&limit=0,24&qsact=scroll'.format(urllib.request.quote(j),urllib.request.quote(c),urllib.request.quote(c))
                # Path and query string without the host (debugging aid only)
                A = url3.replace('https://touch.dujia.qunar.com','')
                # print(A)
                headers = {
                    'cookie': 'QN48=tc_e1b5f5bb4d76a018_16730073949_ad75; csrfToken=d27163582839d6b8cbcb53110ed67077; QN300=organic; QN1=ezu0pVvzuB9qeVd2w90fAg==; _RF1=119.129.117.7; _RSG=AZ4soQG2oI5YMrcq1P6et8; _RDG=283bf2bcd3461d22ef1d94f9276d7c9b85; _RGUID=54b20906-b2d8-48ca-8de8-1990749b55a2; QN205=organic; QN234=home_free_t; _pk_ref.1.8600=%5B%22%22%2C%22%22%2C1542699072%2C%22http%3A%2F%2Ftouch.qunar.com%2F%22%5D; _pk_ses.1.8600=*; QN57=15427010307400.44337198739421924; QN58=1542701030742%7C1542701078367%7C4; QN233=dujia_hy_destination; _pk_id.1.8600=5f2ca9d25160d431.1542699072.1.1542705039.1542699072.; QN243=165',
                    'referer': 'https://touch.dujia.qunar.com/p/list?dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&et=&it=dujia_hy_destination',
                    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
                }
    
                html3 = requests.get(url=url3,headers=headers)
                print(url3)
                print(html3.json())
                # # Read the routeCount value
                # num = int(html3.json()['data']['limit']['routeCount'])
                #
                # # Each page returns only 24 records
                # for n in range(0,num,24):
                #     url4 ='https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep={}&query={}%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery={}&limit={},24&qsact=scroll'.format(urllib.request.quote(j),urllib.request.quote(c),urllib.request.quote(c),n)
                #
                #     # Sleep for a random interval between requests
                #     time.sleep(random.randint(1, 2))
                #
                #     html4 = requests.get(url=url4,headers=headers)
                #     result = html4.json()
                #     print(result)
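The paging rule (limit offsets in multiples of 24, bounded by routeCount) can be checked without touching the network. The sketch below uses two hypothetical helpers; the data → limit → routeCount path is taken from the commented-out code above, and the sample response is made up.

```python
def read_route_count(payload):
    """Read data.limit.routeCount, returning 0 when it is missing or malformed."""
    try:
        return int(payload['data']['limit']['routeCount'])
    except (KeyError, TypeError, ValueError):
        return 0


def page_limits(route_count, page_size=24):
    """Return the 'limit' parameter value for every page of results."""
    return ['{},{}'.format(offset, page_size)
            for offset in range(0, route_count, page_size)]


# Made-up response fragment: 50 routes need three pages of 24
sample = {'data': {'limit': {'routeCount': '50'}}}
limits = page_limits(read_route_count(sample))
print(limits)
```

Falling back to 0 means a response without routeCount simply produces no further page requests instead of raising inside the crawl loop.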
  • Original article: https://www.cnblogs.com/lvjing/p/9990608.html