  • 25. Scraping product data from Qunar (去哪儿网), Part 1


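    1. Analyze the page
    Page URL: http://touch.qunar.com/
    The target is the independent-travel (自由行) channel in the Vacations (度假) section.
    For a given city, the data can be seen arriving through an XHR request.

    Request URL: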

    https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search

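    The URL is assembled from query parameters. Every sequence that starts with % is a percent-encoded Chinese string, so what we see here is the escaped form of the data.

    Decoded URL: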

    https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=广州&query=厦门自由行&dappDealTrace=false&mobFunction=扩展自由行&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=厦门自由行&limit=0,24&includeAD=true&qsact=search
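
    As a quick check of the encoding, the standard library can round-trip these values. This is a minimal sketch using urllib.parse.unquote and urllib.parse.quote; the variable names are illustrative and not part of the original post.

```python
from urllib.parse import quote, unquote

# Percent-encoded values copied from the request URL above.
encoded_dep = "%E5%B9%BF%E5%B7%9E"
encoded_query = "%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C"

# unquote() turns UTF-8 percent-escapes back into readable text.
dep = unquote(encoded_dep)
query = unquote(encoded_query)
print(dep, query)

# quote() goes the other way and reproduces the escaped form.
print(quote(dep))
```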

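    Breaking the URL down:

    dep: the departure city (I am in Guangzhou, so the site located me there)

    query and originalquery: the destination

    (So changing just these request parameters is enough to walk through all the product data; each departure/destination combination returns a different result set.)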
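
    One way to avoid assembling the string by hand is to let urllib.parse.urlencode do the escaping. The sketch below is only an illustration: the helper name build_list_url and its arguments are assumptions, while the parameter names and fixed values are copied from the request URL above. Note that urlencode also escapes the comma in limit as %2C, which should decode to the same value on the server side.

```python
from urllib.parse import urlencode

BASE_URL = "https://touch.dujia.qunar.com/list"


def build_list_url(dep, destination, offset=0, page_size=24):
    """Build the list URL for one departure/destination pair (hypothetical helper)."""
    # The observed request appends "自由行" to the destination name.
    query = destination + "自由行"
    params = {
        "modules": "list,bookingInfo,activityDetail",
        "dep": dep,
        "query": query,
        "dappDealTrace": "false",
        "mobFunction": "扩展自由行",
        "cfrom": "zyx",
        "it": "dujia_hy_destination",
        "date": "",
        "needNoResult": "true",
        "originalquery": query,
        "limit": "{},{}".format(offset, page_size),
        "includeAD": "true",
        "qsact": "search",
    }
    # urlencode() percent-encodes every value as UTF-8.
    return BASE_URL + "?" + urlencode(params)


url = build_list_url("广州", "厦门")
print(url)
```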

    Opening the URL directly in a browser shows the raw JSON response.

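    2. Get the values for the dep (departure city) parameter
    Request address: https://touch.dujia.qunar.com/p/public/dep
    # Fetch the departure-city parameters
    import requests

    url = 'https://touch.dujia.qunar.com/depCities.qunar'
    html = requests.get(url)
    # print(html.text)
    data = html.json()  # named data so the built-in dict is not shadowed
    for i in data['data']:
        for j in data['data'][i]:
            print(j)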

    Running this prints the departure cities one per line.
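
    The nested loop can also be wrapped in a small function that returns a list instead of printing. This is a sketch that assumes the payload has the shape the loop above implies: a 'data' mapping whose values are iterables of city names. The function name and the sample payload are hypothetical.

```python
def flatten_departures(payload):
    """Return every departure city from a depCities-style payload, in order."""
    cities = []
    # Each value under 'data' is one group of city names.
    for group in payload.get("data", {}).values():
        cities.extend(group)
    return cities


# Hypothetical sample; the real response may carry more fields.
sample = {"data": {"G": ["广州", "桂林"], "S": ["上海", "深圳"]}}
print(flatten_departures(sample))
```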

    3. Get the destination parameters for each departure city

    import requests

    url = 'https://touch.dujia.qunar.com/depCities.qunar'
    html = requests.get(url)
    # print(html.text)
    data = html.json()  # named data so the built-in dict is not shadowed
    # Iterate over the departure cities
    for i in data['data']:
        for j in data['data'][i]:
            print(j)
            link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(j)
            html2 = requests.get(link_url)
            dict2 = html2.json()
            c_list = []
            # Collect the destination parameters
            for k in dict2['data']:
                for l in k['subModules']:
                    for m in l['items']:
                        city = m['query']
                        # Skip duplicates
                        if city not in c_list:
                            c_list.append(city)
            print(c_list)

    Each departure city maps to many destinations.
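
    The deduplication step can be sketched on its own, without a network call. The function name and the sample payload below are hypothetical; the nesting (data, subModules, items, query) follows the keys used in the code above. A set is used for the membership test so the check stays fast as the list grows, while the list keeps first-seen order.

```python
def extract_destinations(payload):
    """Collect unique 'query' values, preserving first-seen order."""
    seen = set()
    ordered = []
    for module in payload.get("data", []):
        for sub in module.get("subModules", []):
            for item in sub.get("items", []):
                city = item.get("query")
                if city and city not in seen:
                    seen.add(city)
                    ordered.append(city)
    return ordered


# Hypothetical sample with one repeated destination.
sample = {
    "data": [
        {
            "subModules": [
                {"items": [{"query": "厦门"}, {"query": "三亚"}]},
                {"items": [{"query": "厦门"}, {"query": "丽江"}]},
            ]
        }
    ]
}
print(extract_destinations(sample))
```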

    4. Get the product list

    With the dep and query parameters in hand, the next step is to request the JSON data itself and look at how its URL changes, along with the routeCount field that drives paging.
    https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&limit=0,24&includeAD=true&qsact=search

    The first value of limit advances in multiples of 24 from one request to the next, so reading routeCount tells us how many different URLs to request.
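
    The paging arithmetic can be checked in isolation. This is a minimal sketch that assumes routeCount is the total number of products and that each request returns at most 24 of them; page_limits is a hypothetical helper name.

```python
def page_limits(route_count, page_size=24):
    """Return the 'offset,size' strings needed to cover route_count items."""
    return [
        "{},{}".format(offset, page_size)
        for offset in range(0, route_count, page_size)
    ]


print(page_limits(60))  # ['0,24', '24,24', '48,24']
```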
    import requests
    import urllib.request  # makes urllib.request.quote (used below) available
    import random, time
    url = 'https://touch.dujia.qunar.com/depCities.qunar'
    html = requests.get(url)
    # print(html.text)
    data = html.json()  # named data so the built-in dict is not shadowed
    # Iterate over the departure cities
    for i in data['data']:
        for j in data['data'][i]:
            print(j)
            link_url = 'https://touch.dujia.qunar.com/golfz/sight/arriveRecommend?dep={}&exclude=&extensionImg=255,175'.format(j)
    
            # Sleep for a random interval between requests
            time.sleep(random.randint(1,2))
    
            html2 = requests.get(link_url)
            dict2 = html2.json()
            c_list = []
            # Collect the destination parameters, skipping duplicates
            for k in dict2['data']:
                for l in k['subModules']:
                    for m in l['items']:
                        city = m['query']
                        if city not in c_list:
                            c_list.append(city)
            # print(c_list)
    
            # Sleep for a random interval between requests
            time.sleep(random.randint(1,2))
    
            # Request the product data
            for c in c_list:
                # Build the request URL from the current destination c
                # (the original used city, which always held the last destination seen)
                url3 = 'https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep={}&query={}%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery={}&limit=0,24&qsact=scroll'.format(urllib.request.quote(j),urllib.request.quote(c),urllib.request.quote(c))
                # Path and query string without the host (not used further here)
                A = url3.replace('https://touch.dujia.qunar.com','')
                # print(A)
                headers = {
                    'cookie': 'QN48=tc_e1b5f5bb4d76a018_16730073949_ad75; csrfToken=d27163582839d6b8cbcb53110ed67077; QN300=organic; QN1=ezu0pVvzuB9qeVd2w90fAg==; _RF1=119.129.117.7; _RSG=AZ4soQG2oI5YMrcq1P6et8; _RDG=283bf2bcd3461d22ef1d94f9276d7c9b85; _RGUID=54b20906-b2d8-48ca-8de8-1990749b55a2; QN205=organic; QN234=home_free_t; _pk_ref.1.8600=%5B%22%22%2C%22%22%2C1542699072%2C%22http%3A%2F%2Ftouch.qunar.com%2F%22%5D; _pk_ses.1.8600=*; QN57=15427010307400.44337198739421924; QN58=1542701030742%7C1542701078367%7C4; QN233=dujia_hy_destination; _pk_id.1.8600=5f2ca9d25160d431.1542699072.1.1542705039.1542699072.; QN243=165',
                    'referer': 'https://touch.dujia.qunar.com/p/list?dep=%E5%B9%BF%E5%B7%9E&query=%E5%8E%A6%E9%97%A8%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&et=&it=dujia_hy_destination',
                    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
                }
    
                html3 = requests.get(url=url3,headers=headers)
                print(url3)
                print(html3.json())
                # # Read the routeCount field
                # num = int(html3.json()['data']['limit']['routeCount'])
                #
                # # Each page returns only 24 records
                # for n in range(0,num,24):
                #     url4 = 'https://touch.dujia.qunar.com/list?modules=list%2CbookingInfo%2CactivityDetail&dep={}&query={}%E8%87%AA%E7%94%B1%E8%A1%8C&dappDealTrace=false&mobFunction=%E6%89%A9%E5%B1%95%E8%87%AA%E7%94%B1%E8%A1%8C&cfrom=zyx&it=dujia_hy_destination&date=&needNoResult=true&originalquery={}&limit={},24&qsact=scroll'.format(urllib.request.quote(j),urllib.request.quote(c),urllib.request.quote(c),n)
                #
                #     # Sleep for a random interval between requests
                #     time.sleep(random.randint(1, 2))
                #
                #     html4 = requests.get(url=url4,headers=headers)
                #     result = html4.json()
                #     print(result)
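
    The commented-out paging loop can be sketched as a generator that is independent of how a page is fetched. This is only an illustration under stated assumptions: iter_pages and fake_fetch are hypothetical names, and the location of routeCount (data, limit, routeCount) is taken from the commented code above. The fetcher is passed in so the logic can be exercised without touching the network.

```python
def iter_pages(fetch_page, page_size=24):
    """Yield every result page, given fetch_page(offset) -> decoded JSON dict."""
    first = fetch_page(0)
    yield first
    # The first response reports how many products exist in total.
    total = int(first["data"]["limit"]["routeCount"])
    # Offset 0 is already fetched, so continue from the second page.
    for offset in range(page_size, total, page_size):
        yield fetch_page(offset)


def fake_fetch(offset):
    """Stand-in for a real HTTP call; reports 50 products in total."""
    return {"data": {"limit": {"routeCount": 50}, "offset": offset}}


offsets = [page["data"]["offset"] for page in iter_pages(fake_fetch)]
print(offsets)
```

    In a real run, fetch_page would wrap the requests.get call with the headers and the random sleep shown earlier, and return html.json().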
  • Original article: https://www.cnblogs.com/lvjing/p/9990608.html