  • Python web scraping: a step-by-step walkthrough of scraping Vipshop (唯品会) product information

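    Scraping Vipshop product information in practice

    • 1. Target URL and page analysis

    • 2. First attempt at scraping

    • 3. Scraping in practice

      • 3.1 Scraping the product ids

      • 3.2 Building the product id URL

      • 3.3 Converting the product id data and verifying the count

      • 3.4 Fetching product details

    • 4. Full code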

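    1. Target URL and page analysis

    Searching for 护肤套装 (skincare sets) on the Vipshop website returns a product results page.


    Scrolling down, you will notice that more products load automatically as you approach the bottom of the page. This is Ajax at work, which suggests that the product data is served by a JSON interface rather than embedded in the initial HTML. Scroll all the way to the bottom and the pagination buttons appear.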

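    2. First attempt at scraping

    Capture the network traffic to find the URL that actually serves the product data. Right-click the page and choose Inspect, open the Network tab and refresh the page. The requests made by the page are then listed, and they need to be filtered to find the ones that contain product information. On inspection, most of the content turns out to be in requests whose URLs contain callback.


    Filtering on callback leaves seven requests, of which only four are useful. The second one, the rank request, contains the ids of all products on the current page.


    The remaining three v2 requests split these 120 products among themselves (product indices start from 0).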




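    That identifies the real data interfaces behind the 120 products on the search page. Next, try fetching data from one of these URLs to see what comes back, then work out the pattern and check whether all of the page's data can be scraped at once.

    Set up the request headers with User-Agent, Cookie and Referer (click the Headers tab to find them), copy the URL of the data interface into a variable and send the request. The code below starts with the request that covers 20 products.


    To get the cookie, clear the callback filter and select the first request returned by default, the suggest request.


    Note: set the headers according to what your own browser returns.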

    import requests
    
    headers = {
        'Cookie': 'vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
        'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
    
    url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918324165453150280%2C6918256118899745105%2C6918357885382468749%2C6918449056102396358%2C6918702822359352066%2C6918479374036836673%2C6918814278458725896%2C6918585149106754305%2C6918783763771922139%2C6917924417817122013%2C6918747787667990790%2C6918945825686792797%2C6918676686121468885%2C6918690813799719966%2C6917924776628925583%2C6918808484587649747%2C6918524324182323338%2C6917924083191145365%2C6917924119199990923%2C6917924081998898069%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865440'
    html = requests.get(url,headers=headers)
    print(html.text)

    Output: (the response matches what the browser shows for this request)


    Now compare the actual request URLs of the three v2 requests to find the pattern:

    'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918324165453150280%2C6918256118899745105%2C6918357885382468749%2C6918449056102396358%2C6918702822359352066%2C6918479374036836673%2C6918814278458725896%2C6918585149106754305%2C6918783763771922139%2C6917924417817122013%2C6918747787667990790%2C6918945825686792797%2C6918676686121468885%2C6918690813799719966%2C6917924776628925583%2C6918808484587649747%2C6918524324182323338%2C6917924083191145365%2C6917924119199990923%2C6917924081998898069%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865440'
    'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets1&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918241720044454476%2C6917919624790589569%2C6917935170607219714%2C6918794091804350029%2C6918825617469761228%2C6918821681541400066%2C6918343188631192386%2C6918909902880919752%2C6918944714357405314%2C6918598446593061836%2C6917992439761061707%2C6918565057324098974%2C6918647344809112386%2C6918787811445699149%2C6918729979027610590%2C6918770949378056781%2C6918331290238460382%2C6918782319292540574%2C6918398146810241165%2C6918659293579989333%2C6917923814107067291%2C6918162041180009111%2C6918398146827042957%2C6917992175963801365%2C6918885216264034310%2C6918787811496047181%2C6918273588862755984%2C6917924752735125662%2C6918466082515404493%2C6918934739456193886%2C6917924837261255565%2C6918935779609622221%2C6917920117494382747%2C6917987978233958977%2C6917923641027928222%2C6918229910205674453%2C6917970328155673856%2C6918470882161509397%2C6918659293832008021%2C6918750646128649741%2C6917923139576259723%2C6918387987850605333%2C6917924445491982494%2C6918790938962557837%2C6918383695533143067%2C6918872378378761054%2C6918640250037793602%2C6918750646128641549%2C6917937020463562910%2C6917920520629265102%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865436'
    'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets2&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds=6918690813782926366%2C6918447252612175371%2C6918159188446941835%2C6918205147496443989%2C6918006775182997019%2C6918710130501497419%2C6917951703208964235%2C6918936224464094528%2C6918394023211385035%2C6918872268898919262%2C6918397905200202715%2C6918798460682221086%2C6918800888595138517%2C6917919413703328321%2C1369067222846365%2C6917924520139822219%2C6918904223283803413%2C6918507022166130843%2C6918479374087209281%2C6917924176900793243%2C6918750646145443341%2C6918449056102412742%2C6918901362318117467%2C6918570897095177292%2C6917924520223884427%2C6918757924517328902%2C6918398146827051149%2C6918789686747831253%2C6918476662192264973%2C6917919300445017109%2C6917919922739126933%2C6917920155539928286%2C6918662208810186512%2C6917923139508970635%2C6918859281628675166%2C6918750645658871309%2C6918820034693202694%2C6918689681141637573%2C6917919916536480340%2C6918719763326603415%2C6918659293579997525%2C6917920335390225555%2C6918589584225669211%2C6918386595131470421%2C6918640034622429077%2C6917923665227256725%2C6918331290238476766%2C6917924054840074398%2C6917924438479938177%2C6917920679932125915%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600158865437'

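    Comparing the three product-information URLs shows that the essential difference is the productIds parameter in the middle (the callback name and the trailing _ timestamp also differ, but they carry no product data). So once all product ids are known, all product details can be fetched. That is the URL pattern.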
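
    The comparison can also be done mechanically rather than by eye. The sketch below uses the standard library's urllib.parse to list the query parameters whose values differ between two URLs. The two URLs are shortened stand-ins for the v2 URLs (same structure, fewer parameters and ids), and differing_params is a helper name made up for this illustration.

```python
from urllib.parse import urlsplit, parse_qs


def differing_params(url_a, url_b):
    """Return the names of query parameters whose values differ between two URLs."""
    # keep_blank_values=True so that empty parameters such as user_id= are compared too
    query_a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    query_b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    all_names = query_a.keys() | query_b.keys()
    return sorted(name for name in all_names if query_a.get(name) != query_b.get(name))


# Shortened stand-ins for two of the v2 URLs (assumption: same structure, fewer ids)
url_1 = ('https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2'
         '?callback=getMerchandiseDroplets1&app_name=shop_pc&user_id='
         '&productIds=6918241720044454476%2C6917919624790589569%2C'
         '&scene=search&_=1600158865436')
url_3 = ('https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2'
         '?callback=getMerchandiseDroplets3&app_name=shop_pc&user_id='
         '&productIds=6918324165453150280%2C6918256118899745105%2C'
         '&scene=search&_=1600158865440')

print(differing_params(url_1, url_3))
```

    For these two stand-ins the differing parameters are _, callback and productIds, so productIds is the only one that carries product data.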


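    Conveniently, all the product ids are stored in the second request, rank. So first request that URL to get the product ids, then build the detail URLs from them and fetch the product details.

    3. Scraping in practice

    3.1 Scraping the product ids

    To support pagination, look for the parameter that controls which page is returned. The first page holds 120 items and its pageOffset parameter is 0.

    On the second page pageOffset is 120, on the third it is 240, and so on: each page advances the offset by 120, while the other parameters stay almost unchanged.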
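
    The page-to-offset rule is simple enough to write as a small helper. This is a minimal sketch: page_offsets is a hypothetical name, and the page size of 120 is taken from the observation above rather than from any documented API.

```python
def page_offsets(pages, batch_size=120, first_page=1):
    """Yield the pageOffset value for each of `pages` consecutive result pages."""
    for page in range(first_page, first_page + pages):
        # Page 1 starts at offset 0, page 2 at 120, page 3 at 240, ...
        yield (page - 1) * batch_size


print(list(page_offsets(3)))                 # [0, 120, 240]
print(list(page_offsets(1, first_page=2)))   # [120]
```

    Expressing the loop in terms of page numbers avoids off-by-one mistakes when the starting page changes.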

    3.2 Building the product id URL

    The request code is therefore as follows:

    import requests
    import json
    headers = {
        'Cookie': 'vip_province_name=%E6%B2%B3%E5%8D%97%E7%9C%81; vip_city_name=%E4%BF%A1%E9%98%B3%E5%B8%82; vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
        'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
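    n = 1  # n determines how many pages are requested; an input() call could replace it
    for num in range(120,(n+1)*120,120):  # starts from the second page here; to start from the first page use range(0,n*120,120) as in the full code in section 4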
        url = f'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank?callback=getMerchandiseIds&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&standby_id=nature&keyword=%E6%8A%A4%E8%82%A4%E5%A5%97%E8%A3%85&lv3CatIds=&lv2CatIds=&lv1CatIds=&brandStoreSns=&props=&priceMin=&priceMax=&vipService=&sort=0&pageOffset={num}&channelId=1&gPlatform=PC&batchSize=120&_=1600158865435'
        html = requests.get(url,headers=headers)
        print(html.text)

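    Output: (the product ids are retrieved successfully)

    3.3 Converting the product id data and verifying the count

    Next, parse the JSON. The response is JSONP, meaning the JSON object is wrapped in a call to the callback function, so the object has to be cut out of the text before Python can work with it: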

    import json
    
    # Note: the code below belongs inside the for loop
    start = html.text.index('{')
    end = html.text.index('})')+1
    json_data = json.loads(html.text[start:end])
    print(json_data)

    Output: (it contains the product ids we are after)
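
    The slicing above searches for the first '})' in the text, which would cut the data short if that sequence ever appeared inside a string value. A slightly more defensive variant, sketched below, takes everything between the first '(' and the last ')'. The helper name parse_jsonp and the sample payload (including its code field) are made up for illustration; only the data, products and pid keys come from the steps above.

```python
import json


def parse_jsonp(text):
    """Strip a JSONP wrapper such as callbackName({...}) and return the decoded JSON."""
    start = text.index('(') + 1   # first character after the opening parenthesis
    end = text.rindex(')')        # position of the last closing parenthesis
    return json.loads(text[start:end])


# Made-up response in the same shape as the rank response
sample = 'getMerchandiseIds({"code": 1, "data": {"products": [{"pid": "1"}, {"pid": "2"}]}})'
data = parse_jsonp(sample)
print(len(data['data']['products']))  # 2
```

    Both str.index and str.rindex raise ValueError when the parenthesis is missing, so a response that is not JSONP fails loudly instead of being sliced incorrectly.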


    Check that this is the full set of products, that is, whether the number of product ids retrieved (the pid field) equals 120:

    # Also inside the for loop
    print(json_data['data']['products'])
    print('')
    print(len(json_data['data']['products']))

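    Output: (verified; note that the first print shows a list of dictionaries)

    3.4 Fetching product details

    Now loop over the product ids one by one. To build product_url, delete all the product ids from the middle of the v2 URL, leave a {} placeholder in their place and fill it in with the format method: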

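    # Inside the page loop from above
    product_ids = json_data['data']['products']  # list of dicts, each with a 'pid' key
    for product_id in product_ids:
        print('Product id', product_id['pid'])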
        product_url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds={}%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600164018137'.format(product_id['pid'])
        product_html = requests.get(product_url,headers = headers)
        print(product_html.text)

    Output: (partial output shown)
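
    Requesting one product per call means 120 requests per page, whereas the browser's own v2 requests carry up to 50 ids each. A possible refinement, sketched below, splits the ids into batches and builds the productIds value in the same percent-encoded, comma-terminated form seen in the captured URLs. This assumes the endpoint accepts any batch of up to 50 ids, which has not been verified here; chunked and product_ids_param are hypothetical helper names, and the ids are made up.

```python
from urllib.parse import quote


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def product_ids_param(pids):
    """Join ids into the percent-encoded, comma-terminated form seen in the v2 URLs."""
    # safe='' makes quote() encode the commas as %2C
    return quote(''.join(f'{pid},' for pid in pids), safe='')


# Made-up ids for illustration only
pids = [str(n) for n in range(1, 121)]
batches = chunked(pids, 50)
print([len(batch) for batch in batches])   # [50, 50, 20]
print(product_ids_param(['11', '22']))     # 11%2C22%2C
```

    With batched requests, each response would contain several entries under data.products, so the extraction step would need to loop over that list instead of taking only element 0.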


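    As with the product ids, the detail data has to be converted before anything can be extracted from it. For example, to extract the product name, brand and price:

    # Still inside the inner for loop; shown here for the first 10 products as an example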
    product_start = product_html.text.index('{')
    product_end = product_html.text.index('})')+1
    product_json_data = json.loads(product_html.text[product_start:product_end])
    product_info_data = product_json_data['data']['products'][0]
    # print(product_info_data)
    product_title = product_info_data['title']
    product_brand = product_info_data['brandShowName']
    product_price = product_info_data['price']['salePrice']
    print('Product name: {}, brand: {}, sale price: {}'.format(product_title,product_brand,product_price))

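    Output: (the information is retrieved correctly; the title, brand and sale price are used as examples here, and more detailed fields can be extracted as well)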
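
    Direct indexing raises KeyError as soon as one product lacks a field. A tolerant variant is sketched below, using the same three keys (title, brandShowName and price.salePrice) and falling back to an empty string when one is missing. The function name extract_product, the output key names and the sample records are made up for illustration.

```python
def extract_product(info):
    """Pick title, brand and sale price out of one product dict, tolerating missing keys."""
    return {
        'title': info.get('title', ''),
        'brand': info.get('brandShowName', ''),
        # 'price' may be absent or None, so fall back to an empty dict first
        'sale_price': (info.get('price') or {}).get('salePrice', ''),
    }


# Hypothetical records in the shape used above
complete = {'title': 'Example skincare set', 'brandShowName': 'ExampleBrand',
            'price': {'salePrice': '199'}}
partial = {'title': 'No price listed'}

print(extract_product(complete))
print(extract_product(partial))
```

    Returning a dictionary instead of three separate variables also makes it easier to hand the record to a file writer later.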


    The last step is to write the data to a local file:

    with open('vip.txt','a+',encoding = 'utf-8') as f:
        f.write('Product name: {}, brand: {}, sale price: {}\n'.format(product_title,product_brand,product_price))

    Output: (scraping is complete and the data is saved locally)

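    4. Full code

    The whole process could be wrapped in functions, and the data could also be stored locally as csv or xlsx. Only plain-text (txt) storage is shown here.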

    import requests
    import json
    
    headers = {
        'Cookie': 'vip_province_name=%E6%B2%B3%E5%8D%97%E7%9C%81; vip_city_name=%E4%BF%A1%E9%98%B3%E5%B8%82; vip_city_code=104101115; vip_wh=VIP_HZ; vip_ipver=31; user_class=a; mars_sid=ff7be68ad4dc97e589a1673f7154c9f9; VipUINFO=luc%3Aa%7Csuc%3Aa%7Cbct%3Ac_new%7Chct%3Ac_new%7Cbdts%3A0%7Cbcts%3A0%7Ckfts%3A0%7Cc10%3A0%7Crcabt%3A0%7Cp2%3A0%7Cp3%3A1%7Cp4%3A0%7Cp5%3A0%7Cul%3A3105; mars_pid=0; visit_id=98C7BA95D1CA0C0E518537BD0B4ABEA0; vip_tracker_source_from=; pg_session_no=5; mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375',
        'Referer': 'https://category.vip.com/suggest.php?keyword=%E6%8A%A4%E8%82%A4&ff=235|12|1|1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
    
    n = 1  # n is the number of pages you actually want to scrape
    for num in range(0,n*120,120): 
        url = f'https://mapi.vip.com/vips-mobile/rest/shopping/pc/search/product/rank?callback=getMerchandiseIds&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&standby_id=nature&keyword=%E6%8A%A4%E8%82%A4%E5%A5%97%E8%A3%85&lv3CatIds=&lv2CatIds=&lv1CatIds=&brandStoreSns=&props=&priceMin=&priceMax=&vipService=&sort=0&pageOffset={num}&channelId=1&gPlatform=PC&batchSize=120&_=1600158865435'
        html = requests.get(url,headers=headers)
        # print(html.text)
    
        start = html.text.index('{')
        end = html.text.index('})')+1
        json_data = json.loads(html.text[start:end])
        product_ids = json_data['data']['products']
        for product_id in product_ids:
            print('Product id',product_id['pid'])
            product_url = 'https://mapi.vip.com/vips-mobile/rest/shopping/pc/product/module/list/v2?callback=getMerchandiseDroplets3&app_name=shop_pc&app_version=4.0&warehouse=VIP_HZ&fdc_area_id=104101115&client=pc&mobile_platform=1&province_id=104101&api_key=70f71280d5d547b2a7bb370a529aeea1&user_id=&mars_cid=1600153235012_7a06e53de69c79c1bad28061c13e9375&wap_consumer=a&productIds={}%2C&scene=search&standby_id=nature&extParams=%7B%22stdSizeVids%22%3A%22%22%2C%22preheatTipsVer%22%3A%223%22%2C%22couponVer%22%3A%22v2%22%2C%22exclusivePrice%22%3A%221%22%2C%22iconSpec%22%3A%222x%22%7D&context=&_=1600164018137'.format(product_id['pid'])
            product_html = requests.get(product_url,headers = headers)
            product_start = product_html.text.index('{')
            product_end = product_html.text.index('})')+1
            product_json_data = json.loads(product_html.text[product_start:product_end])
            product_info_data = product_json_data['data']['products'][0]
            # print(product_info_data)
            product_title = product_info_data['title']
            product_brand = product_info_data['brandShowName']
            product_price = product_info_data['price']['salePrice']
            print('Product name: {}, brand: {}, sale price: {}'.format(product_title,product_brand,product_price))
            with open('vip.txt','a+',encoding = 'utf-8') as f:
                f.write('Product name: {}, brand: {}, sale price: {}\n'.format(product_title,product_brand,product_price))

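    With n = 4, run the code again. Opening the txt file in Sublime Text to check the number of lines shows exactly four pages' worth of products, which completes this walkthrough of scraping Vipshop product information.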
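
    Section 4 mentions csv as an alternative storage format. A minimal sketch with the standard csv module follows. The column names and sample rows are made up, and the rows are written to an in-memory buffer so that the example runs without touching the disk.

```python
import csv
import io

FIELDS = ['title', 'brand', 'sale_price']  # hypothetical column names


def write_products_csv(rows, file_obj):
    """Write a header line followed by one CSV line per product dict."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up rows for illustration only
rows = [
    {'title': 'Example skincare set', 'brand': 'ExampleBrand', 'sale_price': '199'},
    {'title': 'Another set', 'brand': 'OtherBrand', 'sale_price': '259'},
]

buffer = io.StringIO()
write_products_csv(rows, buffer)
print(buffer.getvalue())
```

    To write a real file, pass open('vip.csv', 'w', newline='', encoding='utf-8-sig') instead of the buffer. newline='' is what the csv module expects for file objects, and the utf-8-sig encoding helps spreadsheet programs detect UTF-8 when the titles contain Chinese text.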


  • Original article: https://www.cnblogs.com/chenlove/p/13778098.html