  • Analyzing Taobao Lipstick Reviews (guys, keep out)

    1 Preface

    In this post, 知识追寻者 picks a high-selling lipstick on Taobao at random and analyzes its review data; after working through it, you will know how to use the word-cloud library and basic word segmentation for data analysis.

    WeChat official account: 知识追寻者

    知识追寻者 (Inheriting the spirit of open source, spreading technology knowledge)

    2 Logging into Taobao and Scraping Lipstick Reviews

    Address: https://detail.tmall.com/item.htm?spm=a230r.1.14.1.da793f34qRRoFb&id=594188372494&ns=1&abbucket=9

    It is a lipstick aimed at students, called colorkey.

    First, log in to Taobao and then open the product page.

    Next, open the Network tab in the browser's developer tools and pin down the review URL as follows:

    First, copy a key phrase from a review on the page and paste it into the Network search box to locate the review request;

    then click the URL to obtain the key fields:

    • request URL
    • referer
    • cookie
    • User-Agent

    Copy everything except the request URL into the request headers, for example:

    headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
            'referer':'https://detail.tmall.com/item.htm?spm=a230r.1.14.1.da793f34qRRoFb&id=594188372494&ns=1&abbucket=9',
            'cookie':'.........'
    }
    

    The analysis needs a large amount of data to back it up, so we grab roughly 250 pages of reviews. `currentPage` is the current page number, and it is the only thing we need to change. The URL to fetch looks like this:

    https://rate.tmall.com/list_detail_rate.htm?itemId=594188372494&spuId=1439508724&sellerId=4144020062&order=3&currentPage=1&.....
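    The paging can be sketched as a small helper that splices the page number into the query string. This is a minimal sketch: the real request carries many more parameters (shown in full in the script below), and only the IDs taken from the product page above are kept here.

    ```python
    # Minimal sketch: build the review-list URL for a given page number.
    # The real query string carries more parameters; this keeps only the ones
    # that identify the product.
    BASE = ('https://rate.tmall.com/list_detail_rate.htm'
            '?itemId=594188372494&spuId=1439508724&sellerId=4144020062&order=3')

    def rate_url(page):
        """Return the review-list URL for page `page`."""
        return BASE + '&currentPage=' + str(page)

    print(rate_url(3))
    ```

    Looping `page` from 1 upward then yields one URL per review page.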
    

    Then send the requests, scrape the data, and save it to Excel. The full code:

    # -*- coding: utf-8 -*-
    import re
    import requests
    import pandas as pd
    import time
    
    rate_list = []
    classify = []
    headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
            'referer':'https://detail.tmall.com/item.htm?spm=a230r.1.14.1.da793f34qRRoFb&id=594188372494&ns=1&abbucket=9',
            'cookie':'........'
    }
    # fetch 250 pages of reviews
    for page in range(1, 251):
        try:
            front = 'https://rate.tmall.com/list_detail_rate.htm?itemId=594188372494&spuId=1439508724&sellerId=4144020062&order=3&currentPage='
            rear = '&append=0&content=1&tagId=&posi=&picture=&groupId=&ua=098%23E1hvwvvEvbQvU9CkvvvvvjiPn25p6jtbn2LwzjivPmPvsjYRR2M96jDvP259AjibRsujvpvhvvpvv8wCvvpvvUmmvphvC9v9vvCvpbyCvm9vvvvvphvvvvvv96Cvpv3Zvvm2phCvhRvvvUnvphvppvvv96CvpCCvkphvC99vvOCgo8yCvv9vvUmgOg9MyvyCvhQUaGyvClsWa4AU%2B2DkLuc61WkwVzBO0f0DyBvOJ1kHsX7veC6AxYjxAfyp%2B3%2BIaNoxfBAKfvDrgjc6%2BulsbdmxfwkK5kx%2Fgj7QD46w2QhvCPMMvvvtvpvhvvvvvv%3D%3D&needFold=0&_ksTS=1585445591007_822&callback=jsonp823'
            url = front + str(page) + rear
            data = requests.get(url, headers=headers).text
            # pull the review text and the SKU out of the JSONP response
            rate = re.findall('"rateContent":"(.*?)","fromMall"', data)
            clazz = re.findall('"auctionSku":"(.*?)","anony"', data)
            rate_list.append(rate)
            classify.append(clazz)
            time.sleep(8)  # throttle requests to avoid getting blocked
            print('current page %s' % page)
        except requests.RequestException as e:
            print(e)
    
    
    frame = pd.DataFrame()
    frame['评论'] = rate_list
    frame['分类'] = classify
    frame.to_excel('口红评论分类.xlsx')
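    The endpoint answers with a JSONP-wrapped body rather than plain JSON, which is why the script pulls fields out with regular expressions instead of a JSON parser. The two patterns can be exercised against a made-up payload (the field contents below are hypothetical, only the field names match the real response):

    ```python
    import re

    # Hypothetical fragment mimicking the JSONP body of list_detail_rate.htm
    data = ('jsonp823({"rateList":['
            '{"auctionSku":"色号:301","anony":true,"rateContent":"很好用","fromMall":true},'
            '{"auctionSku":"色号:302","anony":true,"rateContent":"颜色漂亮","fromMall":true}]})')

    # Same two patterns as the scraper above
    rates = re.findall('"rateContent":"(.*?)","fromMall"', data)
    skus = re.findall('"auctionSku":"(.*?)","anony"', data)
    print(rates)  # ['很好用', '颜色漂亮']
    print(skus)   # ['色号:301', '色号:302']
    ```

    The non-greedy `(.*?)` stops at the first following field name, so each match stays inside its own review object.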
    

    After a long wait, the data is ready.

    3 Data Analysis

    # -*- coding: utf-8 -*-
    import pandas as pd
    import matplotlib.pyplot as plt
    import jieba  # word segmentation library
    from wordcloud import WordCloud, ImageColorGenerator  # word cloud library
    from PIL import Image
    import numpy as np
    import re
    
    
    frame = pd.read_excel('../口红评论分类.xlsx')
    values = frame['评论'].values.tolist()
    segments = []
    for value in values:
         # each cell holds the string form of a list; strip the brackets
         slic = value.replace('[','',1).replace(']','',1)
         for val in slic.split(','):
             seg_list = jieba.cut(val, cut_all=False)
             segments.append(seg_list)
    
    words = []
    
    # keep only CJK characters, digits, and ASCII letters
    for segment in segments:
        for seg in segment:
            sub_str = re.sub(u"[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a]", "", seg)
            if sub_str == '':
                pass
            else:
                words.append(sub_str)
    # count word frequencies
    word_count = pd.Series(data=words).value_counts()
    
    wc = WordCloud(font_path=r"C:\Windows\WinSxS\amd64_microsoft-windows-font-truetype-dengxian_31bf3856ad364e35_10.0.18362.1_none_2f009e78b33b73a9\Dengb.ttf"
                   , background_color='white', width=350,
                   height=276, max_font_size=80,
                   max_words=1000)
    # fit the cloud to the top 100 words
    wc.fit_words(word_count[:100])
    
    # load the background image
    image = Image.open(r'C:\mydata\generator\py\main.jpg')
    graph = np.array(image)
    # build a colour palette from the background image
    image_color = ImageColorGenerator(graph)
    # recolour the cloud with it
    wc.recolor(color_func=image_color)
    wc.to_file('lipstick.png')
    
    # name the figure
    plt.figure("Lipstick Reviews")
    # show the word cloud as an image
    plt.imshow(wc)
    # hide the axes
    plt.axis("off")
    plt.show()
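    One subtlety in the cleaning step: the character-class regex only does its job when the `\u` escapes carry their backslashes. It is meant to keep CJK ideographs (`\u4e00`–`\u9fa5`), digits, and ASCII letters while dropping punctuation, spaces, and emoji. A quick standalone check:

    ```python
    import re

    # Keep CJK ideographs, digits, and ASCII letters; strip everything else.
    PATTERN = u"[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a]"

    print(re.sub(PATTERN, "", "好用!!123 abc~~"))  # 好用123abc
    ```

    Without the backslashes, the class would instead match literal characters such as `u`, `4`, and `e`, silently mangling the words.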
    

    The generated image is shown below. Overall, the ladies are quite satisfied with this lipstick and like it.

    4 References

    [word cloud] https://blog.csdn.net/csdn2497242041/article/details/77175112

    [word cloud] https://blog.csdn.net/FontThrone/article/details/72782499

    [Chinese regex matching] https://blog.csdn.net/jlulxg/article/details/84650683

  • Original post: https://www.cnblogs.com/zszxz/p/12843415.html