  • Price suggestion (EDA), Part 1: simple data mining

    https://www.kaggle.com/thykhuely/mercari-interactive-eda-topic-modelling

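    The goal of this competition is to build a price-estimation model for items from the information that sellers provide.

    A rough look at the data provided:

    What each column means:
    name: the item's name

    item_condition_id: the condition of the item as reported by the seller (I am not entirely sure about this field; it probably indicates how good or worn the item is)

    category_name: each item carries three corresponding labels, joined by "/"

    brand_name: the brand name

    price: the price (the value we want to predict)

    shipping: 1 if the shipping fee is paid by the seller, 0 if it is paid by the buyer

    item_description: a description of the item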

     

    1. First, analyze price, our target

    train.price.describe()

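    The summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) give a first picture of the target.

    Next, look at how price is distributed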

    import numpy as np
    import matplotlib.pyplot as plt

    # Left panel: raw price, restricted to [0, 250] so the bulk of the data is visible
    plt.subplot(1, 2, 1)
    train['price'].plot.hist(bins=50, figsize=(20,10), edgecolor='white', range=[0,250])
    plt.xlabel('price', fontsize=17)
    plt.ylabel('frequency', fontsize=17)
    plt.tick_params(labelsize=15)
    plt.title('Price Distribution - Training Set', fontsize=17)
    
    # Right panel: log(price+1), which handles price == 0 without producing -inf
    plt.subplot(1, 2, 2)
    np.log(train['price']+1).plot.hist(bins=50, figsize=(20,10), edgecolor='white')
    plt.xlabel('log(price+1)', fontsize=17)
    plt.ylabel('frequency', fontsize=17)
    plt.tick_params(labelsize=15)
    plt.title('Log(Price+1) Distribution - Training Set', fontsize=17)
    plt.show()

    Taking log(price+1) here brings the data closer to a normal distribution. A Box-Cox transform is another option worth trying.
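
    To make the effect of the transform concrete, here is a minimal sketch on synthetic data. The prices below are generated from a log-normal distribution as a stand-in and are not the Mercari data; the variable names are made up for illustration. It compares the skewness before and after log(price+1) and shows that np.expm1 undoes the transform, which matters when a model is trained on the log scale and its predictions have to be converted back to prices.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (an assumption, NOT the Mercari data).
rng = np.random.default_rng(0)
price = pd.Series(np.round(rng.lognormal(mean=3.0, sigma=0.8, size=10_000), 2))

log_price = np.log1p(price)      # numerically the same idea as np.log(price + 1)
restored = np.expm1(log_price)   # inverse transform, back on the price scale

skew_raw = price.skew()
skew_log = log_price.skew()
print("skewness of price:        %.2f" % skew_raw)
print("skewness of log(price+1): %.2f" % skew_log)
```

    A skewness close to 0 is what one would expect from a roughly symmetric, bell-shaped histogram like the right-hand panel above.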

    2. Analyze shipping (who pays the shipping fee)

    Compute the share of each of the two shipping options

    train.shipping.value_counts()/len(train)

    Compare price across the two shipping options

    prc_shipBySeller = train.loc[train.shipping==1, 'price']
    prc_shipByBuyer = train.loc[train.shipping==0, 'price']
    
    fig, ax = plt.subplots(figsize=(20,10))
    ax.hist(np.log(prc_shipBySeller+1), color='#8CB4E1', alpha=1.0, bins=50,
           label='Price when Seller pays Shipping')
    ax.hist(np.log(prc_shipByBuyer+1), color='#007D00', alpha=0.7, bins=50,
           label='Price when Buyer pays Shipping')
    ax.legend(fontsize=15)  # display the labels passed to ax.hist
    plt.xlabel('log(price+1)', fontsize=17)
    plt.ylabel('frequency', fontsize=17)
    plt.title('Price Distribution by Shipping Type', fontsize=17)
    plt.tick_params(labelsize=15)
    plt.show()
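
    The overlaid histograms can be backed up with a few numbers per group. The sketch below uses a tiny made-up frame called toy in place of train (the values are illustrative only) and summarizes price for each shipping option with groupby.

```python
import pandas as pd

# Made-up rows standing in for `train`; the values are illustrative only.
toy = pd.DataFrame({
    "shipping": [1, 1, 1, 0, 0, 0],
    "price":    [10.0, 12.0, 20.0, 14.0, 25.0, 30.0],
})

# One row per shipping option: how many items, and their typical price.
summary = toy.groupby("shipping")["price"].agg(["count", "mean", "median"])
print(summary)
```

    On the real data, the same call with train in place of toy would show whether the gap visible in the histograms also appears in the mean and median.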

    3. Analyze category_name (the labels)

    Count the distinct label values

    print("There are %d unique values in the category column." % train['category_name'].nunique())

    Show the five most frequent labels and how many items each has

    # TOP 5 RAW CATEGORIES
    train['category_name'].value_counts()[:5]

    Check how many items have no label

    # missing categories
    print("There are %d items that do not have a label." % train['category_name'].isnull().sum())

    Split the label into its parts

    # reference: BuryBuryZymon at https://www.kaggle.com/maheshdadhich/i-will-sell-everything-for-free-0-55
    def split_cat(text):
        # A missing category is NaN (a float), which has no .split method
        try: return text.split("/")
        except AttributeError: return ("No Label", "No Label", "No Label")
    
    # zip(*rows) turns the per-row parts into per-level columns and stops at the shortest row
    train['general_cat'], train['subcat_1'], train['subcat_2'] = \
        zip(*train['category_name'].apply(split_cat))
    train.head()
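
    As an alternative to the helper above, pandas can do the split directly. This is only a sketch on a made-up three-row frame named toy, not the author's method. One difference to be aware of: with n=2 the third column keeps everything after the second "/", whereas the zip-based version drops any levels beyond the third.

```python
import numpy as np
import pandas as pd

# Made-up category strings in the same "a/b/c" shape; illustrative only.
toy = pd.DataFrame({"category_name": [
    "Men/Tops/T-shirts",
    "Electronics/Computers & Tablets/iPad/Tablet/eBook Readers",
    np.nan,
]})

# Split on the first two "/" only, so there are at most three parts per row.
parts = toy["category_name"].str.split("/", n=2, expand=True)
parts.columns = ["general_cat", "subcat_1", "subcat_2"]

# Rows with a missing category come back as all-NaN; give them a placeholder.
toy = toy.join(parts.fillna("No Label"))
print(toy[["general_cat", "subcat_1", "subcat_2"]])
```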

    Count the distinct values at each sub-label level

    print("There are %d unique first sub-categories." % train['subcat_1'].nunique())

    print("There are %d unique second sub-categories." % train['subcat_2'].nunique())

    Analyze general_cat

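    # Assumed imports: `py` is taken to be plotly.offline and `go` plotly.graph_objs
    import plotly.offline as py
    import plotly.graph_objs as go
    py.init_notebook_mode(connected=True)  # lets py.iplot render inside a notebook

    x = train['general_cat'].value_counts().index.values.astype('str')
    y = train['general_cat'].value_counts().values
    pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))]
    trace1 = go.Bar(x=x, y=y, text=pct)
    layout = dict(title= 'Number of Items by Main Category',
                  yaxis = dict(title='Count'),
                  xaxis = dict(title='Category'))
    fig=dict(data=[trace1], layout=layout)
    py.iplot(fig)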

    x = train['subcat_1'].value_counts().index.values.astype('str')[:15]
    y = train['subcat_1'].value_counts().values[:15]
    pct = [("%.2f"%(v*100))+"%"for v in (y/len(train))][:15]
    trace1 = go.Bar(x=x, y=y, text=pct,
                    marker=dict(
                    color = y,colorscale='Portland',showscale=True,
                    reversescale = False
                    ))
    layout = dict(title= 'Number of Items by Sub Category (Top 15)',
                  yaxis = dict(title='Count'),
                  xaxis = dict(title='SubCategory'))
    fig=dict(data=[trace1], layout=layout)
    py.iplot(fig)
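
    The counts and percentages behind these bar charts can also be built as one small table, which is easier to inspect than hover text. A minimal sketch, assuming a made-up Series named toy in place of train['subcat_1']; value_counts(normalize=True) returns shares directly, so there is no need to divide by len(train) by hand.

```python
import pandas as pd

# Made-up labels standing in for a category column; illustrative only.
toy = pd.Series(["Women", "Women", "Beauty", "Kids", "Women", "Beauty"],
                name="general_cat")

counts = toy.value_counts()                # sorted from most to least frequent
share = toy.value_counts(normalize=True)   # same order, as fractions of the total

top = counts.to_frame("count")
top["pct"] = (share * 100).round(2)
top = top.head(2)                          # keep only the top-N rows
print(top)
```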

    Use box plots to compare price across the labels

    general_cats = train['general_cat'].unique()
    x = [train.loc[train['general_cat']==cat, 'price'] for cat in general_cats]
    
    # Passing the values as x draws horizontal boxes, one per category
    data = [go.Box(x=np.log(x[i]+1), name=general_cats[i]) for i in range(len(general_cats))]
    
    layout = dict(title="Price Distribution by General Category",
                  yaxis = dict(title='Category'),
                  xaxis = dict(title='log(price+1)'))
    fig = dict(data=data, layout=layout)
    py.iplot(fig)
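
    The quantities a box plot draws (median and quartiles) can be computed per category as well. The sketch below is an illustration on a made-up frame named toy, not part of the original analysis; the column names q1, median, q3 and iqr are chosen here for readability.

```python
import numpy as np
import pandas as pd

# Made-up prices for two categories; illustrative only.
toy = pd.DataFrame({
    "general_cat": ["Men"] * 5 + ["Kids"] * 5,
    "price": [10, 20, 30, 40, 50, 4, 6, 8, 10, 12],
})
toy["log_price"] = np.log1p(toy["price"])

# Quartiles of log(price+1) per category: one row per category after unstack().
quartiles = (toy.groupby("general_cat")["log_price"]
                .quantile([0.25, 0.5, 0.75])
                .unstack())
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]   # height of each box
print(quartiles)
```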

    4. Analyze brand_name (the brand names)

    print("There are %d unique brand names in the training dataset." % train['brand_name'].nunique())

    x = train['brand_name'].value_counts().index.values.astype('str')[:10]
    y = train['brand_name'].value_counts().values[:10]
    
    # trace1 = go.Bar(x=x, y=y, 
    #                 marker=dict(
    #                 color = y,colorscale='Portland',showscale=True,
    #                 reversescale = False
    #                 ))
    # layout = dict(title= 'Top 10 Brand by Number of Items',
    #               yaxis = dict(title='Count'),
    #               xaxis = dict(title='Brand Name'))
    # fig=dict(data=[trace1], layout=layout)
    # py.iplot(fig)
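
    Since the chart above is commented out, a plain table is one way to look at the same information. A minimal sketch on a made-up frame named toy (the brand names and prices are placeholders, not figures from the dataset): it reports the share of rows with no brand and the item count and median price of the most frequent brands.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for `train`; illustrative only.
toy = pd.DataFrame({
    "brand_name": ["Nike", np.nan, "Nike", "PINK", np.nan, "Apple", "Nike"],
    "price":      [20.0, 8.0, 30.0, 15.0, 5.0, 200.0, 25.0],
})

# Fraction of items without a brand (isnull() gives booleans; their mean is the share).
missing_share = toy["brand_name"].isnull().mean()

# groupby skips rows whose brand is NaN, so only named brands appear here.
by_brand = (toy.groupby("brand_name")["price"]
               .agg(["count", "median"])
               .sort_values("count", ascending=False)
               .head(10))
print("share of items without a brand: %.2f" % missing_share)
print(by_brand)
```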
  • Original post: https://www.cnblogs.com/zhengzhe/p/8983730.html