Mercari Price Suggestion
The evaluation metric is Root Mean Squared Logarithmic Error (RMSLE).
\[ \epsilon = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \log\left(p_i + 1\right) - \log\left(a_i + 1\right) \right)^2 } \]
Here n is the number of items, p_i is the predicted price, and a_i is the actual sale price.
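As a minimal sketch of the metric, the function below computes RMSLE with NumPy. The name `rmsle` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error between two price arrays."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # np.log1p(x) computes log(x + 1), matching the metric definition
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Identical predictions give zero error
perfect = rmsle([10.0, 52.0, 35.0], [10.0, 52.0, 35.0])
# Predicting 0 when the actual price is e - 1 gives a log difference of exactly 1
off = rmsle([0.0], [np.e - 1.0])
```

Because the metric works on log(price + 1), training a model on `np.log1p(price)` with plain RMSE optimizes the same quantity.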
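The submission file has two columns, test_id and price: the id of each item in the test data and its predicted price.
- train_id / test_id: the item id; each item has exactly one id
- name: the item name
- item_condition_id: the item condition
- category_name: the category
- brand_name: the brand
- price: the price the item sold for; this column is absent from the test set and is the value to predict
- shipping: 1 if the shipping fee is paid by the seller, 0 if it is paid by the buyer
- item_description: the full item description; price-like values have been removed and replaced with [rm]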
import pandas as pd
import numpy as np
df = pd.read_csv('input/train.tsv', sep='\t')  # the file is tab-separated
data information
df.head()
 | train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description |
0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | No description yet |
1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... |
2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... |
3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... |
4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 8 columns):
train_id 1482535 non-null int64
name 1482535 non-null object
item_condition_id 1482535 non-null int64
category_name 1476208 non-null object
brand_name 849853 non-null object
price 1482535 non-null float64
shipping 1482535 non-null int64
item_description 1482531 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 90.5+ MB
price distribution
df.price.describe()
count 1.482535e+06
mean 2.673752e+01
std 3.858607e+01
min 0.000000e+00
25% 1.000000e+01
50% 1.700000e+01
75% 2.900000e+01
max 2.009000e+03
Name: price, dtype: float64
import matplotlib.pyplot as plt
plt.subplot(1, 2, 1)  # one row, two columns, first plot: plt.subplot(rows, cols, index)
df.price.plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white', range = [0, 250])
plt.xlabel('price', fontsize=12)
plt.title('Price Distribution', fontsize=12)
plt.subplot(1, 2, 2)
np.log((df.price+1)).plot.hist(bins=50, figsize=(12, 6), edgecolor = 'white')
plt.xlabel('log(price+1)', fontsize=12)
plt.title('log(Price+1) Distribution', fontsize=12)
Text(0.5, 1.0, 'log(Price+1) Distribution')
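- The price distribution is right-skewed: most prices cluster around 10-20, while the maximum is 2009. Applying a log transform, log(price+1), gives a distribution that is much closer to normal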
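To illustrate the effect of the log transform, here is a small sketch on synthetic data. The log-normal sample is an assumption made for the example and is not the Mercari data; only `np.log1p` and `np.expm1` correspond to what the notebook uses.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal); NOT the Mercari data
prices = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=10_000))

log_prices = np.log1p(prices)    # forward transform: log(price + 1)
restored = np.expm1(log_prices)  # inverse transform back to the price scale

skew_raw = prices.skew()         # strongly positive for a long right tail
skew_log = log_prices.skew()     # much closer to 0 after the transform
```

The round trip through `np.expm1` matters later: a model trained on `log1p(price)` predicts on the log scale, so its output has to be mapped back before submission.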
0 0.552726
1 0.447274
Name: shipping, dtype: float64
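- 55.3% of items have shipping paid by the buyer (shipping = 0) and 44.7% by the seller (shipping = 1), so it is worth checking whether shipping affects price
shipping_yes = df.loc[df['shipping'] == 1, 'price']  # seller pays shipping
shipping_no = df.loc[df['shipping'] == 0, 'price']  # buyer pays shipping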
fig,ax = plt.subplots(figsize=(8,5))
plt.title('price_distribution by shipping method')
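print("Average price when the buyer pays shipping: %s dollars" % round(shipping_no.mean(), 2))
print("Average price when the seller pays shipping: %s dollars" % round(shipping_yes.mean(), 2))
Average price when the buyer pays shipping: 30.11 dollars
Average price when the seller pays shipping: 22.57 dollars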
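The same comparison can be written as a single `groupby`, which also makes it easy to look at the median alongside the mean. This is a sketch on a toy frame; the rows are made up for illustration and are not taken from train.tsv.

```python
import pandas as pd

# Toy rows for illustration only; values are made up
toy = pd.DataFrame({
    'shipping': [0, 0, 0, 1, 1, 1],
    'price':    [30.0, 45.0, 15.0, 10.0, 25.0, 22.0],
})

# One row per shipping value, with mean, median and row count of price
summary = toy.groupby('shipping')['price'].agg(['mean', 'median', 'count'])
```

On a right-skewed target like price, the median is less sensitive to a few expensive items than the mean, so comparing both is a reasonable sanity check.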
fig,ax = plt.subplots(figsize=(8,5))
plt.title('log(price+1)_distribution by shipping method')
Handling categorical data
- The name, category, brand, and item_condition_id columns in the dataset all need to be converted to the category dtype
item_condition_id vs price
- Notes on the standard box plot
import seaborn as sns
sns.boxplot(x = 'item_condition_id', y = np.log(df['price']+1), data = df, palette = sns.color_palette('RdBu',5))
<matplotlib.axes._subplots.AxesSubplot at 0x127d5bdd8>
- Prices vary widely across item conditions
- settings
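The code further down refers to four module-level constants: `NUM_BRANDS`, `NUM_CATEGORIES`, `NAME_MIN_DF`, and `MAX_FEATURES_ITEM_DESCRIPTION`. Their values do not appear in this text, so the numbers below are placeholder assumptions, chosen only so that the later code has something to run with; they should be tuned rather than taken as the author's settings.

```python
# Placeholder values (assumptions, not the author's settings)
NUM_BRANDS = 4000                      # keep only the most frequent brands
NUM_CATEGORIES = 1000                  # keep only the most frequent categories
NAME_MIN_DF = 10                       # min document frequency for tokens in `name`
MAX_FEATURES_ITEM_DESCRIPTION = 50000  # vocabulary cap for the TF-IDF features
```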
"There are %d items that do not have a category name" % df['category_name'].isnull().sum()
'There are 6327 items that do not have a category name'
"There are %d items that do not have a brand name" % df['brand_name'].isnull().sum()
'There are 632682 items that do not have a brand name'
"There are %d items that do not have a item_description " % df['item_description'].isnull().sum()
'There are 4 items that do not have a item_description '
def handling_missing_inplace(datasets):
    # 'No description yet' is a placeholder for a missing description
    # (only noticeable when inspecting the data closely)
    datasets['item_description'] = datasets['item_description'].replace('No description yet', 'missing')
    datasets['item_description'] = datasets['item_description'].fillna('missing')

def cutting(datasets):
    # Keep the NUM_BRANDS most frequent brands; everything else (including NaN) becomes 'missing'
    pop_brand = datasets['brand_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_BRANDS]
    datasets.loc[~datasets['brand_name'].isin(pop_brand), 'brand_name'] = 'missing'
    # Same for the NUM_CATEGORIES most frequent categories
    pop_category = datasets['category_name'].value_counts().loc[lambda x: x.index != 'missing'].index[:NUM_CATEGORIES]
    datasets.loc[~datasets['category_name'].isin(pop_category), 'category_name'] = 'missing'

def to_category(datasets):
    datasets['category_name'] = datasets['category_name'].astype('category')
    datasets['brand_name'] = datasets['brand_name'].astype('category')
    datasets['item_condition_id'] = datasets['item_condition_id'].astype('category')
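To make the truncation step concrete, here is a small self-contained sketch of the same idea on toy data. The helper name `keep_top_n` and the brand values are hypothetical; unlike `cutting`, it takes `n` as a parameter instead of reading a global constant.

```python
import pandas as pd

def keep_top_n(series, n, fill='missing'):
    """Hypothetical helper: keep the n most frequent labels, replace the rest with `fill`."""
    counts = series.value_counts()  # sorted by frequency; NaN is excluded by default
    top = counts[counts.index != fill].index[:n]
    # Labels outside the top n (and missing values) are replaced by `fill`
    return series.where(series.isin(top), fill)

brands = pd.Series(['Nike', 'Nike', 'Nike', 'Apple', 'Apple', 'Razer', None])
trimmed = keep_top_n(brands, n=2)
```

Collapsing rare labels this way bounds the width of the one-hot encoding produced later, at the cost of treating all rare brands as one group.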
- Looking at the price value counts shows that some items have a price of 0, so those rows need to be removed
 | index | price |
25 | 3.0 | 18703 |
28 | 4.0 | 16139 |
17 | 5.0 | 31502 |
261 | 5.5 | 33 |
16 | 6.0 | 32260 |
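Removing the zero-price rows is a one-line boolean filter. The sketch below uses a toy frame with made-up values; on the real data the same filter would be applied to the training frame before taking `log1p` of the target.

```python
import pandas as pd

# Toy frame for illustration; values are made up
toy = pd.DataFrame({'train_id': [0, 1, 2, 3], 'price': [10.0, 0.0, 52.0, 0.0]})

n_zero = int((toy['price'] == 0).sum())                  # how many rows would be dropped
cleaned = toy[toy['price'] > 0].reset_index(drop=True)   # keep only positive prices
```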
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelBinarizer
import lightgbm as lgb
from scipy.sparse import csr_matrix, hstack  # for building and stacking sparse matrices
# referenc
import gc
import time
from sklearn.linear_model import Ridge
def main():
    start_time = time.time()

    train = pd.read_table('input/train.tsv', engine='c')
    # train = train[train['price'] != 0]
    test = pd.read_table('input/test_stg2.tsv', engine='c')
    print('[{}] Finished to load data'.format(time.time() - start_time))
    print('Train shape: ', train.shape)
    print('Test shape: ', test.shape)

    nrow_train = train.shape[0]
    y = np.log1p(train["price"])  # train on log(price + 1) to match the RMSLE metric
    merge: pd.DataFrame = pd.concat([train, test])
    submission: pd.DataFrame = test[['test_id']].copy()
    del train
    del test
    gc.collect()

    handling_missing_inplace(merge)
    print('[{}] Finished to handle missing'.format(time.time() - start_time))
    cutting(merge)
    print('[{}] Finished to cut'.format(time.time() - start_time))
    to_category(merge)
    print('[{}] Finished to convert categorical'.format(time.time() - start_time))

    cv = CountVectorizer(min_df=NAME_MIN_DF)
    X_name = cv.fit_transform(merge['name'])
    print('[{}] Finished count vectorize `name`'.format(time.time() - start_time))

    cv = CountVectorizer()
    X_category = cv.fit_transform(merge['category_name'])
    print('[{}] Finished count vectorize `category_name`'.format(time.time() - start_time))

    tv = TfidfVectorizer(max_features=MAX_FEATURES_ITEM_DESCRIPTION,
                         ngram_range=(1, 3))
    X_description = tv.fit_transform(merge['item_description'])
    print('[{}] Finished TFIDF vectorize `item_description`'.format(time.time() - start_time))

    lb = LabelBinarizer(sparse_output=True)
    X_brand = lb.fit_transform(merge['brand_name'])
    print('[{}] Finished label binarize `brand_name`'.format(time.time() - start_time))

    # Cast to float so the dummy columns and `shipping` share one numeric dtype
    X_dummies = csr_matrix(pd.get_dummies(merge[['item_condition_id', 'shipping']]).astype(float).values)
    print('[{}] Finished to get dummies on `item_condition_id` and `shipping`'.format(time.time() - start_time))

    sparse_merge = hstack((X_dummies, X_description, X_brand, X_category, X_name)).tocsr()
    print('[{}] Finished to create sparse merge'.format(time.time() - start_time))

    X = sparse_merge[:nrow_train]
    X_test = sparse_merge[nrow_train:]

    #train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size = 0.1, random_state = 144)
    d_train = lgb.Dataset(X, label=y)
    #d_valid = lgb.Dataset(valid_X, label=valid_y, max_bin=8192)
    #watchlist = [d_train, d_valid]

    params = {
        'learning_rate': 0.73,
        'application': 'regression',
        'max_depth': 3,
        'num_leaves': 100,
        'verbosity': -1,
        'metric': 'RMSE',
    }
    # Note: `verbose_eval` was removed from lgb.train in LightGBM 4.0; drop it on newer versions
    model = lgb.train(params, train_set=d_train, num_boost_round=3000, verbose_eval=100)
    preds = 0.56 * model.predict(X_test)

    model = Ridge(solver="sag", fit_intercept=True, random_state=42)
    model.fit(X, y)
    print('[{}] Finished to train ridge'.format(time.time() - start_time))
    preds += 0.44 * model.predict(X=X_test)
    print('[{}] Finished to predict ridge'.format(time.time() - start_time))

    # Predictions are on the log(price + 1) scale; map them back to prices
    submission['price'] = np.expm1(preds)
    submission.loc[submission['price'] < 0.0, 'price'] = 0.0
    submission.to_csv("sample_submission_stg2.csv", index=False)

if __name__ == '__main__':
    main()
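The final prediction is a weighted average of two models on the log(price + 1) scale, converted back with `np.expm1` and clipped at zero. The sketch below isolates that step. The helper name `blend_log_predictions` and the sample inputs are assumptions for illustration; the default weight 0.56 is the one used for the LightGBM predictions in `main`.

```python
import numpy as np

def blend_log_predictions(log_pred_a, log_pred_b, weight_a=0.56):
    """Weighted average of two log1p-scale predictions, mapped back to prices."""
    log_pred_a = np.asarray(log_pred_a, dtype=float)
    log_pred_b = np.asarray(log_pred_b, dtype=float)
    blended = weight_a * log_pred_a + (1.0 - weight_a) * log_pred_b
    prices = np.expm1(blended)         # inverse of log1p
    return np.clip(prices, 0.0, None)  # a price cannot be negative

# Made-up log-scale predictions from two hypothetical models that agree
log_a = np.log1p(np.array([10.0, 50.0]))
log_b = np.log1p(np.array([10.0, 50.0]))
prices = blend_log_predictions(log_a, log_b)

# A negative log-scale prediction maps to a negative price and is clipped to 0
clipped = blend_log_predictions([-1.0], [-1.0])
```

Averaging on the log scale keeps the blend consistent with the RMSLE metric. In practice the weight would be chosen on a held-out split, such as the one in the commented-out `train_test_split` line, rather than fixed in advance.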