zoukankan html css js c++ java

[Tianchi] Repeat Buyers Prediction-Challenge the Baseline -- version 0

Introduction
EDA
Different Machine Learning Models
Model Blending by Stacking
Summary

Introduction

天池比赛项目，作为练习，这是第一个版本。赛题内容如下：

商家有时会在特定日期（例如“Boxing-day” ，“ 黑色星期五” 或“ 双11” ）进行大促销（例如折扣或现金券），以吸引大量新买家。许多吸引的买家都是一次性交易猎人，这些促销可能对销售产生很小的长期影响。为了缓解这个问题，商家必须确定谁可以转换为重复买家。通过瞄准这些潜力忠诚的客户，商家可以大大降低促销成本，提高投资回报率（ROI ）。

众所周知，在线广告领域，客户定位极具挑战性，特别是对于新买家而言。但是通过Tmall.com 长期积累的用户行为日志，我们或许可以解决这个问题。在这个挑战中，我们提供了一套商家及其在“双11 ”日促销期间获得的相应新买家。你的任务是预测对于指定商家的新买家将来是否会成为忠实客户。换句话说，您需要预测这些新买家在6 个月内再次从同一商家购买商品的概率。一个包含大约20 万用户的数据集用于训练，还有一个类似大小的数据集用于测试。与其他比赛类似，您可以提取任何特征，然后使用其他工具进行训练。您只需提交预测结果进行评估。

EDA

./data_format1/中

train_format1.csv提供了训练数据的（用户id, 商家id, 标签）；
test_format1.csv提供了测试数据的（用户id, 商家id）；
user_info_format1.csv提供了所有用户的年龄和性别信息；
user_log_format1.csv提供了所有用户的行为，包括：
- 用户id, 商家id,
- 该用户在该商家所感兴趣的：
  - 商品id, 类别id, 品牌id, 行为日期（time_stamp）,行为类别（包括0点击，1加入购物车，2购买，3收藏该店铺）

./data_format2/中包括的信息是一样的，但表达形式不同

train_format2.csv中的标签有-1,0,1，除了1之外，这里的-1表示该条数据不作为训练数据使用，但也记录了用户的行为，这些数据都被整合在了user_log_format1.csv中

特征组合

import pandas as pd
import numpy as np
train_feature = pd.read_csv('./traindata_features.csv')
features = list(train_feature.columns[3:])
print(features)

['num_action0', 'num_action1', 'num_action2', 'num_action3', 'ttl_items', 'ttl_cats', 'ttl_brands', 'bought_item_types', 'bought_cat_types', 'bought_brand_types', 'addtocart', 'addtofavorite', 'age__0.0', 'age__1.0', 'age__2.0', 'age__3.0', 'age__4.0', 'age__5.0', 'age__6.0', 'age__7.0', 'age__8.0', 'gender__0.0', 'gender__1.0', 'gender__2.0', 'ttlnum_item_types_in', 'ttlnum_cat_types_in', 'ttlnum_brand_types_in', 'ttlnum_item_types_all', 'ttlnum_cat_types_all', 'num_action0_in', 'num_action1_in', 'num_action2_in', 'num_buyers_in', 'num_buyers_buymore_in', 'num_action0_userin', 'num_action1_userin', 'num_action2_userin', 'num_action3_userin', 'num_action0_userall', 'num_action1_userall', 'num_action2_userall', 'num_action3_userall', 'num_items', 'item_boughtFreq_old', 'item_boughtFreq_new', 'item_boughtFreq_all', 'cat_boughtFreq_old', 'cat_boughtFreq_new', 'cat_boughtFreq_all', 'brand_boughtFreq_old', 'brand_boughtFreq_new', 'brand_boughtFreq_all', 'usr_buy_days', 'num_repeatbuy', '30days_repeatusers_cnt_max', '30days_repeatusers_cnt_min', '30days_repeatusers_cnt_avg', '60days_repeatusers_cnt_max', '60days_repeatusers_cnt_min', '60days_repeatusers_cnt_avg', '90days_repeatusers_cnt_max', '90days_repeatusers_cnt_min', '90days_repeatusers_cnt_avg']

len(features)

对于每一对（用户, 商家），得到特征：

(uid, sid): #(点击数), #(加入购物车数), #(购买数), #(收藏数).
(uid, sid): #(涵盖商品数目), #(涵盖商品类别), #(涵盖品牌).
(uid, sid): #(购买的商品数目), #(购买的商品类别), #(购买的品牌数目).
(uid, sid): ?(是否加入购物车), ?(是否收藏)

对于每一个(商家)，得到特征:

训练集中出现的：商品种类总数，类别种类总数，品牌种类总数；
用户日志中出现过的：商品种类总数，类别种类总数，品牌种类总数。
训练集中出现的:点击总数，加入购物车总数，购买总数，收藏总数;
用户日志中出现过的:点击总数，加入购物车总数，购买总数，收藏总数.
训练集中出现过的：买家总数，购买次数2次或以上的买家总数；
用户日志中出现过的：买家总数，购买次数2次或以上的买家总数。

通过相关性分析，我们只保留相关系数小于0.9的特征组合：

训练集中出现的：商商品种类总数，类别种类总数，品牌种类总数；
用户日志中出现过的：商商品种类总数，类别种类总数，品牌种类总数。
训练集中出现的:点击总数，加入购物车总数，购买总数。
训练集中出现过的：买家总数，购买次数2次或以上的买家总数.

对于每一个(用户)，得到特征:

年龄，性别，进行独热向量编码。
训练集中出现的:点击总数，加入购物车总数，购买总数，收藏总数;
用户日志中出现过的:点击总数，加入购物车总数，购买总数，收藏总数.

通过相关性分析，以上特征相关系数均小于0.9，因此都保留了。

针对购买的商品，可以提取以下特征:

对训练集/测试集中的每一对(用户，商家)，计算
- num_items: 购买商品次数。
- item_boughtFreq_old：所购买的商品在用户日志中（label=-1）被购买的频率，取最大值（还可以加入最小值、标准差，但是实验的时候忘记了）；
- item_boughtFreq_new: 所购买的商品在新增样本中（label=0，1）被购买的频率，取最大值；
- item_boughtFreq_all: 所购买的商品在全部样本中（label=-1，0，1）被购买的频率，取最大值；
- 上述三个特征分别应用于所购买的商品类别和品牌，又可以得到6个特征: 'cat_boughtFreq_old', 'cat_boughtFreq_new', 'cat_boughtFreq_all', 'brand_boughtFreq_old', 'brand_boughtFreq_new', 'brand_boughtFreq_all'.

利用时间戳提取特征：

用户有购买行为的天数；
一个用户重复购买的次数。
对于一对（用户、商家）：统计30天,60天,90天，类别被购买次数的最大、最小、标准差，可以看是否为常用各类别。

目前为止，得到共计63个特征。

缺失值处理

在该数据集中，存在缺失值的有年龄、性别、品牌。因年龄和性别的确实比例很少（1%左右），我们使用众数填充。
对于品牌id，可以根据其关联的商品和类别进行填充，从相同（用户，商品、类别、商家，时间戳）开始找，逐渐减少对照列，直到找到非空集合，使用众数填充。

def fill_brand_id_nan(datalog):
    ix_group = datalog[datalog.brand_id.isnull()].ix[:, ['user_id','seller_id','cat_id', 'item_id', 'time_stamp']]
    ix_nan = [{'uid': x[1].user_id, 'sid': x[1].seller_id, 'cid':x[1].cat_id, 'tid': x[1].item_id, 'time':x[1].time_stamp} for x in ix_group.iterrows()]

    tmp1 = datalog.groupby(['user_id', 'seller_id','cat_id', 'item_id', 'time_stamp'])
    tmp2 = datalog.groupby(['user_id', 'seller_id', 'cat_id', 'item_id'])
    tmp3 = datalog.groupby(['seller_id', 'cat_id'])
    tmp4 = datalog.groupby(['seller_id', 'item_id'])
    tmp5 = datalog.groupby(['seller_id'])
    
    replace_value = []
    for xindex, x in ix_group.iterrows():
        rv = tmp1.get_group((x.user_id, x.seller_id, x.cat_id, x.item_id, x.time_stamp)).brand_id.mode()
        if rv.empty:
            rv = tmp2.get_group((x.user_id, x.seller_id, x.cat_id, x.item_id)).brand_id.mode()
            if rv.empty:
                rv = tmp3.get_group((x.seller_id, x.cat_id)).brand_id.mode()
                if rv.empty:
                    rv = tmp4.get_group((x.seller_id, x.item_id)).brand_id.mode()
                    if rv.empty:
                        rv = tmp5.get_group((x.seller_id)).brand_id.mode()
        try:
            replace_value.append(rv)
            datalog.brand_id[xindex] = rv[0]
        except:        
            print('Error')
            print(tmp1.get_group((x.user_id, x.seller_id, x.cat_id, x.item_id, x.time_stamp)))
            print(tmp2.get_group((x.user_id, x.seller_id, x.cat_id, x.item_id)))
            break

    return datalog

对特征进行归一化/标准化处理：

from sklearn.preprocessing import MinMaxScaler,StandardScaler

x = train_feature[features]

minmax = MinMaxScaler()
minmax.fit(x.as_matrix())
x_mm = minmax.transform(x.as_matrix())
x_mm = pd.DataFrame(x_mm, columns = x.columns)

scaler = StandardScaler()
scaler.fit(x.as_matrix())
x_scal = scaler.transform(x.as_matrix())
x_scal = pd.DataFrame(x_scal, columns = x.columns)

from sklearn.model_selection import train_test_split
y = train_feature['label']
X_train, X_test, y_train, y_test = train_test_split(x_mm[features].values, y.values, test_size=0.25)

特征重要性

from sklearn.externals import joblib
clf_xgb = joblib.load('./model/model_xgb_X_train_feat63_0.6849504_583rounds.m')

from xgboost import plot_importance
from matplotlib import pyplot as plt
%matplotlib inline
fig,axes = plt.subplots(figsize=(10,15))
plot_importance(clf_xgb,ax=axes)

featdict = dict(zip(clf_xgb.feature_names,features))
yticklabel = [d.get_text() for d in list(axes.get_yticklabels())]
toset_yticklabel = [featdict[y]for y in yticklabel]

axes.set_yticklabels(toset_yticklabel)
# axes.get_yticklabels
plt.show()

可以看出，重要度排在前面的有:

对于每一个（用户），用户日志中出现过的收藏总数；
对于每一对（用户，商家），所购买的商品在用户日志中（label=-1）被购买的频率的最大值；
对于每一对（用户，商家），所购买的商品在全部样本中（label=-1，0，1）被购买的频率的最大值；
对于每一对（用户，商家），所购买的品牌在全部样本中（label=-1，0，1）被购买的频率的最大值；
对于每一对（用户，商家），所购买的商品在新增样本中（label=0，1）被购买的频率的最大值；
对于每一个（商家），用户日志中出现过的商品种类总数；
对于每一对（用户，商家），所购买的类别在新增样本中（label=0，1）被购买的频率的最大值；
对于每一个（商家），训练集样本中出现过的商品种类总数。
对于每一个（用户），用户日志中出现过的加入购物车总数；
对于每一对（用户，商家），所购买的类别在全部样本中（label=-1，0，1）被购买的频率的最大值；
对于每一对（用户，商家），所购买的品牌在用户日志中（label=-1）被购买的频率的最大值；
对于每一个（用户），用户日志中出现过的点击总数；
...

如果调整num_boost_round, 会使特征重要度排名稍微改变，但总体变化并不大。

该模型的训练过程如下：

xgb_params = {
    # general parameters
    'booster': 'gbtree',
    'silent': 1,

    # booster parameters
    'learning_rate': 0.05,
    'gamma': 0,
    'max_depth': 4,
    'min_child_weight': 1,
    'max_delta_step': 0,    
    'subsample': 0.5,
    'colsample_bytree': 1.0,

    # task parameters
    'objective': 'reg:logistic',
    'eval_metric': 'auc',
    'seed': 0,
}
x = train_feature[features]
y = train_feature['label']
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.25) #data is split in a stratified fashion
dtrain = xgb.DMatrix(X_train[features].values,y_train['label'].values)
clf_xgb = xgb.train(xgb_params, dtrain=dtrain, num_boost_round = 583)

# num_boost_round = 583: 
#       用相同的参数跑了交叉验证（xgb.cv），cv_num=5，在第583轮的时候测试集平均auc最高。

#data is split in a stratified fashion

joblib.dump(toset_yticklabel,'./model/feat_importance1.m')

['./model/feat_importance1.m']

Different Machine Learning Models

XGBoost

import xgboost as xgb
njobs = 2
xgb1 = xgb.XGBClassifier(scale_pos_weight = 15.3, 
                         n_estimators = 600, 
                         learning_rate = 0.05,
                         max_depth = 4, 
                         min_child_weight = 1,
                         subsample = 0.5,
                         gamma = 0,
                         colsample_bytree = 1,
                         objective = 'reg:logistic', 
                         eval_metric = 'auc',
                         seed = 0,
                         nthread = njobs)
xgb1.fit(X_train, y_train)
probas_xgb = xgb1.predict_proba(X_test)
pred_xgb = xgb1.predict(X_test)


from sklearn.metrics import roc_auc_score
# y_test = y_test.tolist()
auc_xgb = roc_auc_score(y_test, pred_xgb)

D:UsersgaoAnaconda3libsite-packagessklearnpreprocessinglabel.py:151: DeprecationWarning: The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.
  if diff:

Random Forest

from sklearn.ensemble import RandomForestClassifier
njobs = 2
rf1 = RandomForestClassifier(class_weight={0: 1, 1: 15.3}, 
                             criterion = 'gini', 
                             max_depth = 50, 
                             max_features = 'sqrt', 
                             max_leaf_nodes = None, 
                             min_samples_leaf = 20, 
                             min_samples_split = 50, 
                             n_estimators = 750, 
                             random_state = 0)
rf1.fit(X_train, y_train)
probas_rf = rf1.predict_proba(X_test)
pred_rf = rf1.predict(X_test)


from sklearn.metrics import roc_auc_score
# y_test = y_test.tolist()
auc_rf = roc_auc_score(y_test, pred_rf)

GBDT

from sklearn.ensemble import GradientBoostingClassifier
njobs = 2
gbdt1 = GradientBoostingClassifier(learning_rate = 0.05,
                                  subsample = 0.5, 
                                  max_depth = 5,
                                  n_estimators = 100,
                                  loss = 'deviance', 
                                  random_state = 0,
                                  min_samples_split = 100,
                                  min_samples_leaf = 120,
                                  max_features = 'sqrt',
                                  )

gbdt1.fit(X_train, y_train)
probas_gbdt = gbdt1.predict_proba(X_test)
pred_gbdt = gbdt1.predict(X_test)


from sklearn.metrics import roc_auc_score
# y_test = y_test.tolist()
auc_gbdt = roc_auc_score(y_test, pred_gbdt)

Plot ROC Curve

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt 
import numpy as np

fpr, tpr, thresholds = roc_curve(y_test, probas_rf[:, 1])
roc_auc = auc(fpr, tpr)
print('rf AUC:', roc_auc)
plt.plot(fpr, tpr, 'b-', label = 'random forest')

fpr, tpr, thresholds = roc_curve(y_test, probas_xgb[:, 1])
roc_auc = auc(fpr, tpr)
print('xgb AUC:', roc_auc)
plt.plot(fpr, tpr, 'g-.', label= 'xgboost')

fpr, tpr, thresholds = roc_curve(y_test, probas_gbdt[:, 1])
roc_auc = auc(fpr, tpr)
print('gbdt AUC:', roc_auc)
plt.plot(fpr, tpr, 'g-.', label= 'gbdt')
    
score_all = (np.array(probas_rf[:, 1]) + np.array(probas_xgb[:, 1]) + np.array(probas_gbdt[:, 1]))*0.5
fpr, tpr, thresholds = roc_curve(y_test, score_all)
roc_auc = auc(fpr, tpr)
print('AUC:', roc_auc)
plt.plot(fpr, tpr, 'r', label = 'ensemble')
    
plt.plot([1,0],[0,1], 'k-.')

plt.ylabel('FPR')
plt.xlabel('TPR')
plt.title('ROC curve')
plt.legend(loc="lower right")

# plt.savefig('ROC.png', dpi =1000)

rf AUC: 0.6796455895487219
xgb AUC: 0.6828011772216968
gbdt AUC: 0.6670128952067169
AUC: 0.6861724266190854





<matplotlib.legend.Legend at 0x2a816cca898>

调参

def train_optimal_classifier(clf,param,xdata,ydata):
    grid_search = GridSearchCV(
        clf,
        param_grid=param,
        cv=cv_num,
        verbose=2,
        scoring='roc_auc',
        error_score=0,
        refit=True ,
        n_jobs=njobs
        )

    grid_search.fit(xdata, ydata)
    print("Best parameters")
    print(grid_search.best_params_)
    print('best auc: ')
    print(grid_search.best_score_)
    return (grid_search.best_estimator_)

Model Blending by Stacking

将train data划分后，在训练集上交叉验证+模型融合（xgboost->random forest->GBDT），在测试集上预测结果：

clfs = [xgb1, rf1, gbdt1]

X_train, X_test, y_train, y_test = train_test_split(x_mm[features].values, y.values, test_size=0.25)
dataset_blend_train = np.zeros((X_train.shape[0], len(clfs)))
dataset_blend_test = np.zeros((X_test.shape[0], len(clfs)))
# '''5fold stacking'''
n_folds = 5
skf = list(StratifiedKFold(y_train, n_folds))
for j, clf in enumerate(clfs):
#    '''依次训练各个单模型'''
    print(j, clf)
    
    dataset_blend_test_j = np.zeros((X_test.shape[0], len(skf)))
    
    for i, (train, test) in enumerate(skf):
#        '''使用第i个部分作为预测，剩余的部分来训练模型，获得其预测的输出作为第i部分的新特征。'''
        print("Fold", i)
        X_ttrain, y_ttrain, X_ttest, y_ttest = X_train[train], y_train[train], X_train[test], y_train[test]
        clf.fit(X_ttrain, y_ttrain)
        y_submission = clf.predict_proba(X_ttest)[:, 1]
        dataset_blend_train[test, j] = y_submission
        dataset_blend_test_j[:, i] = clf.predict_proba(X_test)[:, 1]
#    '''对于测试集，直接用这k个模型的预测值均值作为新的特征。'''
    dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)
    print("val auc Score: %f" % roc_auc_score(y_test, dataset_blend_test[:, j]))


# clf = LogisticRegression()
clf = GradientBoostingClassifier(learning_rate=0.02, subsample=0.5, max_depth=6, n_estimators=30)
clf.fit(dataset_blend_train, y_train)
y_submission = clf.predict_proba(dataset_blend_test)[:, 1]

print("Linear stretch of predictions to [0,1]")
y_submission = (y_submission - y_submission.min()) / (y_submission.max() - y_submission.min())
print("blend result")
print("val auc Score: %f" % (roc_auc_score(y_test, y_submission)))

一次运行后的输出如下，比相同参数下的各个模型单独训练的结果要好，进行模型融合auc后也有所提升:

0 XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
   colsample_bytree=1, eval_metric='auc', gamma=0, learning_rate=0.05,
   max_delta_step=0, max_depth=4, min_child_weight=1, missing=None,
   n_estimators=600, n_jobs=1, nthread=30, objective='reg:logistic',
   random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=15.3,
   seed=0, silent=True, subsample=0.5)

Fold 0

Fold 1

Fold 2

Fold 3

Fold 4

val auc Score: 0.689373

1 RandomForestClassifier(bootstrap=True, class_weight={0: 1, 1: 15.3},

criterion='gini', max_depth=50, max_features='sqrt',

max_leaf_nodes=None, min_impurity_decrease=0.0,

min_impurity_split=None, min_samples_leaf=20,

min_samples_split=50, min_weight_fraction_leaf=0.0,
n_estimators=750, n_jobs=30, oob_score=False, random_state=0,

verbose=0, warm_start=False)

Fold 0

Fold 1

Fold 2

Fold 3

Fold 4

val auc Score: 0.685798

2 GradientBoostingClassifier(criterion='friedman_mse', init=None,

learning_rate=0.05, loss='deviance', max_depth=5,

max_features='sqrt', max_leaf_nodes=None,

min_impurity_decrease=0.0, min_impurity_split=None,

min_samples_leaf=120, min_samples_split=100,

min_weight_fraction_leaf=0.0, n_estimators=100,

presort='auto', random_state=0, subsample=0.5, verbose=0,

warm_start=False)

Fold 0

Fold 1

Fold 2

Fold 3

Fold 4

val auc Score: 0.673720

Linear stretch of predictions to [0,1]

blend result

val auc Score: 0.692262

Summary

Reference

基于阿里巴巴大数据重复购买预测的实证研究，王克利，邓飞其，http://www.cnki.com.cn/Article/CJFDTotal-YNJR201803154.htm
Ensemble Learning-模型融合-Python实现 by 立刻有, https://blog.csdn.net/shine19930820/article/details/75209021#3-python实现

查看全文

相关阅读:
基于Lucene/XML的站内全文检索解决方案
 内容管理系统(CMS)的设计和选型
 Lucene入门与使用[转]
为自己的系统搞个全文搜索参考值：2 (转)
C# 时间函数
 Lucene倒排索引原理(转)
什么是内容管理系统CMS？
网络测试常用命令
 C#与C的区别
 人生致命的八个经典问题

原文地址：https://www.cnblogs.com/expttt/p/9432725.html