https://yq.aliyun.com/articles/293596
https://www.kaggle.com/c/outbrain-click-prediction
https://www.kaggle.com/anokas/outbrain-eda
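Personalized click-through rate (CTR) prediction
Basic scenario:
A user (uuid) views a document (document_id) and is shown a set of ads (ad_id); the task is to predict which ad the user clicks.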
Raw data:
page_views.csv: the log of users visiting documents
- uuid
- document_id
- timestamp (ms since 1970-01-01, minus 1465876799998; add 1465876799998 to recover the epoch time in ms)
- platform (desktop = 1, mobile = 2, tablet = 3)
- geo_location (country>state>DMA)
- traffic_source (internal = 1, search = 2, social = 3)
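The timestamp and geo_location fields need a little decoding before they are useful. The following is a minimal sketch, assuming the offset 1465876799998 is simply added back to get epoch milliseconds and that geo_location levels are separated by ">". The toy rows are made up for illustration and are not taken from page_views.csv.

```python
import pandas as pd

# Offset (ms) that the dataset subtracts from every timestamp.
TIMESTAMP_OFFSET_MS = 1465876799998

# Hypothetical toy rows in the page_views.csv layout.
page_views = pd.DataFrame({
    "uuid": ["u1", "u2"],
    "document_id": [10, 11],
    "timestamp": [0, 61_000],
    "platform": [1, 2],
    "geo_location": ["US>CA>807", "GB"],
    "traffic_source": [1, 3],
})

# Recover absolute time: add the offset back, then read the value as epoch ms (UTC).
page_views["datetime_utc"] = pd.to_datetime(
    page_views["timestamp"] + TIMESTAMP_OFFSET_MS, unit="ms"
)

# Split country>state>DMA into three columns; missing levels stay empty.
geo = page_views["geo_location"].str.split(">", expand=True)
geo = geo.reindex(columns=range(3))
geo.columns = ["country", "state", "dma"]
page_views = pd.concat([page_views, geo], axis=1)

print(page_views[["datetime_utc", "country", "state", "dma"]])
```

A relative timestamp of 0 therefore corresponds to the first instant covered by the dataset.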
clicks_train.csv:
- display_id
- ad_id
- clicked (1 if clicked, 0 otherwise)
events.csv: (information on the display_id context)
- display_id
- uuid
- document_id
- timestamp
- platform
- geo_location
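Since clicks_train.csv only carries display_id, ad_id and clicked, the user and page context has to be joined in from events.csv. A minimal sketch on made-up rows, assuming each display_id occurs exactly once in events.csv (the `validate` argument makes pandas raise an error if that assumption does not hold):

```python
import pandas as pd

# Hypothetical toy rows in the clicks_train.csv layout.
clicks = pd.DataFrame({
    "display_id": [1, 1, 1, 2, 2],
    "ad_id": [100, 101, 102, 100, 103],
    "clicked": [0, 1, 0, 1, 0],
})

# Hypothetical toy rows in the events.csv layout.
events = pd.DataFrame({
    "display_id": [1, 2],
    "uuid": ["u1", "u2"],
    "document_id": [10, 11],
    "timestamp": [61, 3500],
    "platform": [1, 2],
    "geo_location": ["US>CA>807", "GB"],
})

# Many click rows map to one event row, so this is a many-to-one left join.
train = clicks.merge(events, on="display_id", how="left", validate="many_to_one")

print(train[["display_id", "ad_id", "clicked", "uuid", "document_id"]])
```

A left join keeps every (display_id, ad_id) pair even if its context row is missing.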
promoted_content.csv: details on the ads.
- ad_id
- document_id
- campaign_id
- advertiser_id
documents_meta.csv: details on the documents.
- document_id
- source_id (the part of the site on which the document is displayed, e.g. edition.cnn.com)
- publisher_id
- publish_time
documents_topics.csv, documents_entities.csv, and documents_categories.csv all provide information about the content in a document, as well as Outbrain's confidence in each respective relationship.
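One simple way to use the confidence values is to keep only the most confident topic per document. The sketch below uses made-up rows and assumes the columns are named document_id, topic_id and confidence_level; these names are not listed in the notes above, so check the actual file header before reusing this.

```python
import pandas as pd

# Hypothetical toy rows; column names are an assumption.
doc_topics = pd.DataFrame({
    "document_id": [10, 10, 11, 11, 11],
    "topic_id": [1, 2, 1, 3, 4],
    "confidence_level": [0.9, 0.2, 0.1, 0.6, 0.3],
})

# Row label of the highest-confidence topic within each document.
best_rows = doc_topics.groupby("document_id")["confidence_level"].idxmax()

# Map each document_id to its most confident topic_id.
top_topic = doc_topics.loc[best_rows].set_index("document_id")["topic_id"]

print(top_topic)
```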
Data analysis:
import os
import gc  # memory gets cleared often with files this large
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df_train = pd.read_csv('./outbrain-click-prediction/clicks_train.csv')
df_test = pd.read_csv('./outbrain-click-prediction/clicks_test.csv')
# Distribution of the number of ads per display (page)
size_train = df_train.groupby('display_id')['ad_id'].count().value_counts()
size_train = size_train / np.sum(size_train)
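To see what the ads-per-display computation produces, here is a minimal sketch on a made-up frame. It uses `value_counts(normalize=True)`, which gives the same proportions as dividing the counts by their sum.

```python
import pandas as pd

# Hypothetical toy clicks: displays 1 and 3 show three ads, displays 2 and 4 show two.
toy = pd.DataFrame({
    "display_id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "ad_id": [100, 101, 102, 100, 103, 101, 104, 105, 100, 101],
})

# Number of ads shown in each display.
ads_per_display = toy.groupby("display_id")["ad_id"].count()

# Share of displays for each display size, ordered by size.
size_dist = ads_per_display.value_counts(normalize=True).sort_index()

print(size_dist)
```

The index of the result is the number of ads in a display and the values are the share of displays with that size.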
Bar chart of the distribution:
plt.figure(figsize=(12, 4))
p = sns.color_palette()
# Pass x and y by keyword; recent seaborn releases do not accept them positionally
sns.barplot(x=size_train.index, y=size_train.values, alpha=0.8, color=p[0], label='train')
plt.legend()
plt.xlabel('Number of Ads in display', fontsize=12)
plt.ylabel('Proportion of set', fontsize=12)
Count how many times each ad appears:
# Either of the following two lines works
df_train.groupby('ad_id')['ad_id'].count()
df_train.groupby('ad_id').size()
Fraction of ads in the test set that also appear in the training set:
len(set(df_test.ad_id.unique()).intersection(df_train.ad_id.unique())) / len(df_test.ad_id.unique())
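The same overlap ratio can be wrapped in a small helper and checked on made-up IDs. `ad_overlap` is a hypothetical name introduced here for illustration.

```python
import pandas as pd


def ad_overlap(train_ads: pd.Series, test_ads: pd.Series) -> float:
    """Return the fraction of distinct test ads that also occur in train."""
    train_set = set(train_ads.unique())
    test_set = set(test_ads.unique())
    return len(test_set & train_set) / len(test_set)


# Hypothetical toy ad_id columns.
train_ads = pd.Series([100, 101, 101, 102])
test_ads = pd.Series([101, 102, 103, 104, 104])

# Distinct test ads are 101, 102, 103, 104; two of them (101, 102) are in train.
overlap = ad_overlap(train_ads, test_ads)
print(overlap)
```

A value close to 1 means nearly every test ad has click history in the training data; a low value means many ads are unseen at training time.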
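Statistics on events.csv:
# Load events.csv (path assumed to sit next to clicks_train.csv)
events = pd.read_csv('./outbrain-click-prediction/events.csv')
print(events.columns.to_list())
print(events.head())
print(events.platform.value_counts())
# Cast to str so that values read with mixed types (int vs. str) are counted together
events.platform = events.platform.astype(str)
print(events.platform.value_counts())
# Number of events per user, ascending
print(events.groupby('uuid')['uuid'].count().sort_values())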
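The cast to str above suggests that the platform column is read with mixed types. The sketch below reproduces that situation on made-up rows; the "\N" placeholder value is a hypothetical example of a non-numeric entry, not something confirmed by these notes.

```python
import pandas as pd

# Hypothetical toy events: platform mixes ints, numeric strings and a placeholder.
events = pd.DataFrame({
    "uuid": ["u1", "u2", "u1", "u3", "u1"],
    "platform": [1, "1", 2, "2", "\\N"],
})

# Before the cast, 1 and "1" are different values.
distinct_before = events["platform"].nunique()

# After the cast, both collapse into the string "1".
events["platform"] = events["platform"].astype(str)
platform_counts = events["platform"].value_counts()

# Number of events per user, ascending.
events_per_user = events.groupby("uuid")["uuid"].count().sort_values()

print(distinct_before, platform_counts.to_dict(), events_per_user.to_dict())
```

Passing an explicit `dtype` for the column to `pd.read_csv` is another way to avoid the mixed types at load time.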