  • Analysis of Kaggle's Outbrain Click Prediction competition

    https://yq.aliyun.com/articles/293596

    https://www.kaggle.com/c/outbrain-click-prediction

    https://www.kaggle.com/anokas/outbrain-eda

    Personalized click-through rate (CTR) prediction for users

    Basic scenario:

    document_id(document)  uuid(user)  ad_id(a set of ads)

    Raw data:

    page_views.csv: the log of users visiting documents

    • uuid
    • document_id
    • timestamp (ms since 1970-01-01, minus 1465876799998; add 1465876799998 to recover the actual epoch time)
    • platform (desktop = 1, mobile = 2, tablet = 3)
    • geo_location (country>state>DMA)
    • traffic_source (internal = 1, search = 2, social = 3)
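The timestamp field above is stored relative to an offset of 1465876799998 ms, so it has to be shifted before it can be read as a calendar time. A minimal sketch of that conversion, using only the standard library; the names `TIMESTAMP_OFFSET_MS` and `to_utc_datetime` are made up for this illustration, and the input values are hypothetical rather than taken from the real file:

```python
from datetime import datetime, timedelta, timezone

# Offset given in the data description: stored value = epoch ms - 1465876799998.
TIMESTAMP_OFFSET_MS = 1465876799998
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_utc_datetime(relative_ms: int) -> datetime:
    """Convert a relative Outbrain timestamp (ms) to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=relative_ms + TIMESTAMP_OFFSET_MS)


# Hypothetical inputs: the very first instant, and one day later.
first = to_utc_datetime(0)
one_day_later = to_utc_datetime(86_400_000)
print(first.isoformat())
print(one_day_later.isoformat())
```

On a whole column, the same shift can be applied in pandas with `pd.to_datetime(df['timestamp'] + 1465876799998, unit='ms', utc=True)`.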

    clicks_train.csv:

    • display_id
    • ad_id
    • clicked (1 if clicked, 0 otherwise)

    events.csv: (information on the display_id context)

    • display_id
    • uuid
    • document_id
    • timestamp
    • platform
    • geo_location

    promoted_content.csv: details on the ads.

    • ad_id
    • document_id
    • campaign_id
    • advertiser_id

    documents_meta.csv: details on the documents.

    • document_id
    • source_id (the part of the site on which the document is displayed, e.g. edition.cnn.com)
    • publisher_id
    • publish_time

    documents_topics.csv, documents_entities.csv, and documents_categories.csv all provide information about the content in a document, as well as Outbrain's confidence in each respective relationship. 
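The files listed above link to each other through `display_id`, `ad_id` and `document_id`. The sketch below shows one possible way to assemble a single training table from them. All rows are invented toy data, and the renamed columns `display_document_id` and `ad_document_id` are hypothetical names. It assumes that `document_id` in events.csv is the page being viewed and `document_id` in promoted_content.csv is the page the ad points to, which is how I read the competition description.

```python
import pandas as pd

# Toy rows only; the real files are far larger.
clicks = pd.DataFrame({
    "display_id": [1, 1, 2, 2],
    "ad_id": [10, 11, 10, 12],
    "clicked": [0, 1, 1, 0],
})
events = pd.DataFrame({
    "display_id": [1, 2],
    "uuid": ["user_a", "user_b"],
    "document_id": [100, 101],
    "platform": [1, 2],
})
promoted = pd.DataFrame({
    "ad_id": [10, 11, 12],
    "document_id": [500, 501, 502],
    "campaign_id": [7, 7, 8],
    "advertiser_id": [3, 3, 4],
})

# Both tables carry a document_id column; rename them so they do not clash.
events = events.rename(columns={"document_id": "display_document_id"})
promoted = promoted.rename(columns={"document_id": "ad_document_id"})

# One row per (display_id, ad_id) pair, enriched with context and ad details.
joined = (
    clicks
    .merge(events, on="display_id", how="left")
    .merge(promoted, on="ad_id", how="left")
)
print(joined)
```

Left joins keep every row of the clicks table even when a lookup is missing, which makes gaps show up as NaN instead of silently dropping training examples.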

    Data analysis

    import numpy as np
    import pandas as pd
    import os
    import gc # We're gonna be clearing memory a lot
    import matplotlib.pyplot as plt
    import seaborn as sns
    # Jupyter/IPython magic; remove the next line when running as a plain script
    %matplotlib inline
    
    df_train = pd.read_csv('./outbrain-click-prediction/clicks_train.csv')
    df_test = pd.read_csv('./outbrain-click-prediction/clicks_test.csv')
    
    # Distribution of the number of ads per display
    size_train = df_train.groupby('display_id')['ad_id'].count().value_counts()
    # Normalize the counts into proportions
    size_train = size_train / np.sum(size_train)
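To make the groupby / count / value_counts chain easier to follow, here is the same computation on a small invented table. The data and the variable names `toy`, `ads_per_display` and `size_share` are hypothetical; `value_counts(normalize=True)` is used as a shortcut for dividing by the total.

```python
import pandas as pd

# Invented data: five displays showing 3, 2, 2, 3 and 2 ads respectively.
toy = pd.DataFrame({
    "display_id": [1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5],
    "ad_id":      [10, 11, 12, 10, 13, 11, 12, 10, 11, 13, 12, 13],
})

# Step 1: number of ads shown in each display.
ads_per_display = toy.groupby("display_id")["ad_id"].count()

# Step 2: share of displays for each display size.
size_share = ads_per_display.value_counts(normalize=True).sort_index()
print(size_share)
```

Three of the five displays have two ads and two have three ads, so the shares come out as 0.6 and 0.4.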

    Bar chart of the distribution:

    plt.figure(figsize=(12,4))
    p = sns.color_palette()
    # Recent seaborn versions require x and y to be passed as keyword arguments
    sns.barplot(x=size_train.index, y=size_train.values, alpha=0.8, color=p[0], label='train')
    plt.legend()
    plt.xlabel('Number of Ads in display', fontsize=12)
    plt.ylabel('Proportion of set', fontsize=12)

    Count how many times each ad appears:

    # Either of the following two lines works
    df_train.groupby('ad_id')['ad_id'].count()
    df_train.groupby('ad_id').size()

    Overlap of ads between the training and test sets (the share of test-set ads that also appear in the training set):

    len(set(df_test.ad_id.unique()).intersection(df_train.ad_id.unique())) / len(df_test.ad_id.unique())
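The one-liner above can be wrapped in a small helper so the direction of the ratio (share of test ads seen in train, not the reverse) stays explicit. This is only a sketch: the function name `ad_overlap_ratio` and the sample ids are invented, and it accepts any iterable of ids, including a pandas column.

```python
def ad_overlap_ratio(test_ad_ids, train_ad_ids) -> float:
    """Fraction of distinct test ad ids that also occur among the train ad ids."""
    test_ads = set(test_ad_ids)
    train_ads = set(train_ad_ids)
    if not test_ads:
        raise ValueError("test_ad_ids is empty")
    return len(test_ads & train_ads) / len(test_ads)


# Invented ids: test has {10, 11, 12, 13}, train has {10, 11, 14}.
ratio = ad_overlap_ratio([10, 11, 12, 12, 13], [10, 10, 11, 14])
print(ratio)  # 0.5
```

With the real data this would be called as `ad_overlap_ratio(df_test.ad_id, df_train.ad_id)`. A low value would mean many test ads have no click history to learn from.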

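    Statistics on events.csv:

    # Path assumed to follow the same layout as the clicks files above
    events = pd.read_csv('./outbrain-click-prediction/events.csv')
    print (events.columns.to_list())
    print (events.head())
    print (events.platform.value_counts())
    # Cast to str so values stored with mixed types are counted together
    events.platform = events.platform.astype(str)
    print (events.platform.value_counts())
    
    print (events.groupby('uuid')['uuid'].count().sort_values()) # Number of events per user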
  • Original article: https://www.cnblogs.com/ljygoodgoodstudydaydayup/p/10456935.html