  • The "Magic Mirror Cup" (魔镜杯) Risk Control Algorithm Competition

    Competition Overview

    PPDai's "Magic Mirror" risk control system evaluates a user's current credit standing from an average of 400 data dimensions and assigns every borrower a credit score; on top of that, combining information from newly posted listings, it predicts each listing's 6-month overdue rate, giving investors a key basis for decisions and promoting healthy, efficient internet finance. For the first time, PPDai is opening up its rich, real historical data and inviting you to take on the "Magic Mirror" system: using machine learning, can you design a default prediction algorithm with better predictive accuracy and computational performance?

    Competition Rules

    Teams build a prediction model from the training set and use it to score the test set (a higher score means the loan is more likely to default).

    Evaluation metric: models are judged by AUC, the area under the ROC (Receiver Operating Characteristic) curve, which plots True Positive Rate against False Positive Rate.
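
    As a minimal illustration on toy values (not competition data), the metric can be computed with scikit-learn's roc_auc_score:

    from sklearn.metrics import roc_auc_score
    
    # toy labels and scores; a higher score should mean "more likely to default"
    y_true = [0, 0, 1, 0, 1]
    y_score = [0.1, 0.3, 0.7, 0.4, 0.9]
    print(roc_auc_score(y_true, y_score))  # 1.0: every default outranks every non-default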

    Competition Data

    The competition releases loan-risk data from China's online lending industry, including the credit default label (the dependent variable), the basic and derived fields needed for modeling (the independent variables), and raw web-behavior data for the relevant users. To protect borrower privacy and PPDai's intellectual property, the fields have been anonymized.

    The data is GBK-encoded. The preliminary round provides a 30,000-record training set and a 20,000-record test set. The final round adds 30,000 new records for teams to refine their models, plus 10,000 new records as a test set. Each training and test set consists of 3 csv files.

    Preliminary round data download link

    Final round data download link

    Data type description document download link

    Master

    Each row is one sample (a successfully funded loan); each sample has 200+ fields of various kinds.

    • Idx: unique key for each loan; matches the Idx in the other 2 files.
    • UserInfo_*: borrower feature fields
    • WeblogInfo_*: web behavior fields
    • Education_Info*: education fields
    • ThirdParty_Info_PeriodN_*: third-party data fields for time period N
    • SocialNetwork_*: social network fields
    • ListingInfo: loan listing (funding) date
    • Target: default label (1 = loan default, 0 = repaid on time). The test set does not contain the target field.

    Log_Info

    Borrowers' login records.

    • ListingInfo: loan listing (funding) date
    • LogInfo1: operation code
    • LogInfo2: operation category
    • LogInfo3: login date
    • Idx: unique key for each loan

    Userupdate_Info

    Borrowers' profile update records.

    • ListingInfo1: loan listing (funding) date
    • UserupdateInfo1: the modified field
    • UserupdateInfo2: modification date
    • Idx: unique key for each loan
    # Import packages
    import numpy as np
    import pandas as pd
    from pandas import Series, DataFrame
    
    import matplotlib.pyplot as plt
    %matplotlib inline
    
    import seaborn as sns
    sns.set(style='whitegrid')
    
    import arrow
    
    # Use the arrow lib to parse a date string into timestamp, year, month, day, ISO week,
    # weekday, and an early/mid/late month stage. These get one-hot encoded before modeling.
    def parse_date(date_str, str_format='YYYY/MM/DD'):
        d = arrow.get(date_str, str_format)
        # month stage: 1 = early (day 1-10), 2 = mid (11-20), 3 = late (21+)
        month_stage = int((d.day-1) / 10) + 1
        # note: `d.timestamp` as a property requires arrow < 1.0; newer arrow makes it a method
        return (d.timestamp, d.year, d.month, d.day, d.week, d.isoweekday(), month_stage)
    
    # Print the column names
    def show_cols(df):
        for c in df.columns:
            print(c)
    

    Load the data

    # note: `path` is defined but unused; the csv files are read from the working directory
    path = '/Training Set'
    # path = './PPD-First-Round-Data-Update/Training Set'
    train_master = pd.read_csv('PPD_Training_Master_GBK_3_1_Training_Set.csv', encoding='gbk')
    train_loginfo = pd.read_csv('PPD_LogInfo_3_1_Training_Set.csv', encoding='gbk')
    train_userinfo = pd.read_csv('PPD_Userupdate_Info_3_1_Training_Set.csv', encoding='gbk')
    

    Data Cleaning

    • Drop columns with a large share of missing values, e.g. more than 20% NaN
    • Drop rows with many missing values, keeping the dropped rows under 1% of the total
    • Fill the remaining NaNs: inspect value_counts to decide whether a feature is continuous or discrete, then fill with the mode or the mean. Judging by inspection, rather than by whether the dtype is object, is closer to the data's real nature
    # number of NULLs in each column of train_master
    null_sum = train_master.isnull().sum()
    # keep only the columns that have NULLs
    null_sum = null_sum[null_sum!=0]
    null_sum_df = DataFrame(null_sum, columns=['num'])
    # missing ratio (30000 training rows)
    null_sum_df['ratio'] = null_sum_df['num'] / 30000.0
    null_sum_df.sort_values(by='ratio', ascending=False, inplace=True)
    print(null_sum_df.head(10))
    
    # Drop the most heavily missing columns
    train_master.drop(['WeblogInfo_3', 'WeblogInfo_1', 'UserInfo_11', 'UserInfo_13', 'UserInfo_12', 'WeblogInfo_20'],
                      axis=1, inplace=True)
    
                     num     ratio
    WeblogInfo_3   29030  0.967667
    WeblogInfo_1   29030  0.967667
    UserInfo_11    18909  0.630300
    UserInfo_13    18909  0.630300
    UserInfo_12    18909  0.630300
    WeblogInfo_20   8050  0.268333
    WeblogInfo_21   3074  0.102467
    WeblogInfo_19   2963  0.098767
    WeblogInfo_2    1658  0.055267
    WeblogInfo_4    1651  0.055033
    
    # Drop rows with heavy missingness
    record_nan = train_master.isnull().sum(axis=1).sort_values(ascending=False)
    print(record_nan.head())
    # drop rows with >= 5 missing values
    drop_record_index = [i for i in record_nan.loc[(record_nan>=5)].index]
    # before the drop: (30000, 222)
    print('before train_master shape {}'.format(train_master.shape))
    train_master.drop(drop_record_index, inplace=True)
    # after the drop: (29189, 222)
    print('after train_master shape {}'.format(train_master.shape))
    # len(drop_record_index)
    
    29341    33
    18637    31
    17386    31
    29130    31
    29605    31
    dtype: int64
    before train_master shape (30000, 222)
    after train_master shape (29189, 222)
    
    # total number of NaN values
    print('before all nan num: {}'.format(train_master.isnull().sum().sum()))
    
    # where UserInfo_2 is null, set UserInfo_2 to the placeholder '位置地点'
    train_master.loc[train_master['UserInfo_2'].isnull(), 'UserInfo_2'] = '位置地点'
    # where UserInfo_4 is null, set UserInfo_4 to the placeholder '位置地点'
    train_master.loc[train_master['UserInfo_4'].isnull(), 'UserInfo_4'] = '位置地点'
    
    def fill_nan(f, method):
        if method == 'most':
            # fill with the most frequent value
            common_value = pd.value_counts(train_master[f], ascending=False).index[0]
        else:
            # fill with the mean
            common_value = train_master[f].mean()
        train_master.loc[train_master[f].isnull(), f] = common_value
    
    # choices below come from inspecting pd.value_counts(train_master[f])
    fill_nan('UserInfo_1', 'most')
    fill_nan('UserInfo_3', 'most')
    fill_nan('WeblogInfo_2', 'most')
    fill_nan('WeblogInfo_4', 'mean')
    fill_nan('WeblogInfo_5', 'mean')
    fill_nan('WeblogInfo_6', 'mean')
    fill_nan('WeblogInfo_19', 'most')
    fill_nan('WeblogInfo_21', 'most')
    
    print('after all nan num: {}'.format(train_master.isnull().sum().sum()))
    
    before all nan num: 0
    9725
    13478
    25688
    24185
    23997
    after all nan num: 0
    

    Feature Classification

    • For any feature whose most frequent value exceeds a threshold (50%), binarize the column: e.g. [0,1,2,0,0,0,4,0,3] becomes [0,1,1,0,0,0,1,0,1]
    • Split the remaining features by dtype into numerical and categorical
    • Move numerical features with no more than 10 unique values into the categorical group
    ratio_threshold = 0.5
    binarized_features = []
    binarized_features_most_freq_value = []
    
    # averaging the third-party features across periods was tried and performed poorly, so it was dropped
    # third_party_features = []
    
    # iterate over every column except target
    for f in train_master.columns:
        if f in ['target']:
            continue
        # number of non-null values
        not_null_sum = (train_master[f].notnull()).sum()
        # count of the most frequent value
        most_count = pd.value_counts(train_master[f], ascending=False).iloc[0]
        # the most frequent value itself
        most_value = pd.value_counts(train_master[f], ascending=False).index[0]
        # share of the most frequent value among non-null values
        ratio = most_count / not_null_sum
        # binarize the feature if that share exceeds the threshold
        if ratio > ratio_threshold:
            binarized_features.append(f)
            binarized_features_most_freq_value.append(most_value)
    
    # numerical features (non-object dtypes, excluding 'Idx', 'target' and the binarized features)
    numerical_features = [f for f in train_master.select_dtypes(exclude = ['object']).columns 
                          if f not in(['Idx', 'target']) and f not in binarized_features]
    
    # categorical features (object dtypes, excluding 'Idx', 'target' and the binarized features)
    categorical_features = [f for f in train_master.select_dtypes(include = ["object"]).columns 
                            if f not in(['Idx', 'target']) and f not in binarized_features]
    
    # binarize: prefix the name with b_, map the most frequent value to 0 and everything else to 1
    for i in range(len(binarized_features)):
        f = binarized_features[i]
        most_value = binarized_features_most_freq_value[i]
        train_master['b_' + f] = 1
        train_master.loc[train_master[f] == most_value, 'b_' + f] = 0
        train_master.drop([f], axis=1, inplace=True)
    
    feature_unique_count = []
    # for each numerical feature, count its distinct non-zero values
    for f in numerical_features:
        feature_unique_count.append((np.count_nonzero(train_master[f].unique()), f))
        
    # print(sorted(feature_unique_count))
    
    # move features with <= 10 distinct values over to the categorical group
    for c, f in feature_unique_count:
        if c <= 10:
            print('{} moved from numerical to categorical'.format(f))
            numerical_features.remove(f)
            categorical_features.append(f)
    
    [(60, 'WeblogInfo_4'), (59, 'WeblogInfo_6'), (167, 'WeblogInfo_7'), (64, 'WeblogInfo_16'), (103, 'WeblogInfo_17'), (38, 'UserInfo_18'), (273, 'ThirdParty_Info_Period1_1'), (252, 'ThirdParty_Info_Period1_2'), (959, 'ThirdParty_Info_Period1_3'), (916, 'ThirdParty_Info_Period1_4'), (387, 'ThirdParty_Info_Period1_5'), (329, 'ThirdParty_Info_Period1_6'), (1217, 'ThirdParty_Info_Period1_7'), (563, 'ThirdParty_Info_Period1_8'), (111, 'ThirdParty_Info_Period1_11'), (18784, 'ThirdParty_Info_Period1_13'), (17989, 'ThirdParty_Info_Period1_14'), (5073, 'ThirdParty_Info_Period1_15'), (20047, 'ThirdParty_Info_Period1_16'), (14785, 'ThirdParty_Info_Period1_17'), (336, 'ThirdParty_Info_Period2_1'), (298, 'ThirdParty_Info_Period2_2'), (1192, 'ThirdParty_Info_Period2_3'), (1149, 'ThirdParty_Info_Period2_4'), (450, 'ThirdParty_Info_Period2_5'), (431, 'ThirdParty_Info_Period2_6'), (1524, 'ThirdParty_Info_Period2_7'), (715, 'ThirdParty_Info_Period2_8'), (134, 'ThirdParty_Info_Period2_11'), (21685, 'ThirdParty_Info_Period2_13'), (20719, 'ThirdParty_Info_Period2_14'), (6582, 'ThirdParty_Info_Period2_15'), (22385, 'ThirdParty_Info_Period2_16'), (18554, 'ThirdParty_Info_Period2_17'), (339, 'ThirdParty_Info_Period3_1'), (293, 'ThirdParty_Info_Period3_2'), (1172, 'ThirdParty_Info_Period3_3'), (1168, 'ThirdParty_Info_Period3_4'), (453, 'ThirdParty_Info_Period3_5'), (428, 'ThirdParty_Info_Period3_6'), (1511, 'ThirdParty_Info_Period3_7'), (707, 'ThirdParty_Info_Period3_8'), (129, 'ThirdParty_Info_Period3_11'), (21521, 'ThirdParty_Info_Period3_13'), (20571, 'ThirdParty_Info_Period3_14'), (6569, 'ThirdParty_Info_Period3_15'), (22247, 'ThirdParty_Info_Period3_16'), (18311, 'ThirdParty_Info_Period3_17'), (324, 'ThirdParty_Info_Period4_1'), (295, 'ThirdParty_Info_Period4_2'), (1183, 'ThirdParty_Info_Period4_3'), (1143, 'ThirdParty_Info_Period4_4'), (447, 'ThirdParty_Info_Period4_5'), (422, 'ThirdParty_Info_Period4_6'), (1524, 'ThirdParty_Info_Period4_7'), (706, 'ThirdParty_Info_Period4_8'), (130, 'ThirdParty_Info_Period4_11'), (20894, 'ThirdParty_Info_Period4_13'), (20109, 'ThirdParty_Info_Period4_14'), (6469, 'ThirdParty_Info_Period4_15'), (21644, 'ThirdParty_Info_Period4_16'), (17849, 'ThirdParty_Info_Period4_17'), (322, 'ThirdParty_Info_Period5_1'), (284, 'ThirdParty_Info_Period5_2'), (1144, 'ThirdParty_Info_Period5_3'), (1119, 'ThirdParty_Info_Period5_4'), (436, 'ThirdParty_Info_Period5_5'), (401, 'ThirdParty_Info_Period5_6'), (1470, 'ThirdParty_Info_Period5_7'), (685, 'ThirdParty_Info_Period5_8'), (126, 'ThirdParty_Info_Period5_11'), (20010, 'ThirdParty_Info_Period5_13'), (19145, 'ThirdParty_Info_Period5_14'), (6033, 'ThirdParty_Info_Period5_15'), (20723, 'ThirdParty_Info_Period5_16'), (17149, 'ThirdParty_Info_Period5_17'), (312, 'ThirdParty_Info_Period6_1'), (265, 'ThirdParty_Info_Period6_2'), (1074, 'ThirdParty_Info_Period6_3'), (1046, 'ThirdParty_Info_Period6_4'), (414, 'ThirdParty_Info_Period6_5'), (363, 'ThirdParty_Info_Period6_6'), (1411, 'ThirdParty_Info_Period6_7'), (637, 'ThirdParty_Info_Period6_8'), (71, 'ThirdParty_Info_Period6_9'), (15, 'ThirdParty_Info_Period6_10'), (123, 'ThirdParty_Info_Period6_11'), (95, 'ThirdParty_Info_Period6_12'), (16605, 'ThirdParty_Info_Period6_13'), (16170, 'ThirdParty_Info_Period6_14'), (5188, 'ThirdParty_Info_Period6_15'), (17220, 'ThirdParty_Info_Period6_16'), (14553, 'ThirdParty_Info_Period6_17')]
    

    Feature Engineering

    numerical features

    • For every numerical feature, draw its distribution under each target value with a stripplot (with jitter): similar to a boxplot, but better for spotting large-value outliers
    • Draw density plots of all numerical features; they can all be brought closer to a normal distribution by taking the logarithm
    • After the log transform, a few tiny outliers can be removed as well
    melt = pd.melt(train_master, id_vars=['target'], value_vars = [f for f in numerical_features])
    print(melt.head(50))
    print(melt.shape)
    g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
    g.map(sns.stripplot, 'target', 'value', jitter=True, palette="muted")
    
        target      variable      value
    0        0  WeblogInfo_4   1.000000
    1        0  WeblogInfo_4   1.000000
    2        0  WeblogInfo_4   2.000000
    3        0  WeblogInfo_4   3.027468
    4        0  WeblogInfo_4   1.000000
    5        0  WeblogInfo_4   2.000000
    6        1  WeblogInfo_4  13.000000
    7        0  WeblogInfo_4  12.000000
    8        1  WeblogInfo_4  10.000000
    9        0  WeblogInfo_4   1.000000
    10       0  WeblogInfo_4   3.000000
    11       0  WeblogInfo_4   1.000000
    12       0  WeblogInfo_4  11.000000
    13       1  WeblogInfo_4   1.000000
    14       0  WeblogInfo_4   3.000000
    15       0  WeblogInfo_4   2.000000
    16       0  WeblogInfo_4   4.000000
    17       0  WeblogInfo_4   4.000000
    18       1  WeblogInfo_4   1.000000
    19       0  WeblogInfo_4   2.000000
    20       0  WeblogInfo_4   3.000000
    21       0  WeblogInfo_4   3.000000
    22       0  WeblogInfo_4   8.000000
    23       0  WeblogInfo_4   1.000000
    24       0  WeblogInfo_4   1.000000
    25       0  WeblogInfo_4   2.000000
    26       0  WeblogInfo_4   9.000000
    27       0  WeblogInfo_4   2.000000
    28       0  WeblogInfo_4   2.000000
    29       0  WeblogInfo_4   2.000000
    30       0  WeblogInfo_4   3.000000
    31       0  WeblogInfo_4   6.000000
    32       0  WeblogInfo_4   1.000000
    33       0  WeblogInfo_4   3.000000
    34       0  WeblogInfo_4   3.027468
    35       0  WeblogInfo_4   6.000000
    36       0  WeblogInfo_4   9.000000
    37       0  WeblogInfo_4   2.000000
    38       1  WeblogInfo_4   5.000000
    39       0  WeblogInfo_4   2.000000
    40       0  WeblogInfo_4   2.000000
    41       0  WeblogInfo_4   3.000000
    42       0  WeblogInfo_4   3.027468
    43       0  WeblogInfo_4  15.000000
    44       0  WeblogInfo_4   2.000000
    45       0  WeblogInfo_4   3.000000
    46       0  WeblogInfo_4   3.000000
    47       0  WeblogInfo_4   2.000000
    48       0  WeblogInfo_4   3.000000
    49       0  WeblogInfo_4   2.000000
    (2714577, 3)
    
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the stripplot function without specifying `order` is likely to produce an incorrect plot.
      warnings.warn(warning)
    
    
    
    
    
    <seaborn.axisgrid.FacetGrid at 0x4491c80860>
    

    [Figure: strip plots of each numerical feature, split by target]

    # From the seaborn plots, inspect how feature values distribute across positive and negative samples, then drop the outliers
    
    print('{} lines before drop'.format(train_master.shape[0]))
    
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_1 > 250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period6_2 > 400].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_2 > 250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period6_3 > 2000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_3 > 1250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period6_4 > 1500].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_4 > 1250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_5 > 400)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_7 > 2000)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_6 > 1500)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_5 > 1000) & (train_master.target == 0)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_8 > 1500)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_8 > 1000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_16 > 2000000) & (train_master.target == 0)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_14 > 1000000) & (train_master.target == 0)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_12 > 60)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_11 > 120) & (train_master.target == 0)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_11 > 20) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_13 > 200000)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_13 > 150000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_15 > 40000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_17 > 130000) & (train_master.target == 0)].index, inplace=True)
    
    
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_1 > 500].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_2 > 500].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_3 > 3000) & (train_master.target == 0)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_3 > 2000)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_5 > 500].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_4 > 2000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_6 > 700].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_6 > 300) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_7 > 4000)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_8 > 800)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_11 > 200)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_13 > 200000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_14 > 150000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_15 > 75000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_16 > 180000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period5_17 > 150000].index, inplace=True)
    
    # go above
    
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_1 > 400)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_2 > 350)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_3 > 1500)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_4 > 1600].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_4 > 1250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_5 > 500].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_6 > 800].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_6 > 400) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_8 > 1000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_13 > 250000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_14 > 200000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_15 > 70000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_16 > 210000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period4_17 > 160000].index, inplace=True)
    
    
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_1 > 400].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_2 > 380].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_3 > 1750].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_4 > 1750].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_4 > 1250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_5 > 600].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_6 > 800].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_6 > 400) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_7 > 1600) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_8 > 1000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_13 > 300000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_14 > 200000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_15 > 80000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_16 > 300000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period3_17 > 150000].index, inplace=True)
    
    
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_1 > 400].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_1 > 300) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_2 > 400].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_2 > 300) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_3 > 1800].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_3 > 1500) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_4 > 1500].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_5 > 580].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_6 > 800].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_6 > 400) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_7 > 2100].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_8 > 700) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_11 > 120].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_13 > 300000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_14 > 170000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_15 > 80000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_15 > 50000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_16 > 300000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period2_17 > 150000].index, inplace=True)
    
    
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_1 > 350].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_1 > 200) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_2 > 300].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_2 > 190) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_3 > 1500].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_4 > 1250].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_5 > 400].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_6 > 500].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_6 > 250) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_7 > 1800].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_8 > 720].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_8 > 600) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_11 > 100].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_13 > 200000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_13 > 140000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_14 > 150000].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_15 > 70000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_15 > 30000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_16 > 200000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_16 > 100000) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.ThirdParty_Info_Period1_17 > 100000].index, inplace=True)
    train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_17 > 80000) & (train_master.target == 1)].index, inplace=True)
    
    train_master.drop(train_master[train_master.WeblogInfo_4 > 40].index, inplace=True)
    train_master.drop(train_master[train_master.WeblogInfo_6 > 40].index, inplace=True)
    train_master.drop(train_master[train_master.WeblogInfo_7 > 150].index, inplace=True)
    train_master.drop(train_master[train_master.WeblogInfo_16 > 50].index, inplace=True)
    train_master.drop(train_master[(train_master.WeblogInfo_16 > 25) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.WeblogInfo_17 > 100].index, inplace=True)
    train_master.drop(train_master[(train_master.WeblogInfo_17 > 80) & (train_master.target == 1)].index, inplace=True)
    train_master.drop(train_master[train_master.UserInfo_18 < 10].index, inplace=True)
    
    print('{} lines after drop'.format(train_master.shape[0]))
    
    29189 lines before drop
    28074 lines after drop
    
    # melt = pd.melt(train_master, id_vars=['target'], value_vars = [f for f in numerical_features if f != 'Idx'])
    g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
    g.map(sns.distplot, "value")
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
      return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
    
    
    
    
    
    <seaborn.axisgrid.FacetGrid at 0x44984f02e8>
    

    [Figure: density plots of each numerical feature]

    # train_master_log = train_master.copy()
    numerical_features_log = [f for f in numerical_features if f not in ['Idx']]
    
    # log-transform the numerical features
    for f in numerical_features_log:
        train_master[f + '_log'] = np.log1p(train_master[f])
        train_master.drop([f], axis=1, inplace=True)
    
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\ipykernel_launcher.py:6: RuntimeWarning: divide by zero encountered in log1p
    
    from math import inf
    
    # log1p(-1) == -inf, so the divide-by-zero warning means some features contain -1
    (train_master == -inf).sum().sum()
    
    206845
    
    train_master.replace(-inf, -1, inplace=True)
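    
    A sketch of an alternative (my assumption, not part of the original pipeline): instead of the log loop above, shift each feature so its minimum is at least 0 before log1p, so -inf never appears:
    
    # hypothetical variant of the earlier loop: shift features whose minimum is negative
    # (e.g. -1 used as a sentinel) so np.log1p never receives values below 0
    for f in numerical_features_log:
        shift = min(train_master[f].min(), 0)
        train_master[f + '_log'] = np.log1p(train_master[f] - shift)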
    
    # density plots after the log transform; the distributions should now be closer to normal
    melt = pd.melt(train_master, id_vars=['target'], value_vars = [f+'_log' for f in numerical_features])
    g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
    g.map(sns.distplot, "value")
    
    <seaborn.axisgrid.FacetGrid at 0x44f45c2470>
    

    [Figure: density plots of the log-transformed features]

    # strip plots after the log transform, to check for outliers on the log scale
    g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
    g.map(sns.stripplot, 'target', 'value', jitter=True, palette="muted")
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the stripplot function without specifying `order` is likely to produce an incorrect plot.
      warnings.warn(warning)
    
    
    
    
    
    <seaborn.axisgrid.FacetGrid at 0x44e4270908>
    

    [Figure: strip plots of the log-transformed features, split by target]

    categorical features

    melt = pd.melt(train_master, id_vars=['target'], value_vars=[f for f in categorical_features])
    g = sns.FacetGrid(melt, col='variable', col_wrap=4, sharex=False, sharey=False)
    g.map(sns.countplot, 'value', palette="muted")
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the countplot function without specifying `order` is likely to produce an incorrect plot.
      warnings.warn(warning)
    
    
    
    
    
    <seaborn.axisgrid.FacetGrid at 0x44e4c3eac8>
    

    [Figure: count plots of each categorical feature]

    Correlation with the target

    target_corr = np.abs(train_master.corr()['target']).sort_values(ascending=False)
    target_corr
    
    target                            1.000000
    ThirdParty_Info_Period6_5_log     0.139606
    ThirdParty_Info_Period6_11_log    0.139083
    ThirdParty_Info_Period6_4_log     0.137962
    ThirdParty_Info_Period6_7_log     0.135729
    ThirdParty_Info_Period6_3_log     0.132310
    ThirdParty_Info_Period6_14_log    0.131138
    ThirdParty_Info_Period6_8_log     0.130577
    ThirdParty_Info_Period6_16_log    0.128451
    ThirdParty_Info_Period6_13_log    0.128013
    ThirdParty_Info_Period5_5_log     0.126701
    ThirdParty_Info_Period6_17_log    0.126456
    ThirdParty_Info_Period5_4_log     0.121786
    ThirdParty_Info_Period6_10_log    0.121729
    ThirdParty_Info_Period6_1_log     0.121112
    ThirdParty_Info_Period5_11_log    0.117162
    ThirdParty_Info_Period5_7_log     0.114794
    ThirdParty_Info_Period6_2_log     0.112041
    ThirdParty_Info_Period6_9_log     0.112039
    ThirdParty_Info_Period5_14_log    0.111374
    ThirdParty_Info_Period5_3_log     0.108039
    ThirdParty_Info_Period5_16_log    0.104786
    ThirdParty_Info_Period6_12_log    0.104733
    ThirdParty_Info_Period5_13_log    0.104688
    ThirdParty_Info_Period5_1_log     0.104191
    ThirdParty_Info_Period5_8_log     0.102859
    ThirdParty_Info_Period4_5_log     0.101329
    ThirdParty_Info_Period5_17_log    0.100960
    ThirdParty_Info_Period4_4_log     0.094715
    ThirdParty_Info_Period5_2_log     0.090261
                                        ...   
    ThirdParty_Info_Period4_15_log    0.004560
    b_ThirdParty_Info_Period4_12      0.004331
    b_WeblogInfo_13                   0.004090
    b_SocialNetwork_4                 0.003752
    b_SocialNetwork_3                 0.003752
    b_SocialNetwork_2                 0.003752
    b_SocialNetwork_16                0.003711
    b_SocialNetwork_6                 0.003701
    b_SocialNetwork_5                 0.003701
    b_WeblogInfo_44                   0.003542
    WeblogInfo_7_log                  0.003414
    b_WeblogInfo_32                   0.002961
    WeblogInfo_16_log                 0.002954
    b_ThirdParty_Info_Period2_12      0.002925
    b_WeblogInfo_29                   0.002550
    b_WeblogInfo_41                   0.002522
    ThirdParty_Info_Period4_6_log     0.002362
    b_WeblogInfo_11                   0.002257
    b_WeblogInfo_12                   0.002209
    b_WeblogInfo_8                    0.001922
    b_WeblogInfo_40                   0.001759
    b_WeblogInfo_36                   0.001554
    b_WeblogInfo_26                   0.001357
    ThirdParty_Info_Period1_3_log     0.000937
    b_WeblogInfo_31                   0.000896
    b_WeblogInfo_23                   0.000276
    ThirdParty_Info_Period1_8_log     0.000194
    b_WeblogInfo_38                   0.000077
    b_WeblogInfo_10                        NaN
    b_WeblogInfo_49                        NaN
    Name: target, Length: 215, dtype: float64
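    
    The NaN correlations at the bottom suggest b_WeblogInfo_10 and b_WeblogInfo_49 became constant after cleaning (zero variance makes Pearson correlation undefined). A minimal check of that assumption:
    
    # a constant column has exactly one unique value, hence zero variance and a NaN correlation
    for c in ['b_WeblogInfo_10', 'b_WeblogInfo_49']:
        print(c, train_master[c].nunique())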
    
    # at_home: guessing that UserInfo_2 and UserInfo_8 are the user's current city and registered hometown; equality suggests the user still lives at home
    train_master['at_home'] = np.where(train_master['UserInfo_2']==train_master['UserInfo_8'], 1, 0)
    train_master['at_home']
    
    0        1
    1        1
    2        1
    3        1
    4        1
    5        0
    6        0
    7        0
    9        0
    10       1
    11       1
    12       1
    13       1
    14       0
    15       1
    16       0
    17       1
    18       1
    19       1
    20       1
    21       0
    22       0
    23       0
    24       0
    25       1
    26       1
    27       1
    28       1
    29       0
    30       1
            ..
    29970    0
    29971    1
    29972    1
    29973    0
    29974    0
    29975    1
    29976    0
    29977    1
    29978    1
    29979    1
    29980    0
    29981    0
    29982    1
    29983    0
    29984    0
    29985    0
    29986    0
    29987    1
    29988    1
    29989    1
    29990    0
    29991    1
    29992    1
    29993    1
    29994    0
    29995    1
    29996    1
    29997    0
    29998    0
    29999    1
    Name: at_home, Length: 28074, dtype: int32
    
    train_master_ = train_master.copy()
    
    def parse_ListingInfo(date):
        d = parse_date(date, 'YYYY/M/D')
        return Series(d, 
                      index=['ListingInfo_timestamp', 'ListingInfo_year', 'ListingInfo_month',
                               'ListingInfo_day', 'ListingInfo_week', 'ListingInfo_isoweekday', 'ListingInfo_month_stage'], 
                      dtype=np.int32)
    
    ListingInfo_parsed = train_master_['ListingInfo'].apply(parse_ListingInfo)
    print('before train_master_ shape {}'.format(train_master_.shape))
    train_master_ = train_master_.merge(ListingInfo_parsed, how='left', left_index=True, right_index=True)
    print('after train_master_ shape {}'.format(train_master_.shape))
    
    before train_master_ shape (28074, 223)
    after train_master_ shape (28074, 230)
    

    train_loginfo: borrowers' login records

    • Group by Idx and extract: record count, distinct LogInfo1 count, number of active days, and the date span
    def loginfo_aggr(group):
        # number of records in the group
        loginfo_num = group.shape[0]
        # number of distinct operation codes
        loginfo_LogInfo1_unique_num = group['LogInfo1'].unique().shape[0]
        # number of distinct login dates (active days)
        loginfo_active_day_num = group['LogInfo3'].unique().shape[0]
        # earliest login date
        min_day = parse_date(np.min(group['LogInfo3']), str_format='YYYY-MM-DD')
        # latest login date
        max_day = parse_date(np.max(group['LogInfo3']), str_format='YYYY-MM-DD')
        # gap in days between the earliest and latest login
        gap_day = round((max_day[0] - min_day[0]) / 86400)
    
        indexes = {
            'loginfo_num': loginfo_num, 
            'loginfo_LogInfo1_unique_num': loginfo_LogInfo1_unique_num, 
            'loginfo_active_day_num': loginfo_active_day_num, 
            'loginfo_gap_day': gap_day, 
            'loginfo_last_day_timestamp': max_day[0]
        }
        
        # TODO every individual LogInfo1,LogInfo2 count
    
        def sub_aggr_loginfo(sub_group):
            return sub_group.shape[0]
    
        sub_group = group.groupby(by=['LogInfo1', 'LogInfo2']).apply(sub_aggr_loginfo)
        indexes['loginfo_LogInfo12_unique_num'] = sub_group.shape[0]
        return Series(data=[indexes[c] for c in indexes], index=[c for c in indexes])
        
    train_loginfo_grouped = train_loginfo.groupby(by=['Idx']).apply(loginfo_aggr)
    train_loginfo_grouped.head()
    
         loginfo_num  loginfo_LogInfo1_unique_num  loginfo_active_day_num  loginfo_gap_day  loginfo_last_day_timestamp  loginfo_LogInfo12_unique_num
    Idx
    3             26                            4                       8               63                  1383264000                             9
    5             11                            6                       4               13                  1383696000                             8
    8            125                            7                      13               12                  1383696000                            11
    12           199                            8                      11              328                  1383264000                            14
    16            15                            4                       7                8                  1383523200                             6
    train_loginfo_grouped.to_csv('train_loginfo_grouped.csv', header=True, index=True)
    
    train_loginfo_grouped = pd.read_csv('train_loginfo_grouped.csv')
    train_loginfo_grouped.head()
    
       Idx  loginfo_num  loginfo_LogInfo1_unique_num  loginfo_active_day_num  loginfo_gap_day  loginfo_last_day_timestamp  loginfo_LogInfo12_unique_num
    0    3           26                            4                       8               63                  1383264000                             9
    1    5           11                            6                       4               13                  1383696000                             8
    2    8          125                            7                      13               12                  1383696000                            11
    3   12          199                            8                      11              328                  1383264000                            14
    4   16           15                            4                       7                8                  1383523200                             6

    train_userinfo: borrowers' profile update records

    • Group by Idx and extract: record count, distinct UserupdateInfo1 count, distinct UserupdateInfo2 count, and the date span, plus a count for each individual kind of UserupdateInfo1 modification
    def userinfo_aggr(group):
        op_columns = ['_EducationId', '_HasBuyCar', '_LastUpdateDate',
           '_MarriageStatusId', '_MobilePhone', '_QQ', '_ResidenceAddress',
           '_ResidencePhone', '_ResidenceTypeId', '_ResidenceYears', '_age',
           '_educationId', '_gender', '_hasBuyCar', '_idNumber',
           '_lastUpdateDate', '_marriageStatusId', '_mobilePhone', '_qQ',
           '_realName', '_regStepId', '_residenceAddress', '_residencePhone',
           '_residenceTypeId', '_residenceYears', '_IsCash', '_CompanyPhone',
           '_IdNumber', '_Phone', '_RealName', '_CompanyName', '_Age',
           '_Gender', '_OtherWebShopType', '_turnover', '_WebShopTypeId',
           '_RelationshipId', '_CompanyAddress', '_Department',
           '_flag_UCtoBcp', '_flag_UCtoPVR', '_WorkYears', '_ByUserId',
           '_DormitoryPhone', '_IncomeFrom', '_CompanyTypeId',
           '_CompanySizeId', '_companyTypeId', '_department',
           '_companyAddress', '_workYears', '_contactId', '_creationDate',
           '_flag_UCtoBCP', '_orderId', '_phone', '_relationshipId', '_userId',
           '_companyName', '_companyPhone', '_isCash', '_BussinessAddress',
           '_webShopUrl', '_WebShopUrl', '_SchoolName', '_HasBusinessLicense',
           '_dormitoryPhone', '_incomeFrom', '_schoolName', '_NickName',
           '_CreationDate', '_CityId', '_DistrictId', '_ProvinceId',
           '_GraduateDate', '_GraduateSchool', '_IdAddress', '_companySizeId',
           '_HasPPDaiAccount', '_PhoneType', '_PPDaiAccount', '_SecondEmail',
           '_SecondMobile', '_nickName', '_HasSbOrGjj', '_Position']
    
        # number of records in the group
        userinfo_num = group.shape[0]
        # number of distinct modified fields
        userinfo_unique_num = group['UserupdateInfo1'].unique().shape[0]
        # number of distinct modification dates (active days)
        userinfo_active_day_num = group['UserupdateInfo2'].unique().shape[0]
        # earliest modification date
        min_day = parse_date(np.min(group['UserupdateInfo2']))
        # latest modification date
        max_day = parse_date(np.max(group['UserupdateInfo2']))
        # gap in days between the earliest and latest modification
        gap_day = round((max_day[0] - min_day[0]) / (86400))
    
        indexes = {
            'userinfo_num': userinfo_num, 
            'userinfo_unique_num': userinfo_unique_num, 
            'userinfo_active_day_num': userinfo_active_day_num, 
            'userinfo_gap_day': gap_day, 
            'userinfo_last_day_timestamp': max_day[0]
        }
        
        for c in op_columns:
            indexes['userinfo' + c + '_num'] = 0
    
        def sub_aggr(sub_group):
            return sub_group.shape[0]
    
        sub_group = group.groupby(by=['UserupdateInfo1']).apply(sub_aggr)
        for c in sub_group.index:
            indexes['userinfo' + c + '_num'] = sub_group.loc[c]
        return Series(data=[indexes[c] for c in indexes], index=[c for c in indexes])
        
    train_userinfo_grouped = train_userinfo.groupby(by=['Idx']).apply(userinfo_aggr)
    train_userinfo_grouped.head()
    
    userinfo_num userinfo_unique_num userinfo_active_day_num userinfo_gap_day userinfo_last_day_timestamp userinfo_EducationId_num userinfo_HasBuyCar_num userinfo_LastUpdateDate_num userinfo_MarriageStatusId_num userinfo_MobilePhone_num ... userinfo_IdAddress_num userinfo_companySizeId_num userinfo_HasPPDaiAccount_num userinfo_PhoneType_num userinfo_PPDaiAccount_num userinfo_SecondEmail_num userinfo_SecondMobile_num userinfo_nickName_num userinfo_HasSbOrGjj_num userinfo_Position_num
    Idx
    3 13 11 1 0 1377820800 1 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0
    5 13 11 1 0 1382572800 1 1 2 1 2 ... 0 0 0 0 0 0 0 0 0 0
    8 14 12 2 10 1383523200 1 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0
    12 14 14 2 298 1380672000 1 1 1 1 0 ... 0 0 0 0 0 0 0 0 0 0
    16 13 12 2 9 1383609600 1 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0

    5 rows × 91 columns

    train_userinfo_grouped.to_csv('train_userinfo_grouped.csv', header=True, index=True)
    
    train_userinfo_grouped = pd.read_csv('train_userinfo_grouped.csv')
    train_userinfo_grouped.head()
    
    Idx userinfo_num userinfo_unique_num userinfo_active_day_num userinfo_gap_day userinfo_last_day_timestamp userinfo_EducationId_num userinfo_HasBuyCar_num userinfo_LastUpdateDate_num userinfo_MarriageStatusId_num ... userinfo_IdAddress_num userinfo_companySizeId_num userinfo_HasPPDaiAccount_num userinfo_PhoneType_num userinfo_PPDaiAccount_num userinfo_SecondEmail_num userinfo_SecondMobile_num userinfo_nickName_num userinfo_HasSbOrGjj_num userinfo_Position_num
    0 3 13 11 1 0 1377820800 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
    1 5 13 11 1 0 1382572800 1 1 2 1 ... 0 0 0 0 0 0 0 0 0 0
    2 8 14 12 2 10 1383523200 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
    3 12 14 14 2 298 1380672000 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
    4 16 13 12 2 9 1383609600 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0

    5 rows × 92 columns

    print('before merge, train_master shape:{}'.format(train_master_.shape))
    
    # train_master_ = train_master_.merge(train_loginfo_grouped, how='left', left_on='Idx', right_index=True)
    # train_master_ = train_master_.merge(train_userinfo_grouped, how='left', left_on='Idx', right_index=True)
    
    train_master_ = train_master_.merge(train_loginfo_grouped, how='left', left_on='Idx', right_on='Idx')
    train_master_ = train_master_.merge(train_userinfo_grouped, how='left', left_on='Idx', right_on='Idx')
    
    # loans with no loginfo/userinfo records get 0 for all the aggregate features
    train_master_.fillna(0, inplace=True)
    
    print('after merge, train_master shape:{}'.format(train_master_.shape))
    
    before merge, train_master shape:(28074, 230)
    after merge, train_master shape:(28074, 327)
    

    one-hot encoding features

    Don't let get_dummies infer the columns to encode here: pandas would automatically pick only object-dtype columns, but some non-object features are categorical in meaning and need one-hot encoding as well.

    drop_columns = ['Idx', 'ListingInfo', 'UserInfo_20',  'UserInfo_19', 'UserInfo_8', 'UserInfo_7', 
                    'UserInfo_4','UserInfo_2',
                   'ListingInfo_timestamp', 'loginfo_last_day_timestamp', 'userinfo_last_day_timestamp']
    train_master_ = train_master_.drop(drop_columns, axis=1)
    
    dummy_columns = categorical_features.copy()
    dummy_columns.extend(['ListingInfo_year', 'ListingInfo_month', 'ListingInfo_day', 'ListingInfo_week', 
                          'ListingInfo_isoweekday', 'ListingInfo_month_stage'])
    finally_dummy_columns = []
    
    for c in dummy_columns:
        if c not in drop_columns:
            finally_dummy_columns.append(c)
    
    print('before get_dummies train_master_ shape {}'.format(train_master_.shape))
    train_master_ = pd.get_dummies(train_master_, columns=finally_dummy_columns)
    print('after get_dummies train_master_ shape {}'.format(train_master_.shape))
    
    before get_dummies train_master_ shape (28074, 316)
    after get_dummies train_master_ shape (28074, 444)
    

    normalized

    from sklearn.preprocessing import StandardScaler
    
    X_train = train_master_.drop(['target'], axis=1)
    X_train = StandardScaler().fit_transform(X_train)
    y_train = train_master_['target']
    print(X_train.shape, y_train.shape)
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\preprocessing\data.py:617: DataConversionWarning: Data with input dtype uint8, int32, int64, float64 were all converted to float64 by StandardScaler.
      return self.partial_fit(X, y)
    
    
    (28074, 443) (28074,)
    
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype uint8, int32, int64, float64 were all converted to float64 by StandardScaler.
      return self.fit(X, **fit_params).transform(X)
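    
    One caveat worth noting (my note, not from the original): for a real submission, the scaler fitted on the training data should be reused on the test set rather than fitting a fresh one:
    
    scaler = StandardScaler().fit(train_master_.drop(['target'], axis=1))
    X_train = scaler.transform(train_master_.drop(['target'], axis=1))
    # X_test = scaler.transform(test_features)  # `test_features`: hypothetical test-set frame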
    
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import StratifiedKFold
    # from scikitplot import plotters as skplt
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.linear_model import RidgeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from xgboost import XGBClassifier
    from sklearn.svm import SVC, LinearSVC
    
    # StratifiedKFold keeps the target distribution consistent across folds, with shuffling
    cv = StratifiedKFold(n_splits=3, shuffle=True)
    
    # compute auc, accuracy and recall via cross-validation
    def estimate(estimator, name='estimator'):
        auc = cross_val_score(estimator, X_train, y_train, scoring='roc_auc', cv=cv).mean()
        accuracy = cross_val_score(estimator, X_train, y_train, scoring='accuracy', cv=cv).mean()
        recall = cross_val_score(estimator, X_train, y_train, scoring='recall', cv=cv).mean()
    
        print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))
    
    #     skplt.plot_learning_curve(estimator, X_train, y_train)
    #     plt.show()
    
    #     estimator.fit(X_train, y_train)
    #     y_probas = estimator.predict_proba(X_train)
    #     skplt.plot_roc_curve(y_true=y_train, y_probas=y_probas)
    #     plt.show()
    
    estimate(XGBClassifier(learning_rate=0.1, n_estimators=20, objective='binary:logistic'), 'XGBClassifier')
    estimate(RidgeClassifier(), 'RidgeClassifier')
    estimate(LogisticRegression(), 'LogisticRegression')
    # estimate(RandomForestClassifier(), 'RandomForestClassifier')
    estimate(AdaBoostClassifier(), 'AdaBoostClassifier')
    # estimate(SVC(), 'SVC')# too long to wait
    # estimate(LinearSVC(), 'LinearSVC')
    
    # XGBClassifier: auc:0.747668, recall:0.000000, accuracy:0.944575
    # RidgeClassifier: auc:0.754218, recall:0.000000, accuracy:0.944433
    # LogisticRegression: auc:0.758454, recall:0.015424, accuracy:0.942010
    # AdaBoostClassifier: auc:0.784086, recall:0.013495, accuracy:0.943791
    
    XGBClassifier: auc:0.755890, recall:0.000000, accuracy:0.944575
    RidgeClassifier: auc:0.753939, recall:0.000000, accuracy:0.944575
    
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
      FutureWarning)
    
    
    LogisticRegression: auc:0.759646, recall:0.022494, accuracy:0.942438
    AdaBoostClassifier: auc:0.792333, recall:0.017988, accuracy:0.943827
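    
    The near-zero recall is expected with a heavily imbalanced target (roughly 5.5% defaults, so an all-negative classifier already reaches ~0.944 accuracy): at the default 0.5 threshold these models almost never predict default. A sketch of one common mitigation, class weighting (my addition, not in the original run):
    
    # class_weight='balanced' reweights samples inversely to class frequency
    estimate(LogisticRegression(solver='lbfgs', max_iter=1000, class_weight='balanced'),
             'LogisticRegression balanced')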
    

    VotingClassifier

    from sklearn.ensemble import VotingClassifier
    
    estimators = []
    # estimators.append(('RidgeClassifier', RidgeClassifier()))
    estimators.append(('LogisticRegression', LogisticRegression()))
    estimators.append(('XGBClassifier', XGBClassifier(learning_rate=0.1, n_estimators=20, objective='binary:logistic')))
    estimators.append(('AdaBoostClassifier', AdaBoostClassifier()))
    # estimators.append(('RandomForestClassifier', RandomForestClassifier()))
    
    #voting: auc:0.794587, recall:0.000642, accuracy:0.944433
    
    voting = VotingClassifier(estimators = estimators, voting='soft')
    estimate(voting, 'voting')
    
    E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
      FutureWarning)
    
    
    voting: auc:0.790281, recall:0.000642, accuracy:0.944361
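    
    Soft voting averages the classifiers' predict_proba outputs, which is why RidgeClassifier (no predict_proba) is left commented out above. A hedged variant that weights the strongest single model, AdaBoost, more heavily:
    
    # weights multiply each estimator's probabilities in the average; AdaBoost counts double here
    voting_weighted = VotingClassifier(estimators=estimators, voting='soft', weights=[1, 1, 2])
    estimate(voting_weighted, 'voting weighted')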