  • Kaggle-Feature Engineering(1)

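    Learning feature engineering helps us find the most effective ways to improve a model.

    In this course you will learn a practical approach to feature engineering, one you can apply to Kaggle competitions and other machine learning applications.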

    Baseline model

    First, build a simple baseline model that works.

    Load the data

    We will work with data from Kickstarter projects. The first few rows of the data look like this:


      | ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd pledged | usd_pledged_real | usd_goal_real
    0 | 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000.0 | 2015-08-11 12:12:28 | 0.0 | failed | 0 | GB | 0.0 | 0.0 | 1533.95
    1 | 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000.0 | 2017-09-02 04:43:57 | 2421.0 | failed | 15 | US | 100.0 | 2421.0 | 30000.00
    2 | 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000.0 | 2013-01-12 00:20:50 | 220.0 | failed | 3 | US | 220.0 | 220.0 | 45000.00
    3 | 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000.0 | 2012-03-17 03:24:11 | 1.0 | failed | 1 | US | 1.0 | 1.0 | 5000.00
    4 | 1000011046 | Community Film Project: The Art of Neighborhoo... | Film & Video | Film & Video | USD | 2015-08-29 | 19500.0 | 2015-07-04 08:35:03 | 1283.0 | canceled | 14 | US | 1283.0 | 1283.0 | 19500.00
    5 | 1000014025 | Monarch Espresso Bar | Restaurants | Food | USD | 2016-04-01 | 50000.0 | 2016-02-26 13:38:27 | 52375.0 | successful | 224 | US | 52375.0 | 52375.0 | 50000.00
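
    The post never shows how the `ks` DataFrame is created. The sketch below is one way it could be loaded, assuming pandas and a CSV source. The in-memory CSV stands in for the real file, whose path is not given here, and it holds just two rows from the preview. The point is `parse_dates`, which turns `deadline` and `launched` into datetime columns.

```python
import io
import pandas as pd

# Two rows copied from the preview above, in CSV form. In a real notebook the
# data would come from a CSV file on disk; that path is not given in this post.
sample_csv = io.StringIO(
    "ID,name,category,main_category,currency,deadline,goal,launched,"
    "pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real\n"
    "1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,"
    "2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95\n"
    "1000014025,Monarch Espresso Bar,Restaurants,Food,USD,"
    "2016-04-01,50000.0,2016-02-26 13:38:27,52375.0,successful,224,US,"
    "52375.0,52375.0,50000.00\n"
)

# parse_dates makes `deadline` and `launched` datetime columns, which is what
# allows the `.dt` accessor to be used on them later.
ks = pd.read_csv(sample_csv, parse_dates=["deadline", "launched"])
print(ks.dtypes[["deadline", "launched"]])
```

    Without `parse_dates`, both columns would load as plain strings and `ks.launched.dt.hour` would raise an error.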

    The state column shows the outcome of each project.

    print('Unique values in `state` column:', list(ks.state.unique()))

    Output:

    Unique values in `state` column: ['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended']

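    Using this data, how can we use features such as project category, currency, funding goal, and country to predict whether a Kickstarter project will succeed?

    Preparing the target column

    First, we convert the state column into a target we can use in a model. Data cleaning is not the focus here, so we simplify this example by:
    1. Dropping projects that are "live"
    2. Counting "successful" states as outcome = 1
    3. Combining every other state as outcome = 0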

    # Drop live projects
    ks = ks.query('state != "live"')
    
    # Add outcome column, "successful" == 1, others are 0
    ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))

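    Converting timestamps

    Next, we convert the launched feature into categorical features we can use in a model. Since we loaded the column as timestamp data, we can access date and time values through the .dt attribute on the timestamp column.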

    ks = ks.assign(hour=ks.launched.dt.hour,
                   day=ks.launched.dt.day,
                   month=ks.launched.dt.month,
                   year=ks.launched.dt.year)

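    Preparing categorical variables

    Now for the categorical variables (category, currency, and country), we need to convert them into integers so our model can use the data. For this we use scikit-learn's LabelEncoder, which assigns an integer to each value of a categorical feature.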

    from sklearn.preprocessing import LabelEncoder
    
    cat_features = ['category', 'currency', 'country']
    encoder = LabelEncoder()
    
    # Apply the label encoder to each column
    encoded = ks[cat_features].apply(encoder.fit_transform)
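
    To make the mapping concrete, here is a minimal plain-Python sketch of what label encoding does. `label_encode` is a hypothetical helper written for illustration, not scikit-learn's implementation. It mirrors LabelEncoder's behaviour of numbering the distinct values in sorted order.

```python
def label_encode(values):
    """Map each distinct value to an integer, numbering distinct values in sorted order."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], classes


currencies = ["GBP", "USD", "USD", "USD", "EUR"]
codes, classes = label_encode(currencies)
print(classes)  # ['EUR', 'GBP', 'USD']
print(codes)    # [1, 2, 2, 2, 0]
```

    The integers carry no meaning beyond identity: "USD" is not "greater" than "GBP". Also note that in the code above the single `encoder` object is re-fit on each column in turn, so afterwards it only remembers the classes of the last column.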

    We collect all of these features in a new DataFrame that we can use to train a model.

    # ks and encoded share the same index, so they can be joined directly
    data = ks[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)
    data.head()

    Output:


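      | goal    | hour | day | month | year | outcome | category | currency | country
    0 | 1000.0  | 12   | 11  | 8     | 2015 | 0       | 108      | 5        | 9
    1 | 30000.0 | 4    | 2   | 9     | 2017 | 0       | 93       | 13       | 22
    2 | 45000.0 | 0    | 12  | 1     | 2013 | 0       | 93       | 13       | 22
    3 | 5000.0  | 3    | 17  | 3     | 2012 | 0       | 90       | 13       | 22
    4 | 19500.0 | 8    | 4   | 7     | 2015 | 0       | 55       | 13       | 22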

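    Creating training, validation, and test splits

    We need to create datasets for training, validation, and testing. We use a fairly simple approach and split the data using slices: 10% of the data as a validation set, 10% for testing, and the remaining 80% for training.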

    valid_fraction = 0.1
    valid_size = int(len(data) * valid_fraction)
    
    train = data[:-2 * valid_size]
    valid = data[-2 * valid_size:-valid_size]
    test = data[-valid_size:]
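
    The slice arithmetic is easy to check on a plain list. The sketch below wraps the same three slices in a hypothetical helper, `split_by_position`, and adds a guard for one edge case the original code does not handle: if `valid_size` rounds down to 0, `rows[:-0]` is empty and the training set silently disappears.

```python
def split_by_position(rows, valid_fraction=0.1):
    """Slice rows into train/valid/test, taking valid and test from the end."""
    valid_size = int(len(rows) * valid_fraction)
    if valid_size == 0:
        # rows[:-0] is rows[:0], i.e. empty, so negative slicing would
        # silently produce an empty training set on very small inputs.
        raise ValueError("not enough rows for the requested validation fraction")
    train = rows[:-2 * valid_size]
    valid = rows[-2 * valid_size:-valid_size]
    test = rows[-valid_size:]
    return train, valid, test


train, valid, test = split_by_position(list(range(1000)))
print(len(train), len(valid), len(test))  # 800 100 100
```

    The same slices work on a DataFrame because positional slicing with `[]` selects rows.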

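    Training a model

    In this course we use a LightGBM model. It is a tree-based model that typically delivers very strong performance, even compared with XGBoost, and it is relatively fast to train.
    We won't do hyperparameter optimization because that isn't the goal of this course, so our models won't be the absolute best performance you can get. You will still see model performance improve as we do feature engineering.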

    import lightgbm as lgb
    
    feature_cols = train.columns.drop('outcome')
    
    dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])
    
    param = {'num_leaves': 64, 'objective': 'binary'}
    param['metric'] = 'auc'
    num_round = 1000
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], early_stopping_rounds=10, verbose_eval=False)
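    verbose_eval: how often, in iterations, to print evaluation results (False disables printing)
    early_stopping_rounds: stop training if the validation score has not improved for this many rounds
    Note: LightGBM 4.0 removed both arguments from lgb.train(). On newer versions, pass callbacks such as lgb.early_stopping(10) and lgb.log_evaluation(...) through the callbacks argument instead.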
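
    The early stopping rule itself is simple enough to sketch in plain Python. The helper below is a hypothetical illustration of the rule, not LightGBM's code, and the AUC values are made up. It assumes a metric where higher is better.

```python
def early_stopping(scores, patience):
    """Return (best_round, best_score, rounds_run) for a higher-is-better metric.

    Rounds are numbered from 1. Training stops once `patience` consecutive
    rounds have failed to beat the best score seen so far.
    """
    best_round, best_score = 0, float("-inf")
    for round_number, score in enumerate(scores, start=1):
        if score > best_score:
            best_round, best_score = round_number, score
        elif round_number - best_round >= patience:
            return best_round, best_score, round_number
    return best_round, best_score, len(scores)


# Hypothetical validation AUC values, one per boosting round.
valid_auc = [0.70, 0.72, 0.74, 0.73, 0.74, 0.735]
print(early_stopping(valid_auc, patience=3))  # (3, 0.74, 6)
```

    The training log later in this post shows the same pattern with a patience of 10: the best iteration is 208 and training stops at round 218.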

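    Making predictions and evaluating the model

    Finally, let's make predictions on the test set with the model and see how well it performs. An important thing to remember is that you can overfit to the validation data. This is why we need a test set that the model never sees until the final evaluation.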

    from sklearn import metrics
    ypred = bst.predict(test[feature_cols])
    score = metrics.roc_auc_score(test['outcome'], ypred)
    
    print(f"Test AUC score: {score}")
    Test AUC score: 0.747615303004287
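
    For readers unfamiliar with the metric, ROC AUC can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way in plain Python. `roc_auc` is a hypothetical helper for illustration, and the quadratic pairwise loop is only practical for small inputs.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

    A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives some context for the 0.7476 above.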

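    Exercise

    In the exercises you will work with data from the TalkingData AdTracking competition. The goal of the competition is to predict whether a user will download an app after clicking an ad.
    For this course you will use a small sample of the data, with 99% of the negative records (clicks where the app was not downloaded) dropped so that the target is more balanced.
    After building a baseline model, you will be able to see how your feature engineering and selection efforts improve the model's performance.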
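
    The sample file used below is already downsampled, so this step is not part of the exercise. As a rough sketch of what dropping 99% of the negative records could look like, under the assumption of simple random sampling (the post does not say how the sample was actually produced):

```python
import random


def downsample_negatives(labels, keep_fraction, seed=0):
    """Return indices of all positive rows plus a random subset of the negative rows."""
    rng = random.Random(seed)
    kept = []
    for index, label in enumerate(labels):
        if label == 1 or rng.random() < keep_fraction:
            kept.append(index)
    return kept


# Hypothetical click log: 20 downloads among 10,000 clicks.
labels = [1] * 20 + [0] * 9980
kept = downsample_negatives(labels, keep_fraction=0.01)
positives = sum(labels[i] for i in kept)
print(f"kept {len(kept)} rows, {positives} of them positive")
```

    Every positive row survives while roughly one negative in a hundred does, so the share of positives in the kept rows is far higher than in the raw log.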

    Baseline Model

    The first thing you will do is construct a baseline model. We begin by taking a look at the data.

    import pandas as pd
    
    click_data = pd.read_csv('../input/feature-engineering-data/train_sample.csv',
                             parse_dates=['click_time'])
    click_data.head()

      | ip     | app | device | os | channel | click_time          | attributed_time     | is_attributed
    0 | 89489  | 3   | 1      | 13 | 379     | 2017-11-06 15:13:23 | NaN                 | 0
    1 | 204158 | 35  | 1      | 13 | 21      | 2017-11-06 15:41:07 | 2017-11-07 08:17:19 | 1
    2 | 3437   | 6   | 1      | 13 | 459     | 2017-11-06 15:42:32 | NaN                 | 0
    3 | 167543 | 3   | 1      | 13 | 379     | 2017-11-06 15:56:17 | NaN                 | 0
    4 | 147509 | 3   | 1      | 13 | 379     | 2017-11-06 15:57:01 | NaN                 | 0

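    1) Construct features from timestamps

    Notice that the click_time column of the click_data DataFrame contains timestamp data. Use this column to create features for the corresponding day, hour, minute, and second. Store them as new integer columns day, hour, minute, and second in a new DataFrame called clicks.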

    # Add new columns for timestamp features day, hour, minute, and second
    clicks = click_data.copy()
    clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
    # Remaining timestamp features; all of these values fit in uint8 (0-255)
    clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
    clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
    clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

    # Check your answer (q_1 is defined by the Kaggle exercise notebook, not shown here)
    q_1.check()

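    2) Label encoding

    For each of the categorical features ['ip', 'app', 'device', 'os', 'channel'], use scikit-learn's LabelEncoder to create a new feature in the clicks DataFrame. The new column names should be the original column name with '_labels' appended, for example ip_labels.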

    from sklearn import preprocessing
    
    cat_features = ['ip', 'app', 'device', 'os', 'channel']
    encoder = preprocessing.LabelEncoder()
    
    # Create new columns in clicks using preprocessing.LabelEncoder()
    for feature in cat_features:
        # fit_transform re-fits the encoder on each column in turn
        encoded = encoder.fit_transform(clicks[feature])
        clicks[feature + '_labels'] = encoded

    # Check your answer (q_2 is defined by the Kaggle exercise notebook, not shown here)
    q_2.check()

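    At this point you can inspect the processed table.

    3) One-hot encoding

    In the code cell above you used label-encoded features. Would it have been okay to use one-hot encoding instead for the categorical variables 'ip', 'app', 'device', 'os', or 'channel'?

    The ip column has 58,000 values, which means one-hot encoding would create an extremely sparse matrix with 58,000 columns. That many columns would make model training very slow, so in general you want to avoid one-hot encoding features with many levels. LightGBM models work with label-encoded features, so you don't actually need to one-hot encode the categorical features.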
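
    The column blow-up is easy to see on a toy example. The sketch below is a plain-Python illustration with made-up device values. `one_hot` is a hypothetical helper, not a library call.

```python
def one_hot(values):
    """Return (column_names, rows) with one 0/1 column per distinct value."""
    columns = sorted(set(values))
    position = {value: i for i, value in enumerate(columns)}
    rows = []
    for value in values:
        row = [0] * len(columns)
        row[position[value]] = 1
        rows.append(row)
    return columns, rows


devices = [1, 1, 2, 1, 0]
columns, rows = one_hot(devices)
print(columns)  # [0, 1, 2]
print(rows[2])  # [0, 0, 1]

# One-hot width equals the number of distinct values; a label-encoded
# feature always occupies a single column.
print(len(columns))  # 3
```

    With 58,000 distinct ip values the same function would produce 58,000 columns per row, almost all of them zero.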

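    Train, validation, and test sets

    With the baseline features ready, we need to split the data into training and validation sets. We should also hold out a test set to measure the final accuracy of the model.

    4) Train/test splits with time series data

    This is time series data. Are there any special considerations when creating train/test splits for time series? If so, what are they?

    Since our model is meant to predict events in the future, we must also validate the model on events in the future. If the data is mixed up between the training and test sets, then future data will leak into the model and our validation results will overestimate the performance on new data.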
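
    The leak can be shown with nothing more than a list of timestamps. In the sketch below the click times are hypothetical integers standing in for `click_time`, and `trains_on_future` is an illustrative helper. It checks whether any training row is later than the earliest test row.

```python
import random

# Hypothetical click times, already in chronological order.
click_times = list(range(100))

# Time-ordered split: the last 20% of the timeline is held out.
cutoff = int(len(click_times) * 0.8)
train_ordered, test_ordered = click_times[:cutoff], click_times[cutoff:]

# Shuffled split: same sizes, but rows are assigned at random.
shuffled = click_times[:]
random.Random(0).shuffle(shuffled)
train_shuffled, test_shuffled = shuffled[:cutoff], shuffled[cutoff:]


def trains_on_future(train, test):
    """True if any training row is later than the earliest test row."""
    return max(train) > min(test)


print(trains_on_future(train_ordered, test_ordered))    # False
print(trains_on_future(train_shuffled, test_shuffled))  # True
```

    In the shuffled split the model sees clicks that happened after some of the clicks it is evaluated on, which is the leak described above.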

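    Create train/validation/test splits

    Here we create the training, validation, and test splits. First, the clicks DataFrame is sorted in order of increasing time. The first 80% of the rows are the training set, the next 10% are the validation set, and the last 10% are the test set. k-fold cross-validation cannot be used here, for the reason given above.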

    feature_cols = ['day', 'hour', 'minute', 'second', 
                    'ip_labels', 'app_labels', 'device_labels',
                    'os_labels', 'channel_labels']
    
    valid_fraction = 0.1
    clicks_srt = clicks.sort_values('click_time')
    valid_rows = int(len(clicks_srt) * valid_fraction)
    train = clicks_srt[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = clicks_srt[-valid_rows * 2:-valid_rows]
    test = clicks_srt[-valid_rows:]

    Training with LightGBM

    Now we can create LightGBM Dataset objects for each of the smaller datasets and train the baseline model.

    import lightgbm as lgb
    
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary'}
    param['metric'] = 'auc'
    num_round = 1000
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], early_stopping_rounds=10)

    Output:

    [1]    valid_0's auc: 0.948979
    Training until validation scores don't improve for 10 rounds.
    [2]    valid_0's auc: 0.949235
    [3]    valid_0's auc: 0.950126
    [4]    valid_0's auc: 0.950072
    [5]    valid_0's auc: 0.950536
    [6]    valid_0's auc: 0.950943
    [7]    valid_0's auc: 0.951453
    [8]    valid_0's auc: 0.951518
    [9]    valid_0's auc: 0.952385
    [10]    valid_0's auc: 0.952434
    [11]    valid_0's auc: 0.952465
    [12]    valid_0's auc: 0.952638
    [13]    valid_0's auc: 0.95266
    [14]    valid_0's auc: 0.952766
    [15]    valid_0's auc: 0.953203
    [16]    valid_0's auc: 0.953503
    [17]    valid_0's auc: 0.953793
    [18]    valid_0's auc: 0.953966
    [19]    valid_0's auc: 0.954184
    [20]    valid_0's auc: 0.9543
    [21]    valid_0's auc: 0.954305
    [22]    valid_0's auc: 0.954536
    [23]    valid_0's auc: 0.954748
    [24]    valid_0's auc: 0.955142
    [25]    valid_0's auc: 0.955493
    [26]    valid_0's auc: 0.955611
    [27]    valid_0's auc: 0.955708
    [28]    valid_0's auc: 0.955795
    [29]    valid_0's auc: 0.956172
    [30]    valid_0's auc: 0.95623
    [31]    valid_0's auc: 0.956477
    [32]    valid_0's auc: 0.956606
    [33]    valid_0's auc: 0.956864
    [34]    valid_0's auc: 0.957204
    [35]    valid_0's auc: 0.957327
    [36]    valid_0's auc: 0.957408
    [37]    valid_0's auc: 0.957524
    [38]    valid_0's auc: 0.957659
    [39]    valid_0's auc: 0.957846
    [40]    valid_0's auc: 0.958042
    [41]    valid_0's auc: 0.958146
    [42]    valid_0's auc: 0.958181
    [43]    valid_0's auc: 0.958285
    [44]    valid_0's auc: 0.958433
    [45]    valid_0's auc: 0.95854
    [46]    valid_0's auc: 0.958625
    [47]    valid_0's auc: 0.958756
    [48]    valid_0's auc: 0.958863
    [49]    valid_0's auc: 0.958938
    [50]    valid_0's auc: 0.959046
    [51]    valid_0's auc: 0.95908
    [52]    valid_0's auc: 0.959147
    [53]    valid_0's auc: 0.9592
    [54]    valid_0's auc: 0.959259
    [55]    valid_0's auc: 0.959311
    [56]    valid_0's auc: 0.959324
    [57]    valid_0's auc: 0.959348
    [58]    valid_0's auc: 0.959435
    [59]    valid_0's auc: 0.959463
    [60]    valid_0's auc: 0.95949
    [61]    valid_0's auc: 0.959562
    [62]    valid_0's auc: 0.959721
    [63]    valid_0's auc: 0.959729
    [64]    valid_0's auc: 0.959773
    [65]    valid_0's auc: 0.959809
    [66]    valid_0's auc: 0.959868
    [67]    valid_0's auc: 0.959921
    [68]    valid_0's auc: 0.959994
    [69]    valid_0's auc: 0.960065
    [70]    valid_0's auc: 0.96011
    [71]    valid_0's auc: 0.960133
    [72]    valid_0's auc: 0.960275
    [73]    valid_0's auc: 0.960299
    [74]    valid_0's auc: 0.960336
    [75]    valid_0's auc: 0.960365
    [76]    valid_0's auc: 0.960411
    [77]    valid_0's auc: 0.960488
    [78]    valid_0's auc: 0.960523
    [79]    valid_0's auc: 0.960563
    [80]    valid_0's auc: 0.960624
    [81]    valid_0's auc: 0.960665
    [82]    valid_0's auc: 0.960724
    [83]    valid_0's auc: 0.960724
    [84]    valid_0's auc: 0.960751
    [85]    valid_0's auc: 0.960799
    [86]    valid_0's auc: 0.960853
    [87]    valid_0's auc: 0.960876
    [88]    valid_0's auc: 0.960934
    [89]    valid_0's auc: 0.961012
    [90]    valid_0's auc: 0.961012
    [91]    valid_0's auc: 0.961065
    [92]    valid_0's auc: 0.961095
    [93]    valid_0's auc: 0.961131
    [94]    valid_0's auc: 0.961136
    [95]    valid_0's auc: 0.961155
    [96]    valid_0's auc: 0.961191
    [97]    valid_0's auc: 0.961189
    [98]    valid_0's auc: 0.961189
    [99]    valid_0's auc: 0.961224
    [100]    valid_0's auc: 0.961228
    [101]    valid_0's auc: 0.96125
    [102]    valid_0's auc: 0.961259
    [103]    valid_0's auc: 0.961289
    [104]    valid_0's auc: 0.961309
    [105]    valid_0's auc: 0.961309
    [106]    valid_0's auc: 0.96134
    [107]    valid_0's auc: 0.961373
    [108]    valid_0's auc: 0.961382
    [109]    valid_0's auc: 0.961391
    [110]    valid_0's auc: 0.961402
    [111]    valid_0's auc: 0.961449
    [112]    valid_0's auc: 0.96145
    [113]    valid_0's auc: 0.961482
    [114]    valid_0's auc: 0.961481
    [115]    valid_0's auc: 0.961492
    [116]    valid_0's auc: 0.961513
    [117]    valid_0's auc: 0.961531
    [118]    valid_0's auc: 0.961539
    [119]    valid_0's auc: 0.961563
    [120]    valid_0's auc: 0.961563
    [121]    valid_0's auc: 0.961568
    [122]    valid_0's auc: 0.961588
    [123]    valid_0's auc: 0.961599
    [124]    valid_0's auc: 0.961605
    [125]    valid_0's auc: 0.961605
    [126]    valid_0's auc: 0.96161
    [127]    valid_0's auc: 0.961626
    [128]    valid_0's auc: 0.961626
    [129]    valid_0's auc: 0.96163
    [130]    valid_0's auc: 0.961646
    [131]    valid_0's auc: 0.961678
    [132]    valid_0's auc: 0.961672
    [133]    valid_0's auc: 0.961673
    [134]    valid_0's auc: 0.96171
    [135]    valid_0's auc: 0.96171
    [136]    valid_0's auc: 0.961724
    [137]    valid_0's auc: 0.961723
    [138]    valid_0's auc: 0.961726
    [139]    valid_0's auc: 0.961731
    [140]    valid_0's auc: 0.961736
    [141]    valid_0's auc: 0.961751
    [142]    valid_0's auc: 0.961759
    [143]    valid_0's auc: 0.961777
    [144]    valid_0's auc: 0.961777
    [145]    valid_0's auc: 0.961779
    [146]    valid_0's auc: 0.961782
    [147]    valid_0's auc: 0.961782
    [148]    valid_0's auc: 0.961796
    [149]    valid_0's auc: 0.961799
    [150]    valid_0's auc: 0.961806
    [151]    valid_0's auc: 0.961804
    [152]    valid_0's auc: 0.961805
    [153]    valid_0's auc: 0.961794
    [154]    valid_0's auc: 0.961802
    [155]    valid_0's auc: 0.961805
    [156]    valid_0's auc: 0.961821
    [157]    valid_0's auc: 0.961853
    [158]    valid_0's auc: 0.96187
    [159]    valid_0's auc: 0.961875
    [160]    valid_0's auc: 0.961877
    [161]    valid_0's auc: 0.961889
    [162]    valid_0's auc: 0.961894
    [163]    valid_0's auc: 0.961898
    [164]    valid_0's auc: 0.961901
    [165]    valid_0's auc: 0.961911
    [166]    valid_0's auc: 0.961911
    [167]    valid_0's auc: 0.961915
    [168]    valid_0's auc: 0.961925
    [169]    valid_0's auc: 0.961925
    [170]    valid_0's auc: 0.961929
    [171]    valid_0's auc: 0.961949
    [172]    valid_0's auc: 0.961945
    [173]    valid_0's auc: 0.961945
    [174]    valid_0's auc: 0.961944
    [175]    valid_0's auc: 0.961946
    [176]    valid_0's auc: 0.961952
    [177]    valid_0's auc: 0.961956
    [178]    valid_0's auc: 0.961958
    [179]    valid_0's auc: 0.961971
    [180]    valid_0's auc: 0.961998
    [181]    valid_0's auc: 0.961998
    [182]    valid_0's auc: 0.962014
    [183]    valid_0's auc: 0.962018
    [184]    valid_0's auc: 0.962016
    [185]    valid_0's auc: 0.962022
    [186]    valid_0's auc: 0.962031
    [187]    valid_0's auc: 0.96203
    [188]    valid_0's auc: 0.962021
    [189]    valid_0's auc: 0.962021
    [190]    valid_0's auc: 0.962022
    [191]    valid_0's auc: 0.962026
    [192]    valid_0's auc: 0.962038
    [193]    valid_0's auc: 0.962042
    [194]    valid_0's auc: 0.962041
    [195]    valid_0's auc: 0.962035
    [196]    valid_0's auc: 0.962037
    [197]    valid_0's auc: 0.962048
    [198]    valid_0's auc: 0.962054
    [199]    valid_0's auc: 0.962052
    [200]    valid_0's auc: 0.962054
    [201]    valid_0's auc: 0.962041
    [202]    valid_0's auc: 0.962041
    [203]    valid_0's auc: 0.962052
    [204]    valid_0's auc: 0.962051
    [205]    valid_0's auc: 0.962056
    [206]    valid_0's auc: 0.962056
    [207]    valid_0's auc: 0.962069
    [208]    valid_0's auc: 0.962072
    [209]    valid_0's auc: 0.962072
    [210]    valid_0's auc: 0.962062
    [211]    valid_0's auc: 0.962064
    [212]    valid_0's auc: 0.962066
    [213]    valid_0's auc: 0.962066
    [214]    valid_0's auc: 0.962066
    [215]    valid_0's auc: 0.962064
    [216]    valid_0's auc: 0.96206
    [217]    valid_0's auc: 0.962059
    [218]    valid_0's auc: 0.962059
    Early stopping, best iteration is:
    [208]    valid_0's auc: 0.962072

    Evaluate the model

    Finally, with the model trained, we evaluate its performance on the test set.

    from sklearn import metrics
    
    ypred = bst.predict(test[feature_cols])
    score = metrics.roc_auc_score(test['is_attributed'], ypred)
    print(f"Test score: {score}")
    Test score: 0.9726727334566094

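    This will be our baseline score for the model. When we transform features, add new ones, or perform feature selection, we should be improving on this score. However, since this is the test set, we only want to look at it at the end of all our manipulations. At the very end of this course you will look at the test score again to see whether you improved on the baseline model.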

  • Original article: https://www.cnblogs.com/caishunzhe/p/13594055.html