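Learning feature engineering helps us find the most effective ways to improve a model.
In this course you will learn a practical approach to feature engineering. You will be able to apply what you learn to Kaggle competitions and other machine learning applications.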
Baseline model
First, build a basic working model.
Load the data
We will work with data from Kickstarter projects. The first few rows of the data look like this:
| | ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd pledged | usd_pledged_real | usd_goal_real |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000.0 | 2015-08-11 12:12:28 | 0.0 | failed | 0 | GB | 0.0 | 0.0 | 1533.95 |
1 | 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000.0 | 2017-09-02 04:43:57 | 2421.0 | failed | 15 | US | 100.0 | 2421.0 | 30000.00 |
2 | 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000.0 | 2013-01-12 00:20:50 | 220.0 | failed | 3 | US | 220.0 | 220.0 | 45000.00 |
3 | 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000.0 | 2012-03-17 03:24:11 | 1.0 | failed | 1 | US | 1.0 | 1.0 | 5000.00 |
4 | 1000011046 | Community Film Project: The Art of Neighborhoo... | Film & Video | Film & Video | USD | 2015-08-29 | 19500.0 | 2015-07-04 08:35:03 | 1283.0 | canceled | 14 | US | 1283.0 | 1283.0 | 19500.00 |
5 | 1000014025 | Monarch Espresso Bar | Restaurants | Food | USD | 2016-04-01 | 50000.0 | 2016-02-26 13:38:27 | 52375.0 | successful | 224 | US | 52375.0 | 52375.0 | 50000.00 |
The state column shows the outcome of each project.
print('Unique values in `state` column:', list(ks.state.unique()))
Output:
Unique values in `state` column: ['failed', 'canceled', 'successful', 'live', 'undefined', 'suspended']
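Using this data, how can we use features such as project category, currency, funding goal, and country to predict whether a Kickstarter project will succeed?
Prepare the target column
First, we convert the state column into a target we can use in a model. Data cleaning is not the focus here, so we simplify the example as follows:
1. Drop projects that are "live"
2. Count the "successful" state as outcome = 1
3. Combine all other states as outcome = 0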
# Drop live projects
ks = ks.query('state != "live"')

# Add outcome column, "successful" == 1, others are 0
ks = ks.assign(outcome=(ks['state'] == 'successful').astype(int))
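Convert timestamps
Next, we convert the launched feature into categorical features we can use in a model. Because the column was loaded as timestamp data, we can access date and time values through the .dt attribute on the timestamp column.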
ks = ks.assign(hour=ks.launched.dt.hour,
               day=ks.launched.dt.day,
               month=ks.launched.dt.month,
               year=ks.launched.dt.year)
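The loading step for ks is not shown in this post, so the following is only a minimal sketch of how a column ends up as timestamp data. It assumes the data is read with pandas and uses the parse_dates argument of read_csv; the two timestamps are copied from the first rows of the table above, while the inline CSV and the project names are placeholders.

```python
import io
import pandas as pd

# Two rows in the same shape as the `launched` column (names are placeholders)
csv_text = io.StringIO(
    "name,launched\n"
    "project_a,2015-08-11 12:12:28\n"
    "project_b,2017-09-02 04:43:57\n"
)

# parse_dates tells pandas to load the column as datetime64 instead of strings
toy = pd.read_csv(csv_text, parse_dates=["launched"])

# The .dt accessor is only available on datetime-like columns
toy = toy.assign(hour=toy.launched.dt.hour,
                 day=toy.launched.dt.day,
                 month=toy.launched.dt.month,
                 year=toy.launched.dt.year)
print(toy[["hour", "day", "month", "year"]])
```

Without parse_dates the column would be loaded as plain strings, and accessing .dt on it would raise an error.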
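Prepare categorical variables
Now, for the categorical variables (category, currency, and country), we need to convert them into integers so our model can use the data. For this we use scikit-learn's LabelEncoder, which assigns an integer to each value of a categorical feature.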
from sklearn.preprocessing import LabelEncoder

cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded = ks[cat_features].apply(encoder.fit_transform)
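To see what the encoder does to a single column, here is a small sketch on made-up currency values (the list below is illustrative and not taken from the Kickstarter data).

```python
from sklearn.preprocessing import LabelEncoder

currencies = ["USD", "GBP", "USD", "EUR"]  # made-up sample values

encoder = LabelEncoder()
codes = encoder.fit_transform(currencies)

# classes_ holds the sorted unique values; each value's code is its position there
print(encoder.classes_.tolist())   # ['EUR', 'GBP', 'USD']
print(codes.tolist())              # [2, 1, 2, 0]

# inverse_transform maps the integers back to the original labels
print(encoder.inverse_transform(codes).tolist())  # ['USD', 'GBP', 'USD', 'EUR']
```

The integers follow the sorted order of the labels and carry no other meaning. Note also that fit_transform refits the encoder on every call, so after applying it column by column the encoder object only remembers the last column it saw.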
We collect all of these features in a new DataFrame that we can use to train a model.
# ks and encoded have the same index, so they can be joined directly
data = ks[['goal', 'hour', 'day', 'month', 'year', 'outcome']].join(encoded)
data.head()
Output:
| | goal | hour | day | month | year | outcome | category | currency | country |
---|---|---|---|---|---|---|---|---|---|
0 | 1000.0 | 12 | 11 | 8 | 2015 | 0 | 108 | 5 | 9 |
1 | 30000.0 | 4 | 2 | 9 | 2017 | 0 | 93 | 13 | 22 |
2 | 45000.0 | 0 | 12 | 1 | 2013 | 0 | 93 | 13 | 22 |
3 | 5000.0 | 3 | 17 | 3 | 2012 | 0 | 90 | 13 | 22 |
4 | 19500.0 | 8 | 4 | 7 | 2015 | 0 | 55 | 13 | 22 |
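Create training, validation, and test splits
We need to create data sets for training, validation, and testing. We use a fairly simple approach and split the data using slices: 10% of the data as a validation set, 10% for testing, and the remaining 80% for training.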
valid_fraction = 0.1
valid_size = int(len(data) * valid_fraction)

train = data[:-2 * valid_size]
valid = data[-2 * valid_size:-valid_size]
test = data[-valid_size:]
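As a quick sanity check of the slicing arithmetic, the same logic can be wrapped in a function and run on a toy DataFrame. The helper name slice_split and the 100-row frame are hypothetical, introduced only for this illustration.

```python
import pandas as pd

def slice_split(df, valid_fraction=0.1):
    """Split rows by position: train first, then validation, then test."""
    valid_size = int(len(df) * valid_fraction)
    train = df[:-2 * valid_size]
    valid = df[-2 * valid_size:-valid_size]
    test = df[-valid_size:]
    return train, valid, test

toy = pd.DataFrame({"x": range(100)})  # made-up data
train, valid, test = slice_split(toy)
print(len(train), len(valid), len(test))  # 80 10 10
```

One caveat with this approach: if the frame is so small that valid_size rounds down to 0, the negative slices no longer behave as intended, so it only makes sense for reasonably large data sets.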
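Train a model
In this course we use a LightGBM model. This is a tree-based model that typically provides strong performance, even compared with XGBoost. It is also relatively fast to train.
We won't do hyperparameter optimization because that isn't the goal of this course, so our models won't reach the best performance you could possibly get. You will still see model performance improve as we do feature engineering.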
import lightgbm as lgb

feature_cols = train.columns.drop('outcome')

dtrain = lgb.Dataset(train[feature_cols], label=train['outcome'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['outcome'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
# Note: early_stopping_rounds and verbose_eval are accepted by lgb.train in LightGBM 3.x.
# LightGBM 4.0+ removed them; use the lgb.early_stopping and lgb.log_evaluation callbacks instead.
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                early_stopping_rounds=10, verbose_eval=False)
early_stopping_rounds: training stops if the validation score has not improved for this many rounds.
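The stopping rule can be sketched in plain Python. This is a simplified illustration of the idea for a higher-is-better metric such as AUC, not LightGBM's actual implementation; the function name and the score list are made up.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_round, rounds_run) for higher-is-better validation scores.

    Rounds are 1-indexed. Training stops once `patience` consecutive
    rounds have passed without beating the best score so far.
    """
    best_score = float("-inf")
    best_round = 0
    for round_number, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_number
        elif round_number - best_round >= patience:
            return best_round, round_number
    return best_round, len(scores)

toy_scores = [0.70, 0.72, 0.74, 0.73, 0.74, 0.73, 0.75]  # made-up AUC values
print(best_round_with_early_stopping(toy_scores, patience=3))  # (3, 6)
```

Round 3 is the last improvement, so with a patience of 3 the loop stops at round 6 and never reaches the better score in round 7. The training log later in this post shows the same pattern: the best iteration is 208 and training stops at round 218 with early_stopping_rounds=10.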
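Make predictions and evaluate the model
Finally, let's make predictions on the test set with the model and see how well it performs. An important thing to remember is that you can overfit to the validation data. This is why we need a test set that the model never sees until the final evaluation.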
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['outcome'], ypred)

print(f"Test AUC score: {score}")
Test AUC score: 0.747615303004287
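For intuition about the AUC metric used here: it can be read as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks this interpretation against roc_auc_score on four made-up predictions; pairwise_auc is a hypothetical helper written for this illustration.

```python
from itertools import product
from sklearn import metrics

y_true = [0, 0, 1, 1]            # made-up labels
y_score = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities

def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

print(pairwise_auc(y_true, y_score))            # 0.75
print(metrics.roc_auc_score(y_true, y_score))   # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75. A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.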
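Exercise
In the exercise you will work with data from the TalkingData AdTracking competition. The goal of the competition is to predict whether a user will download an app after clicking on an ad.
For this course you will use a small sample of the data, with 99% of the negative records (where the app wasn't downloaded) dropped to make the target more balanced.
After building a baseline model, you will be able to see how your feature engineering and selection efforts improve the model's performance.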
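How the course sample was actually produced is not shown here. As a rough sketch of the idea, dropping 99% of the negative records while keeping every positive one could look like this with pandas; the toy frame and the random seed are assumptions made for illustration.

```python
import pandas as pd

# Made-up click log: 1,000 negatives and 10 positives (illustrative only)
raw = pd.DataFrame({"is_attributed": [0] * 1000 + [1] * 10})

positives = raw[raw["is_attributed"] == 1]
negatives = raw[raw["is_attributed"] == 0]

# Keep 1% of the negatives (i.e. drop 99%); random_state fixes the sample
kept_negatives = negatives.sample(frac=0.01, random_state=0)

# Recombine and restore the original row order
sample = pd.concat([positives, kept_negatives]).sort_index()
print(sample["is_attributed"].value_counts())
```

In this toy case the classes go from a 100:1 ratio to 10 rows each, which is the kind of rebalancing the paragraph above describes.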
Baseline Model
The first thing you need to do is construct a baseline model. We start by looking at the data.
import pandas as pd

click_data = pd.read_csv('../input/feature-engineering-data/train_sample.csv',
                         parse_dates=['click_time'])
click_data.head()
| | ip | app | device | os | channel | click_time | attributed_time | is_attributed |
---|---|---|---|---|---|---|---|---|
0 | 89489 | 3 | 1 | 13 | 379 | 2017-11-06 15:13:23 | NaN | 0 |
1 | 204158 | 35 | 1 | 13 | 21 | 2017-11-06 15:41:07 | 2017-11-07 08:17:19 | 1 |
2 | 3437 | 6 | 1 | 13 | 459 | 2017-11-06 15:42:32 | NaN | 0 |
3 | 167543 | 3 | 1 | 13 | 379 | 2017-11-06 15:56:17 | NaN | 0 |
4 | 147509 | 3 | 1 | 13 | 379 | 2017-11-06 15:57:01 | NaN | 0 |
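1) Construct features from timestamps
Notice that the click_time column of the click_data DataFrame contains timestamp data. Use this column to create features for the corresponding day, hour, minute, and second. Store these as new integer columns day, hour, minute, and second in a new DataFrame, clicks.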
# Add new columns for timestamp features day, hour, minute, and second
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
# Fill in the rest
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

# Check your answer
q_1.check()
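2) Label encoding
For each of the categorical features ['ip', 'app', 'device', 'os', 'channel'], use scikit-learn's LabelEncoder to create new features in the clicks DataFrame. The new column names should be the original column name with '_labels' appended, like ip_labels.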
from sklearn import preprocessing

cat_features = ['ip', 'app', 'device', 'os', 'channel']
encoder = preprocessing.LabelEncoder()

# Create new columns in clicks using preprocessing.LabelEncoder()
for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded

# Check your answer
q_2.check()
Inspect the table after processing.
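3) One-hot encoding
In the code cell above you used label-encoded features. Would it have been okay to use one-hot encoding instead for the categorical variables 'ip', 'app', 'device', 'os', or 'channel'?
The ip column has 58,000 values, which means it would create an extremely sparse matrix with 58,000 columns. That many columns would make your model run very slowly, so in general you want to avoid one-hot encoding features with many levels. LightGBM models work with label-encoded features, so you don't actually need to one-hot encode the categorical features.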
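The column blow-up is easy to reproduce at a smaller scale. The sketch below uses a made-up column with 500 distinct values (standing in for the much larger ip column) and compares pandas get_dummies with LabelEncoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up high-cardinality column: 1,000 rows, 500 distinct "ip" values
toy = pd.DataFrame({"ip": [f"ip_{i % 500}" for i in range(1000)]})

one_hot = pd.get_dummies(toy["ip"])
labels = LabelEncoder().fit_transform(toy["ip"])

print(one_hot.shape)  # (1000, 500): one column per distinct value
print(labels.shape)   # (1000,): a single integer column
```

Each one-hot row contains exactly one nonzero entry, so with 500 levels only 0.2% of the matrix is nonzero; with 58,000 levels the matrix would be far sparser and far wider.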
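Train, validation, and test sets
With the baseline features ready, we need to split the data into training and validation sets. We should also hold out a test set to measure the final accuracy of the model.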
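4) Train/test splits with time series data
This is time series data. Are there any special considerations when creating train/test splits for time series? If so, what are they?
Since our model is meant to predict events in the future, we must also validate the model on events in the future. If the data is mixed up between the training and test sets, then future data will leak into the model and our validation results will overestimate the performance on new data.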
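The leak described above can be made concrete with a toy example. The sketch below compares a chronological split with a shuffled split on made-up daily timestamps; has_future_leak is a hypothetical helper that simply checks whether any training row is later than the earliest test row.

```python
import pandas as pd

# Made-up daily click times (illustrative only)
toy = pd.DataFrame({"click_time": pd.date_range("2017-11-06", periods=100, freq="D")})

# Chronological split: sort by time, then hold out the last 20%
ordered = toy.sort_values("click_time")
train_time, test_time = ordered[:80], ordered[80:]

# Shuffled split: the same sizes, but rows are assigned at random
shuffled = toy.sample(frac=1.0, random_state=0)
train_rand, test_rand = shuffled[:80], shuffled[80:]

def has_future_leak(train, test):
    """True if any training row is later than the earliest test row."""
    return train["click_time"].max() > test["click_time"].min()

print(has_future_leak(train_time, test_time))  # False
print(has_future_leak(train_rand, test_rand))  # True
```

In the shuffled split the model is trained on rows that come after some of the rows it is evaluated on, which is exactly the situation that inflates validation scores.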
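Create train/validation/test splits
Here we create the training, validation, and test splits. First, the clicks DataFrame is sorted in order of increasing time. The first 80% of the rows are the training set, the next 10% are the validation set, and the last 10% are the test set. Ordinary k-fold cross-validation is not appropriate here, for the leakage reason given above.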
feature_cols = ['day', 'hour', 'minute', 'second',
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]
Train with LightGBM
Now we can create LightGBM dataset objects for each of the smaller datasets and train the baseline model.
import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
# Note: early_stopping_rounds is accepted by lgb.train in LightGBM 3.x.
# LightGBM 4.0+ removed it; use the lgb.early_stopping callback instead.
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                early_stopping_rounds=10)
Output:
[1] valid_0's auc: 0.948979 Training until validation scores don't improve for 10 rounds. [2] valid_0's auc: 0.949235 [3] valid_0's auc: 0.950126 [4] valid_0's auc: 0.950072 [5] valid_0's auc: 0.950536 [6] valid_0's auc: 0.950943 [7] valid_0's auc: 0.951453 [8] valid_0's auc: 0.951518 [9] valid_0's auc: 0.952385 [10] valid_0's auc: 0.952434 [11] valid_0's auc: 0.952465 [12] valid_0's auc: 0.952638 [13] valid_0's auc: 0.95266 [14] valid_0's auc: 0.952766 [15] valid_0's auc: 0.953203 [16] valid_0's auc: 0.953503 [17] valid_0's auc: 0.953793 [18] valid_0's auc: 0.953966 [19] valid_0's auc: 0.954184 [20] valid_0's auc: 0.9543 [21] valid_0's auc: 0.954305 [22] valid_0's auc: 0.954536 [23] valid_0's auc: 0.954748 [24] valid_0's auc: 0.955142 [25] valid_0's auc: 0.955493 [26] valid_0's auc: 0.955611 [27] valid_0's auc: 0.955708 [28] valid_0's auc: 0.955795 [29] valid_0's auc: 0.956172 [30] valid_0's auc: 0.95623 [31] valid_0's auc: 0.956477 [32] valid_0's auc: 0.956606 [33] valid_0's auc: 0.956864 [34] valid_0's auc: 0.957204 [35] valid_0's auc: 0.957327 [36] valid_0's auc: 0.957408 [37] valid_0's auc: 0.957524 [38] valid_0's auc: 0.957659 [39] valid_0's auc: 0.957846 [40] valid_0's auc: 0.958042 [41] valid_0's auc: 0.958146 [42] valid_0's auc: 0.958181 [43] valid_0's auc: 0.958285 [44] valid_0's auc: 0.958433 [45] valid_0's auc: 0.95854 [46] valid_0's auc: 0.958625 [47] valid_0's auc: 0.958756 [48] valid_0's auc: 0.958863 [49] valid_0's auc: 0.958938 [50] valid_0's auc: 0.959046 [51] valid_0's auc: 0.95908 [52] valid_0's auc: 0.959147 [53] valid_0's auc: 0.9592 [54] valid_0's auc: 0.959259 [55] valid_0's auc: 0.959311 [56] valid_0's auc: 0.959324 [57] valid_0's auc: 0.959348 [58] valid_0's auc: 0.959435 [59] valid_0's auc: 0.959463 [60] valid_0's auc: 0.95949 [61] valid_0's auc: 0.959562 [62] valid_0's auc: 0.959721 [63] valid_0's auc: 0.959729 [64] valid_0's auc: 0.959773 [65] valid_0's auc: 0.959809 [66] valid_0's auc: 0.959868 [67] valid_0's auc: 0.959921 [68] 
valid_0's auc: 0.959994 [69] valid_0's auc: 0.960065 [70] valid_0's auc: 0.96011 [71] valid_0's auc: 0.960133 [72] valid_0's auc: 0.960275 [73] valid_0's auc: 0.960299 [74] valid_0's auc: 0.960336 [75] valid_0's auc: 0.960365 [76] valid_0's auc: 0.960411 [77] valid_0's auc: 0.960488 [78] valid_0's auc: 0.960523 [79] valid_0's auc: 0.960563 [80] valid_0's auc: 0.960624 [81] valid_0's auc: 0.960665 [82] valid_0's auc: 0.960724 [83] valid_0's auc: 0.960724 [84] valid_0's auc: 0.960751 [85] valid_0's auc: 0.960799 [86] valid_0's auc: 0.960853 [87] valid_0's auc: 0.960876 [88] valid_0's auc: 0.960934 [89] valid_0's auc: 0.961012 [90] valid_0's auc: 0.961012 [91] valid_0's auc: 0.961065 [92] valid_0's auc: 0.961095 [93] valid_0's auc: 0.961131 [94] valid_0's auc: 0.961136 [95] valid_0's auc: 0.961155 [96] valid_0's auc: 0.961191 [97] valid_0's auc: 0.961189 [98] valid_0's auc: 0.961189 [99] valid_0's auc: 0.961224 [100] valid_0's auc: 0.961228 [101] valid_0's auc: 0.96125 [102] valid_0's auc: 0.961259 [103] valid_0's auc: 0.961289 [104] valid_0's auc: 0.961309 [105] valid_0's auc: 0.961309 [106] valid_0's auc: 0.96134 [107] valid_0's auc: 0.961373 [108] valid_0's auc: 0.961382 [109] valid_0's auc: 0.961391 [110] valid_0's auc: 0.961402 [111] valid_0's auc: 0.961449 [112] valid_0's auc: 0.96145 [113] valid_0's auc: 0.961482 [114] valid_0's auc: 0.961481 [115] valid_0's auc: 0.961492 [116] valid_0's auc: 0.961513 [117] valid_0's auc: 0.961531 [118] valid_0's auc: 0.961539 [119] valid_0's auc: 0.961563 [120] valid_0's auc: 0.961563 [121] valid_0's auc: 0.961568 [122] valid_0's auc: 0.961588 [123] valid_0's auc: 0.961599 [124] valid_0's auc: 0.961605 [125] valid_0's auc: 0.961605 [126] valid_0's auc: 0.96161 [127] valid_0's auc: 0.961626 [128] valid_0's auc: 0.961626 [129] valid_0's auc: 0.96163 [130] valid_0's auc: 0.961646 [131] valid_0's auc: 0.961678 [132] valid_0's auc: 0.961672 [133] valid_0's auc: 0.961673 [134] valid_0's auc: 0.96171 [135] valid_0's auc: 0.96171 
[136] valid_0's auc: 0.961724 [137] valid_0's auc: 0.961723 [138] valid_0's auc: 0.961726 [139] valid_0's auc: 0.961731 [140] valid_0's auc: 0.961736 [141] valid_0's auc: 0.961751 [142] valid_0's auc: 0.961759 [143] valid_0's auc: 0.961777 [144] valid_0's auc: 0.961777 [145] valid_0's auc: 0.961779 [146] valid_0's auc: 0.961782 [147] valid_0's auc: 0.961782 [148] valid_0's auc: 0.961796 [149] valid_0's auc: 0.961799 [150] valid_0's auc: 0.961806 [151] valid_0's auc: 0.961804 [152] valid_0's auc: 0.961805 [153] valid_0's auc: 0.961794 [154] valid_0's auc: 0.961802 [155] valid_0's auc: 0.961805 [156] valid_0's auc: 0.961821 [157] valid_0's auc: 0.961853 [158] valid_0's auc: 0.96187 [159] valid_0's auc: 0.961875 [160] valid_0's auc: 0.961877 [161] valid_0's auc: 0.961889 [162] valid_0's auc: 0.961894 [163] valid_0's auc: 0.961898 [164] valid_0's auc: 0.961901 [165] valid_0's auc: 0.961911 [166] valid_0's auc: 0.961911 [167] valid_0's auc: 0.961915 [168] valid_0's auc: 0.961925 [169] valid_0's auc: 0.961925 [170] valid_0's auc: 0.961929 [171] valid_0's auc: 0.961949 [172] valid_0's auc: 0.961945 [173] valid_0's auc: 0.961945 [174] valid_0's auc: 0.961944 [175] valid_0's auc: 0.961946 [176] valid_0's auc: 0.961952 [177] valid_0's auc: 0.961956 [178] valid_0's auc: 0.961958 [179] valid_0's auc: 0.961971 [180] valid_0's auc: 0.961998 [181] valid_0's auc: 0.961998 [182] valid_0's auc: 0.962014 [183] valid_0's auc: 0.962018 [184] valid_0's auc: 0.962016 [185] valid_0's auc: 0.962022 [186] valid_0's auc: 0.962031 [187] valid_0's auc: 0.96203 [188] valid_0's auc: 0.962021 [189] valid_0's auc: 0.962021 [190] valid_0's auc: 0.962022 [191] valid_0's auc: 0.962026 [192] valid_0's auc: 0.962038 [193] valid_0's auc: 0.962042 [194] valid_0's auc: 0.962041 [195] valid_0's auc: 0.962035 [196] valid_0's auc: 0.962037 [197] valid_0's auc: 0.962048 [198] valid_0's auc: 0.962054 [199] valid_0's auc: 0.962052 [200] valid_0's auc: 0.962054 [201] valid_0's auc: 0.962041 [202] valid_0's auc: 
0.962041 [203] valid_0's auc: 0.962052 [204] valid_0's auc: 0.962051 [205] valid_0's auc: 0.962056 [206] valid_0's auc: 0.962056 [207] valid_0's auc: 0.962069 [208] valid_0's auc: 0.962072 [209] valid_0's auc: 0.962072 [210] valid_0's auc: 0.962062 [211] valid_0's auc: 0.962064 [212] valid_0's auc: 0.962066 [213] valid_0's auc: 0.962066 [214] valid_0's auc: 0.962066 [215] valid_0's auc: 0.962064 [216] valid_0's auc: 0.96206 [217] valid_0's auc: 0.962059 [218] valid_0's auc: 0.962059 Early stopping, best iteration is: [208] valid_0's auc: 0.962072
Evaluate the model
Finally, with the model trained, we evaluate its performance on the test set.
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test score: {score}")
Test score: 0.9726727334566094
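This will be our baseline score for the model. When we transform features, add new ones, or perform feature selection, we should be improving on this score. However, since this is the test set, we only want to look at it at the end of all our manipulations. At the very end of this course you will look at the test score again to see whether you improved on the baseline model.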