1 Overview
2 Lessons from the top solutions' processing approaches
2.1 Removing outliers
- Long streaks of constant values
- Large positive/negative spikes
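The two outlier types above can be flagged with a few lines of pandas. This is a minimal sketch, not the competitors' actual code: the function name `flag_outliers` and the thresholds `max_run` and `spike_z` are hypothetical and would need tuning per meter.

```python
import numpy as np
import pandas as pd

def flag_outliers(readings, max_run=48, spike_z=8.0):
    """Return a boolean mask that is True where a reading looks anomalous.

    max_run and spike_z are illustrative thresholds, not values from the notes.
    """
    # Give every run of identical consecutive values its own id,
    # then measure how long each run is.
    run_id = (readings != readings.shift()).cumsum()
    run_len = run_id.groupby(run_id).transform("count")
    constant_streak = run_len >= max_run

    # Robust z-score based on the median absolute deviation (MAD),
    # so the spikes themselves do not inflate the scale estimate.
    median = readings.median()
    mad = (readings - median).abs().median()
    robust_z = (readings - median).abs() / (1.4826 * mad + 1e-9)
    spike = robust_z > spike_z

    return constant_streak | spike

rng = np.random.default_rng(0)
values = rng.normal(100.0, 5.0, size=200)
values[50:110] = 42.0     # a 60-step constant streak
values[150] = 10_000.0    # one large positive spike
mask = flag_outliers(pd.Series(values))
cleaned = pd.Series(values)[~mask]
```

In practice the mask would probably be computed per building and meter, since a normal level for one meter can be a spike for another.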
2.2 Missing values
2.3 Objective function
2.4 Feature engineering
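- Categorical interactions, such as concatenating building_id and meter into a new categorical feature, building_id_meter.
- Count (frequency) of features: how often each feature value occurs.
- Smoothed temperature features, plus their 1st- and 2nd-order derivatives, computed with a Savitzky-Golay filter.
- Cyclic encoding of periodic features; e.g., hour gets mapped to hour_x = cos(2*pi*hour/24) and hour_y = sin(2*pi*hour/24). This is a neat trick: a periodic feature is encoded with cos and sin, so that hour 23 and hour 0 end up close together.
- Bayesian target encoding. This is a target encoding the author wrote themselves; section 2.4.2 covers it in detail.
- 3rd place's approach: short on time, the author only removed some outliers. Rows were dropped when all readings at the same site were 0 at the same time, and then some of the largest outliers were removed.
- Temperature lags, which several high-ranking authors mentioned.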
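Several of the features listed above are short pandas expressions. The sketch below uses toy data; the column names `site_id` and `air_temperature`, the derived name `building_id_count`, and the one-step lag are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "building_id": [0, 0, 1, 1, 1, 2],
    "meter": [0, 1, 0, 0, 1, 0],
    "hour": [0, 6, 12, 18, 23, 6],
})

# Categorical interaction: concatenate building_id and meter.
df["building_id_meter"] = df["building_id"].astype(str) + "_" + df["meter"].astype(str)

# Count (frequency) encoding: how often each building_id occurs.
df["building_id_count"] = df["building_id"].map(df["building_id"].value_counts())

# Cyclic encoding: place the hour on the unit circle.
angle = 2 * np.pi * df["hour"] / 24
df["hour_x"] = np.cos(angle)
df["hour_y"] = np.sin(angle)

# Temperature lag: previous reading within the same site (assumed layout).
weather = pd.DataFrame({
    "site_id": [0, 0, 0, 1, 1, 1],
    "timestamp": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00"] * 2
    ),
    "air_temperature": [10.0, 11.0, 12.5, 3.0, 2.5, 2.0],
})
weather = weather.sort_values(["site_id", "timestamp"])
weather["air_temperature_lag1"] = weather.groupby("site_id")["air_temperature"].shift(1)
```

Grouping by site before shifting keeps the first reading of one site from inheriting the last reading of another.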
2.4.1 Savitzky-Golay filter
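- Savitzky-Golay convolution smoothing is an improvement on moving-average smoothing.
- The key step in Savitzky-Golay smoothing is solving for the matrix operator.
- In the first plot, the blue line is the original data.
- In the first plot, the yellow line is the data after S-G smoothing.
- In the second plot, the blue line is the first derivative of the S-G-smoothed data.
- In the second plot, the yellow line is the second derivative of the S-G-smoothed data.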
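The smoothed series and its derivatives described above can be reproduced with `scipy.signal.savgol_filter`. This is a sketch on synthetic temperatures; the window length of 11 and the polynomial order of 3 are assumed values, not the ones used in the competition.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic hourly temperature: a daily cycle plus measurement noise.
rng = np.random.default_rng(0)
hours = np.arange(24 * 7)
clean = 15 + 8 * np.sin(2 * np.pi * hours / 24)
noisy = clean + rng.normal(0, 1.0, size=hours.size)

window_length, polyorder = 11, 3  # illustrative choices

# deriv=0 smooths; deriv=1 and deriv=2 return derivatives of the fitted
# local polynomial. delta is the sample spacing (1 hour here).
smooth = savgol_filter(noisy, window_length, polyorder)
d1 = savgol_filter(noisy, window_length, polyorder, deriv=1, delta=1.0)
d2 = savgol_filter(noisy, window_length, polyorder, deriv=2, delta=1.0)
```

Taking the derivatives from the filter directly, rather than differencing the smoothed series, avoids amplifying whatever noise is left after smoothing.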
2.4.2 Bayesian target encoding (Python implementation)
import gc
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error


class GaussianTargetEncoder():

    def __init__(self, group_cols, target_col="target", prior_cols=None):
        self.group_cols = group_cols
        self.target_col = target_col
        self.prior_cols = prior_cols

    def _get_prior(self, df):
        # Without prior columns, fall back to the global target mean;
        # otherwise average the lower-order encodings row by row.
        if self.prior_cols is None:
            prior = np.full(len(df), df[self.target_col].mean())
        else:
            prior = df[self.prior_cols].mean(axis=1).values
        return prior

    def fit(self, df):
        # Per-group sample size, mean, variance, and average prior mean.
        self.stats = df.assign(mu_prior=self._get_prior(df), y=df[self.target_col])
        self.stats = self.stats.groupby(self.group_cols).agg(
            n=("y", "count"),
            mu_mle=("y", "mean"),
            sig2_mle=("y", "var"),
            mu_prior=("mu_prior", "mean"),
        )

    def transform(self, df, prior_precision=1000, stat_type="mean"):
        # Posterior precision = prior precision + data precision.
        precision = prior_precision + self.stats.n / self.stats.sig2_mle

        if stat_type == "mean":
            numer = (prior_precision * self.stats.mu_prior
                     + self.stats.n / self.stats.sig2_mle * self.stats.mu_mle)
            denom = precision
        elif stat_type == "var":
            numer = 1.0
            denom = precision
        elif stat_type == "precision":
            numer = precision
            denom = 1.0
        else:
            raise ValueError(f"stat_type={stat_type} not recognized.")

        # Map each group key to its posterior statistic.
        mapper = dict(zip(self.stats.index, numer / denom))
        if isinstance(self.group_cols, str):
            keys = df[self.group_cols].values.tolist()
        elif len(self.group_cols) == 1:
            keys = df[self.group_cols[0]].values.tolist()
        else:
            keys = zip(*[df[x] for x in self.group_cols])

        values = np.array([mapper.get(k) for k in keys]).astype(float)

        # Unseen groups and degenerate variances fall back to the prior.
        prior = self._get_prior(df)
        values[~np.isfinite(values)] = prior[~np.isfinite(values)]

        return values

    def fit_transform(self, df, *args, **kwargs):
        self.fit(df)
        return self.transform(df, *args, **kwargs)

# load data
train = pd.read_csv("/kaggle/input/ashrae-energy-prediction/train.csv")
test = pd.read_csv("/kaggle/input/ashrae-energy-prediction/test.csv")

# create target
train["target"] = np.log1p(train.meter_reading)
test["target"] = train.target.mean()

# create time features (modifies df in place)
def add_time_features(df):
    df.timestamp = pd.to_datetime(df.timestamp)
    df["hour"] = df.timestamp.dt.hour
    df["weekday"] = df.timestamp.dt.weekday
    df["month"] = df.timestamp.dt.month

add_time_features(train)
add_time_features(test)

# define groupings and corresponding priors
groups_and_priors = {

    # single encodings
    ("hour",): None,
    ("weekday",): None,
    ("month",): None,
    ("building_id",): None,
    ("meter",): None,

    # second-order interactions
    ("meter", "hour"): ["gte_meter", "gte_hour"],
    ("meter", "weekday"): ["gte_meter", "gte_weekday"],
    ("meter", "month"): ["gte_meter", "gte_month"],
    ("meter", "building_id"): ["gte_meter", "gte_building_id"],

    # higher-order interactions
    ("meter", "building_id", "hour"): ["gte_meter_building_id", "gte_meter_hour"],
    ("meter", "building_id", "weekday"): ["gte_meter_building_id", "gte_meter_weekday"],
    ("meter", "building_id", "month"): ["gte_meter_building_id", "gte_meter_month"],
}

# The value of PRIOR_PRECISION is not shown in these notes;
# 10 is an assumed placeholder.
PRIOR_PRECISION = 10

# Encode in dictionary order so each prior column exists before it is used.
features = []
for group_cols, prior_cols in groups_and_priors.items():
    features.append(f"gte_{'_'.join(group_cols)}")
    gte = GaussianTargetEncoder(list(group_cols), "target", prior_cols)
    train[features[-1]] = gte.fit_transform(train, PRIOR_PRECISION)
    test[features[-1]] = gte.transform(test, PRIOR_PRECISION)

def rmsle(x, y):
    # root mean squared logarithmic error
    return np.sqrt(mean_squared_error(np.log1p(x), np.log1p(y)))

train_preds = np.zeros(len(train))
test_preds = np.zeros(len(test))

for m in range(4):

    print(f"Meter {m}", end="")

    # instantiate model
    model = RidgeCV(
        alphas=np.logspace(-10, 1, 25),
    )

    # fit model
    model.fit(
        X=train.loc[train.meter==m, features].values,
        y=train.loc[train.meter==m, "target"].values
    )

    # make predictions
    train_preds[train.meter==m] = model.predict(train.loc[train.meter==m, features].values)
    test_preds[test.meter==m] = model.predict(test.loc[test.meter==m, features].values)

    # transform predictions
    train_preds[train_preds < 0] = 0
    train_preds[train.meter==m] = np.expm1(train_preds[train.meter==m])

    test_preds[test_preds < 0] = 0
    test_preds[test.meter==m] = np.expm1(test_preds[test.meter==m])

    # evaluate model
    meter_rmsle = rmsle(
        train_preds[train.meter==m],
        train.loc[train.meter==m, "meter_reading"].values
    )

    print(f", rmsle={meter_rmsle:0.5f}")

print(f"Overall rmsle={rmsle(train_preds, train.meter_reading.values):0.5f}")

del train, train_preds, test
gc.collect()
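As far as the code above shows, `transform` applies the standard conjugate update for a Gaussian mean with a Gaussian prior. The symbols below are introduced here for illustration: $\tau_0$ is `prior_precision`, $\mu_0$ is `mu_prior`, and $n$, $\hat\mu$, $\hat\sigma^2$ are the per-group `n`, `mu_mle`, and `sig2_mle`.

```latex
\begin{aligned}
\tau_{\text{post}} &= \tau_0 + \frac{n}{\hat\sigma^2}
  && \text{(stat\_type = "precision")} \\
\mu_{\text{post}} &= \frac{\tau_0\,\mu_0 + \dfrac{n}{\hat\sigma^2}\,\hat\mu}{\tau_{\text{post}}}
  && \text{(stat\_type = "mean")} \\
\sigma^2_{\text{post}} &= \frac{1}{\tau_{\text{post}}}
  && \text{(stat\_type = "var")}
\end{aligned}
```

As a worked example with made-up numbers, take $\tau_0 = 1000$, $\mu_0 = 3$, $n = 100$, $\hat\sigma^2 = 4$, and $\hat\mu = 5$. The data precision is $100/4 = 25$, so $\tau_{\text{post}} = 1025$ and $\mu_{\text{post}} = (3000 + 125)/1025 \approx 3.049$. With a prior this strong, the encoding stays close to the prior mean; a smaller $\tau_0$ lets the group mean dominate.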
2.5 Model ensembling
- 2nd place's thinking: "Due to the size of the dataset and difficulty in setting up a robust validation framework, we did not focus much on feature engineering, fearing it might not extrapolate cleanly to the test data. Instead we chose to ensemble as many different models as possible to capture more information and help the predictions to be stable across years." In their past experience, building good features without a reliable validation framework is very tricky.
- 2nd place's thinking: "We bagged a bunch of boosting models XGB, LGBM, CB at various levels of data: Models for every site+meter, models for every building+meter, models for every building-type+meter and models using entire train data. It was very useful to build a separate model for each site so that the model could capture site-specific patterns and each site could be fitted with a different parameter set suitable for it. It also automatically solved for issues like timestamp alignment and feature measurement scale being different across sites so we didn't have to solve for them separately." With a separate model for each of these groupings, the authors built roughly 5000 models in total for this competition and ensembled them.
- 3rd place's thinking: "Given diverse experiments with different CV schemes I did over the period of the competition, I decided to simply combine all the results (over 30) I got into a single submission using a simple average after selection by pearson correlation (6th on private LB)." Short on time, the author simply blended all of the results, more than 30 of them, with a plain average, using the Pearson correlation coefficient to select which results to average.
- It is worth noting that even though LightGBM outperforms XGB and CatBoost, all three get used in competitions, possibly because they extract different information. Some people also use CNNs and FFNNs.
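The notes do not say how the 3rd-place author applied the Pearson correlation when selecting submissions. One possible rule, sketched below purely as an assumption, is to drop near-duplicate submissions greedily and average the rest; the function name `select_and_average` and the 0.99 threshold are hypothetical.

```python
import numpy as np

def select_and_average(preds, max_corr=0.99):
    """Greedy selection (an assumed rule): keep a submission only if its
    Pearson correlation with every already-kept one is below max_corr,
    then return the plain average of the kept submissions."""
    corr = np.corrcoef(preds)  # each row of preds is one submission
    kept = [0]
    for i in range(1, preds.shape[0]):
        if all(corr[i, j] < max_corr for j in kept):
            kept.append(i)
    return preds[kept].mean(axis=0), kept

rng = np.random.default_rng(0)
truth = rng.normal(size=500)
sub_a = truth + rng.normal(0, 0.5, size=500)
sub_b = sub_a + rng.normal(0, 0.01, size=500)  # near-duplicate of sub_a
sub_c = truth + rng.normal(0, 0.5, size=500)   # independent errors

blend, kept = select_and_average(np.vstack([sub_a, sub_b, sub_c]))
```

Here `sub_b` adds almost nothing beyond `sub_a`, so it is skipped, and the blend averages two submissions whose errors partly cancel.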
2.5.1 Pearson correlation (with Python implementation)
The Pearson correlation coefficient measures how closely two data sets fall on a straight line; that is, it measures the linear relationship between interval-scale variables.
from scipy.stats import pearsonr
pearsonr(x, y)

from pydoc import help
help(pearsonr)

Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
    Calculates a Pearson correlation coefficient and the p-value for testing
    non-correlation.

    The Pearson correlation coefficient measures the linear relationship
    between two datasets. Strictly speaking, Pearson's correlation requires
    that each dataset be normally distributed. Like other correlation
    coefficients, this one varies between -1 and +1 with 0 implying no
    correlation. Correlations of -1 or +1 imply an exact linear
    relationship. Positive correlations imply that as x increases, so does
    y. Negative correlations imply that as x increases, y decreases.

    The p-value roughly indicates the probability of an uncorrelated system
    producing datasets that have a Pearson correlation at least as extreme
    as the one computed from these datasets. The p-values are not entirely
    reliable but are probably reasonable for datasets larger than 500 or so.

    Parameters
    ----------
    x : 1D array
    y : 1D array the same length as x

    Returns
    -------
    (Pearson's correlation coefficient,
     2-tailed p-value)
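A quick check of the behaviour the help text describes, on made-up data: an exactly linear increasing relationship should give a coefficient of +1, and an exactly linear decreasing one should give -1.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.arange(10, dtype=float)
y_increasing = 3.0 * x + 2.0   # exact positive linear relationship
y_decreasing = -0.5 * x + 1.0  # exact negative linear relationship

# pearsonr returns the coefficient and the two-tailed p-value.
r_pos, p_pos = pearsonr(x, y_increasing)
r_neg, p_neg = pearsonr(x, y_decreasing)
```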
2.6 Why does postprocessing work? 2nd place magic
- 2nd place's approach: after predicting, multiply the predictions by a constant, typically between 0.8 and 1.1. This step is called postprocessing.
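The notes do not say how the constant was chosen. One plausible approach, sketched here as an assumption, is to scan the 0.8 to 1.1 range on held-out data and keep the multiplier with the lowest RMSLE; `best_multiplier` is a hypothetical helper and the data is synthetic.

```python
import numpy as np

def rmsle(y_true, y_pred):
    # root mean squared logarithmic error
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def best_multiplier(y_true, y_pred, candidates=None):
    """Return the candidate constant that minimises RMSLE, with its score."""
    if candidates is None:
        candidates = np.linspace(0.8, 1.1, 31)  # 0.80, 0.81, ..., 1.10
    scores = [rmsle(y_true, c * y_pred) for c in candidates]
    best = int(np.argmin(scores))
    return float(candidates[best]), scores[best]

# Synthetic holdout: predictions that run about 10% too high, plus noise.
rng = np.random.default_rng(0)
y_true = rng.gamma(shape=2.0, scale=50.0, size=5000)
y_pred = y_true * 1.1 * np.exp(rng.normal(0, 0.1, size=y_true.size))

c, score = best_multiplier(y_true, y_pred)
```

Because RMSLE works on the log scale, a constant multiplier acts like a constant shift of the log predictions, so it mainly corrects a systematic over- or under-prediction.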
