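Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform a dataset of continuous attributes into one with only nominal attributes (that is, categorical features).
One-hot encoded discretized features can make a model more expressive while maintaining interpretability. For instance, preprocessing with a discretizer can introduce nonlinearity to linear models.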
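The effect on a linear model can be sketched as follows. This is a minimal illustration on synthetic data; the sine target, the choice of 10 uniform bins, and the variable names are assumptions made for the example and are not part of the guide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic 1-D regression problem (illustrative data, not from the guide).
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel()

# Plain linear model: can only fit a straight line through the sine curve.
linear = LinearRegression().fit(X, y)

# Same linear model on one-hot encoded bins: it learns one level per bin,
# i.e. a piecewise-constant (nonlinear) function of the original feature.
binned = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="uniform"),
    LinearRegression(),
).fit(X, y)

r2_linear = linear.score(X, y)
r2_binned = binned.score(X, y)
print(f"R^2 linear: {r2_linear:.3f}, R^2 binned: {r2_binned:.3f}")
```

Each coefficient of the binned model corresponds to one interval of the input, which is why the model stays easy to interpret.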
KBinsDiscretizer discretizes features into k bins (with the default strategy='quantile' the bins are not necessarily of equal width):
>>> import numpy as np
>>> from sklearn import preprocessing
>>> X = np.array([[ -3., 5., 15 ],
...               [  0., 6., 14 ],
...               [  6., 3., 11 ]])
>>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)
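By default the output is one-hot encoded into a sparse matrix (see Encoding categorical features), and this can be configured with the encode parameter. For each feature, the bin edges and the number of bins are computed during fit, and together they define the intervals. For the current example, these intervals are defined as: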
- feature 1: [-∞, -1), [-1, 2), [2, ∞)
- feature 2: [-∞, 5), [5, ∞)
- feature 3: [-∞, 14), [14, ∞)
Based on these bin intervals, X is transformed as follows:
>>> est.transform(X)
array([[ 0., 1., 1.],
       [ 1., 1., 1.],
       [ 2., 0., 0.]])
The resulting dataset contains ordinal attributes, which can be used further in a sklearn.pipeline.Pipeline.
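The mapping from intervals to bin indices can be reproduced by hand. The sketch below uses only the standard library and hard-codes the interior bin edges from the example; the helper name `to_bins` is hypothetical, and this illustrates the idea rather than how KBinsDiscretizer is implemented.

```python
from bisect import bisect_right

# Interior bin edges per feature, taken from the intervals of the example.
interior_edges = [[-1.0, 2.0], [5.0], [14.0]]

X = [[-3.0, 5.0, 15.0],
     [0.0, 6.0, 14.0],
     [6.0, 3.0, 11.0]]


def to_bins(rows, edges_per_feature):
    """Replace each value by the index of the left-closed interval it falls in."""
    return [
        [bisect_right(edges, value) for value, edges in zip(row, edges_per_feature)]
        for row in rows
    ]


binned = to_bins(X, interior_edges)
print(binned)  # [[0, 1, 1], [1, 1, 1], [2, 0, 0]]
```

Because the intervals are closed on the left, a value that equals an edge (such as 5 in the second feature) lands in the bin to the right of that edge, which is what `bisect_right` yields.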
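Discretization is similar to constructing histograms for continuous data. However, histograms focus on counting the feature values that fall into particular bins, whereas discretization focuses on assigning feature values to these bins.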
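The difference can be seen on a handful of values. In this small sketch the data and the bin edges are made up for illustration.

```python
from bisect import bisect_right
from collections import Counter

values = [-3.0, 0.0, 6.0, 1.5, -2.0]  # illustrative data
interior_edges = [-1.0, 2.0]          # bins: (-inf, -1), [-1, 2), [2, inf)

# Discretization: every value is replaced by the index of its bin.
bin_ids = [bisect_right(interior_edges, v) for v in values]

# Histogram: only the number of values per bin is kept.
hist = Counter(bin_ids)
counts = [hist[i] for i in range(len(interior_edges) + 1)]

print(bin_ids)  # [0, 1, 2, 1, 0]
print(counts)   # [2, 2, 1]
```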
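KBinsDiscretizer implements different binning strategies, which can be selected with the strategy parameter. The 'uniform' strategy uses constant-width bins. The 'quantile' strategy uses the quantile values of each feature so that the bins are equally populated. The 'kmeans' strategy defines bins based on a k-means clustering procedure performed on each feature independently.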
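To make the contrast between 'uniform' and 'quantile' concrete, the following standard-library sketch computes both kinds of edges for a skewed sample. The helper names are hypothetical, and the quantile rule used here (linear interpolation via `statistics.quantiles` with `method="inclusive"`) is an assumption; the exact rule used by scikit-learn may differ between versions. The 'kmeans' strategy is left out.

```python
from bisect import bisect_right
from statistics import quantiles

values = [0.0, 1.0, 2.0, 3.0, 4.0, 100.0]  # skewed illustrative sample
n_bins = 3


def uniform_edges(data, k):
    """k bins of identical width between min and max."""
    lo, hi = min(data), max(data)
    return [lo + (hi - lo) * i / k for i in range(k + 1)]


def quantile_edges(data, k):
    """k bins holding (roughly) the same number of points."""
    return [min(data)] + quantiles(data, n=k, method="inclusive") + [max(data)]


def counts_per_bin(data, edges):
    """Count points per bin; the maximum goes into the last bin."""
    interior = edges[1:-1]
    counts = [0] * (len(edges) - 1)
    for value in data:
        counts[bisect_right(interior, value)] += 1
    return counts


u_edges = uniform_edges(values, n_bins)
q_edges = quantile_edges(values, n_bins)
print(counts_per_bin(values, u_edges))  # [5, 0, 1]
print(counts_per_bin(values, q_edges))  # [2, 2, 2]
```

With a single outlier, uniform bins leave one bin empty and crowd most points into another, while quantile bins stay balanced at the cost of very unequal widths.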
Examples
- Using KBinsDiscretizer to discretize continuous features
- Feature discretization
- Demonstrating the different strategies of KBinsDiscretizer
class sklearn.preprocessing.KBinsDiscretizer(n_bins=5, *, encode='onehot', strategy='quantile', dtype=None)
Bin continuous data into intervals.
Read more in the User Guide.
New in version 0.20.
- Parameters
  - n_bins : int or array-like of shape (n_features,), default=5
    The number of bins to produce. Raises ValueError if n_bins < 2.
  - encode : {'onehot', 'onehot-dense', 'ordinal'}, default='onehot'
    Method used to encode the transformed result.
    - onehot: Encode the transformed result with one-hot encoding and return a sparse matrix. Ignored features are always stacked to the right.
    - onehot-dense: Encode the transformed result with one-hot encoding and return a dense array. Ignored features are always stacked to the right.
    - ordinal: Return the bin identifier encoded as an integer value.
  - strategy : {'uniform', 'quantile', 'kmeans'}, default='quantile'
    Strategy used to define the widths of the bins.
    - uniform: All bins in each feature have identical widths.
    - quantile: All bins in each feature have the same number of points.
    - kmeans: Values in each bin have the same nearest center of a 1D k-means cluster.
  - dtype : {np.float32, np.float64}, default=None
    The desired data-type for the output. If None, output dtype is consistent with input dtype. Only np.float32 and np.float64 are supported.
    New in version 0.24.
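The effect of encode and dtype on the output can be checked with a short script. This is a sketch that assumes scikit-learn 0.24 or later (for the dtype parameter) and uses strategy='uniform' so that the bins do not depend on how quantiles are computed; the variable names are illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2, 1, -4, -1],
              [-1, 2, -3, -0.5],
              [0, 3, -2, 0.5],
              [1, 4, -1, 2]], dtype=np.float64)

outputs = {}
for encode in ("onehot", "onehot-dense", "ordinal"):
    est = KBinsDiscretizer(n_bins=3, encode=encode, strategy="uniform")
    outputs[encode] = est.fit_transform(X)

# 'onehot'       -> sparse matrix, one column per (feature, bin) pair: 4 * 3 = 12
# 'onehot-dense' -> same layout as a dense ndarray
# 'ordinal'      -> one bin identifier per feature: 4 columns
for name, Xt in outputs.items():
    print(name, type(Xt).__name__, Xt.shape)

# Requesting a float32 output explicitly.
Xt32 = KBinsDiscretizer(
    n_bins=3, encode="ordinal", strategy="uniform", dtype=np.float32
).fit_transform(X)
print(Xt32.dtype)
```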
- Attributes
  - n_bins_ : ndarray of shape (n_features,), dtype=np.int_
    Number of bins per feature. Bins whose width is too small (i.e., <= 1e-8) are removed with a warning.
  - bin_edges_ : ndarray of ndarray of shape (n_features,)
    The edges of each bin. Contains arrays of varying shapes (n_bins_ + 1,), one per feature. Ignored features will have empty arrays.
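The fitted attributes can be inspected directly. In this short sketch the data and the per-feature bin counts are chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.0, 1.0], [-1.0, 2.0], [0.0, 3.0], [1.0, 4.0]])
est = KBinsDiscretizer(n_bins=[3, 2], encode="ordinal", strategy="uniform").fit(X)

# Each feature has its own array of edges; k bins are delimited by k + 1 edges.
for n, edges in zip(est.n_bins_, est.bin_edges_):
    print(n, edges)
```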
Methods
- fit: Fit the estimator.
- fit_transform: Fit to data, then transform it.
- get_params: Get parameters for this estimator.
- inverse_transform: Transform discretized data back to original feature space.
- set_params: Set the parameters of this estimator.
- transform: Discretize the data.
Examples
>>> from sklearn.preprocessing import KBinsDiscretizer
>>> X = [[-2, 1, -4,   -1],
...      [-1, 2, -3, -0.5],
...      [ 0, 3, -2,  0.5],
...      [ 1, 4, -1,    2]]
>>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
>>> est.fit(X)
KBinsDiscretizer(...)
>>> Xt = est.transform(X)
>>> Xt
array([[ 0., 0., 0., 0.],
       [ 1., 1., 1., 0.],
       [ 2., 2., 2., 1.],
       [ 2., 2., 2., 2.]])
Sometimes it may be useful to convert the data back into the original feature space. The inverse_transform function converts the binned data into the original feature space. Each value will be equal to the mean of the two bin edges.
>>> est.bin_edges_[0]
array([-2., -1., 0., 1.])
>>> est.inverse_transform(Xt)
array([[-1.5, 1.5, -3.5, -0.5],
       [-0.5, 2.5, -2.5, -0.5],
       [ 0.5, 3.5, -1.5, 0.5],
       [ 0.5, 3.5, -1.5, 1.5]])
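The midpoint rule can be verified by hand for the first feature. The sketch below uses only the edges and bin identifiers printed above; it illustrates the arithmetic and is not the library's implementation.

```python
edges = [-2.0, -1.0, 0.0, 1.0]  # est.bin_edges_[0] from the example
bin_ids = [0, 1, 2, 2]          # first column of Xt

# Center of each bin: mean of its left and right edge.
centers = [(edges[i] + edges[i + 1]) / 2 for i in range(len(edges) - 1)]

# Inverse transform: look up the center of the bin each sample was assigned to.
restored = [centers[i] for i in bin_ids]
print(restored)  # [-1.5, -0.5, 0.5, 0.5]
```

The result matches the first column of the inverse_transform output. The original values -2, -1, 0 and 1 are not recovered exactly, since binning discards the position of a value within its bin.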