  • Machine Learning with sklearn (8): Feature Engineering (1), Feature Discretization (1): K-bins Discretization

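    Discretization (also known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because it can transform a dataset of continuous attributes into one with only nominal attributes. (Translator's note: nominal attributes are essentially categorical features.)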

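    One-hot encoded discretized features can make a model more expressive while maintaining its interpretability. For instance, preprocessing with a discretizer can introduce non-linearity to linear models.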
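    The claim about non-linearity can be checked with a small experiment. The sketch below is not from the original article: it assumes a synthetic sine-shaped target and compares a plain LinearRegression with the same model fitted on one-hot encoded bins. The data, the bin count of 10 and the variable names are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic 1-D regression problem with a clearly non-linear target (assumption).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()

# Linear model on the raw continuous feature.
linear = LinearRegression().fit(X, y)

# The same linear model on one-hot encoded bins of the feature.
binned = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="uniform"),
    LinearRegression(),
).fit(X, y)

r2_linear = linear.score(X, y)
r2_binned = binned.score(X, y)
print(f"R^2 on raw feature:    {r2_linear:.3f}")
print(f"R^2 on binned feature: {r2_binned:.3f}")
```

    Each bin gets its own coefficient, so the fitted function becomes piecewise constant and can follow the curve, while every coefficient still reads as "the effect of being in this interval".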

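    The KBinsDiscretizer class discretizes features into k bins (with the default strategy the bin edges are chosen by quantiles, not by equal width):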

    >>> import numpy as np
    >>> from sklearn import preprocessing
    >>> X = np.array([[ -3., 5., 15 ],
    ...               [  0., 6., 14 ],
    ...               [  6., 3., 11 ]])
    >>> est = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal').fit(X)

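    By default the output is one-hot encoded into a sparse matrix (see Encoding categorical features); this can be configured with the encode parameter. For each feature, the bin edges and the total number of bins are computed during fit, and together they define the intervals. For the current example, the intervals are: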

    • Feature 1: [-∞,-1), [-1,2), [2,∞)
    • Feature 2: [-∞,5), [5,∞)
    • Feature 3: [-∞,14), [14,∞)
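    These intervals can be reproduced by hand. The sketch below assumes the edges are evenly spaced percentiles of each column computed with linear interpolation (NumPy's default); a fitted estimator stores its own edges in bin_edges_, and newer scikit-learn releases may use a different quantile definition, so exact edge values can differ slightly between versions.

```python
import numpy as np

X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])
n_bins = [3, 2, 2]

# For k bins, take k + 1 evenly spaced percentiles of the column.
edges = [
    np.percentile(X[:, j], np.linspace(0, 100, k + 1))
    for j, k in enumerate(n_bins)
]

for j, e in enumerate(edges, start=1):
    # The first and last edges are the column minimum and maximum;
    # transform treats the outer bins as open-ended.
    print(f"feature {j}: edges = {e}")
```

    The interior edges (-1 and 2 for feature 1, 5 for feature 2, 14 for feature 3) match the interval boundaries listed above.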

    Based on these bin intervals, X is transformed as follows:

    >>> est.transform(X)
    array([[ 0., 1., 1.],
           [ 1., 1., 1.],
           [ 2., 0., 0.]])

    The resulting dataset contains ordinal attributes, which can be further used in a sklearn.pipeline.Pipeline.
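    A minimal sketch of such a pipeline follows. The random data, the step names "discretize" and "clf", and the choice of LogisticRegression as the downstream model are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Toy binary classification data (assumption, not from the article).
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([
    # Step 1: replace each continuous value by its ordinal bin index.
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")),
    # Step 2: fit a classifier on the binned features.
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
pred = pipe.predict(X)

print("bins per feature:", pipe.named_steps["discretize"].n_bins_)
print("training accuracy:", (pred == y).mean())
```

    Putting the discretizer inside the pipeline means the bin edges are learned only from the data passed to fit, which keeps cross-validation free of leakage from held-out folds.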

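    Discretization is similar to constructing histograms for continuous data. However, histograms focus on counting the feature values that fall into particular bins, whereas discretization focuses on assigning each feature value to one of these bins.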
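    The difference can be seen with plain NumPy. In this sketch the sample values and the three equal-width bin edges are made up for illustration: np.histogram returns a count per bin, while np.digitize returns a bin index per value.

```python
import numpy as np

x = np.array([-3., 0., 6., 1., 4.])
edges = np.array([-3., 0., 3., 6.])  # three equal-width bins (assumption)

# Histogram view: how many values fall into each bin.
counts, _ = np.histogram(x, bins=edges)

# Discretization view: which bin each individual value falls into.
# Only the interior edges are passed, so the maximum lands in the last bin.
bin_ids = np.digitize(x, edges[1:-1])

print("counts per bin:", counts)
print("bin id per value:", bin_ids)
```

    Counting the bin ids with np.bincount(bin_ids) recovers the histogram, so the histogram is a summary of the discretized data, not the other way round.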

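    KBinsDiscretizer implements different binning strategies, which can be selected with the strategy parameter. The 'uniform' strategy uses constant-width bins. The 'quantile' strategy uses the quantile values of each feature so that the bins are equally populated. The 'kmeans' strategy defines bins based on a k-means clustering procedure performed on each feature independently.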
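    To make the three strategies concrete, the sketch below fits each one on the same skewed sample and prints the resulting edges and bin populations. The exponential data and n_bins=3 are assumptions chosen to make the contrast visible; exact edge values depend on the scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Right-skewed single feature (assumption): many small values, few large ones.
rng = np.random.RandomState(0)
X = rng.exponential(size=(200, 1))

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    Xt = est.fit_transform(X)
    # Number of samples that ended up in each bin.
    counts = np.bincount(Xt.ravel().astype(int), minlength=3)
    results[strategy] = (est.bin_edges_[0], counts)
    print(f"{strategy:>8}: edges={np.round(est.bin_edges_[0], 2)} counts={counts}")
```

    On skewed data 'uniform' tends to leave the upper bins nearly empty, 'quantile' balances the populations, and 'kmeans' places edges between clusters of values.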


    class sklearn.preprocessing.KBinsDiscretizer(n_bins=5, *, encode='onehot', strategy='quantile', dtype=None)

    Bin continuous data into intervals.

    Read more in the User Guide.

    New in version 0.20.

    n_bins : int or array-like of shape (n_features,), default=5

    The number of bins to produce. Raises ValueError if n_bins < 2.

    encode : {'onehot', 'onehot-dense', 'ordinal'}, default='onehot'

    Method used to encode the transformed result.

    • 'onehot': Encode the transformed result with one-hot encoding and return a sparse matrix. Ignored features are always stacked to the right.
    • 'onehot-dense': Encode the transformed result with one-hot encoding and return a dense array. Ignored features are always stacked to the right.
    • 'ordinal': Return the bin identifier encoded as an integer value.
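    The three encode settings differ only in the container and layout of the output. The following sketch, using a small made-up array and two uniform bins per feature, prints the type and shape returned by each setting.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2., 1.],
              [-1., 2.],
              [ 0., 3.],
              [ 1., 4.]])

outputs = {}
for encode in ("onehot", "onehot-dense", "ordinal"):
    est = KBinsDiscretizer(n_bins=2, encode=encode, strategy="uniform")
    outputs[encode] = est.fit_transform(X)
    print(f"{encode:>12}: {type(outputs[encode]).__name__}, shape={outputs[encode].shape}")

# The sparse and dense one-hot outputs hold the same values.
same = np.array_equal(outputs["onehot"].toarray(), outputs["onehot-dense"])
print("onehot == onehot-dense:", same)
```

    With one-hot encoding every feature expands into one column per bin (here 2 features × 2 bins = 4 columns), while 'ordinal' keeps one column per feature.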

    strategy : {'uniform', 'quantile', 'kmeans'}, default='quantile'

    Strategy used to define the widths of the bins.

    • 'uniform': All bins in each feature have identical widths.
    • 'quantile': All bins in each feature have the same number of points.
    • 'kmeans': Values in each bin have the same nearest center of a 1D k-means cluster.

    dtype : {np.float32, np.float64}, default=None

    The desired data-type for the output. If None, output dtype is consistent with input dtype. Only np.float32 and np.float64 are supported.

    New in version 0.24.
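    A short sketch of the dtype parameter, assuming scikit-learn 0.24 or later and a small made-up float64 input: with the default None the output keeps the input dtype, and passing np.float32 halves the memory used by the result.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2.], [-1.], [0.], [1.]], dtype=np.float64)

# dtype=None (default): output dtype follows the float64 input.
default_out = KBinsDiscretizer(
    n_bins=2, encode="ordinal", strategy="uniform"
).fit_transform(X)

# dtype=np.float32: output is explicitly requested as float32.
f32_out = KBinsDiscretizer(
    n_bins=2, encode="ordinal", strategy="uniform", dtype=np.float32
).fit_transform(X)

print(default_out.dtype, f32_out.dtype)
```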

    n_bins_ : ndarray of shape (n_features,), dtype=np.int_

    Number of bins per feature. Bins whose widths are too small (i.e., <= 1e-8) are removed with a warning.
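    The removal of too-narrow bins is easiest to see on data with many repeated values. In the sketch below (a made-up column where four of five values are identical) several quantile edges coincide, so the fitted n_bins_ ends up smaller than the requested n_bins. The warnings are captured only to keep the output tidy.

```python
import warnings

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Most values are identical, so several quantile edges collapse onto 0.
X = np.array([[0.], [0.], [0.], [0.], [1.]])

est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    est.fit(X)

print("requested n_bins:", est.n_bins)
print("fitted n_bins_:  ", est.n_bins_)
print("remaining edges: ", est.bin_edges_[0])
print("warnings raised: ", len(caught))
```

    Code that sizes downstream arrays should therefore read n_bins_ after fitting instead of relying on the constructor argument.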

    bin_edges_ : ndarray of ndarray of shape (n_features,)

    The edges of each bin. Contains arrays of varying shapes (n_bins_[i] + 1,), one per feature. Ignored features will have empty arrays.


    fit(X[, y])

    Fit the estimator.

    fit_transform(X[, y])

    Fit to data, then transform it.

    get_params([deep])

    Get parameters for this estimator.

    inverse_transform(Xt)

    Transform discretized data back to original feature space.

    set_params(**params)

    Set the parameters of this estimator.

    transform(X)

    Discretize the data.
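    A brief sketch of how these methods fit together, using a made-up two-column array: fit_transform gives the same result as fit followed by transform, and get_params / set_params read and update the constructor arguments.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2., 1.],
              [-1., 2.],
              [ 0., 3.],
              [ 1., 4.]])

est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")

# fit + transform in two steps versus fit_transform in one step.
two_step = est.fit(X).transform(X)
one_step = est.fit_transform(X)

# get_params returns the constructor arguments as a dict.
params = est.get_params()
print(params["n_bins"], params["encode"], params["strategy"])

# set_params updates them in place and returns the estimator itself;
# the estimator must be fitted again before the change takes effect.
returned = est.set_params(n_bins=2)
refit = est.fit_transform(X)
```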


    >>> from sklearn.preprocessing import KBinsDiscretizer
    >>> X = [[-2, 1, -4,   -1],
    ...      [-1, 2, -3, -0.5],
    ...      [ 0, 3, -2,  0.5],
    ...      [ 1, 4, -1,    2]]
    >>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
    >>> est.fit(X)
    KBinsDiscretizer(...)
    >>> Xt = est.transform(X)
    >>> Xt
    array([[ 0., 0., 0., 0.],
           [ 1., 1., 1., 0.],
           [ 2., 2., 2., 1.],
           [ 2., 2., 2., 2.]])

    Sometimes it may be useful to convert the data back into the original feature space. The inverse_transform function converts the binned data into the original feature space. Each value will be equal to the mean of the two bin edges.

    >>> est.bin_edges_[0]
    array([-2., -1.,  0.,  1.])
    >>> est.inverse_transform(Xt)
    array([[-1.5,  1.5, -3.5, -0.5],
           [-0.5,  2.5, -2.5, -0.5],
           [ 0.5,  3.5, -1.5,  0.5],
           [ 0.5,  3.5, -1.5,  1.5]])
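    The "mean of the two bin edges" rule can be verified by recomputing the bin centers from bin_edges_ by hand. This is a sketch of the idea, not the library's internal code; the helper variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-2, 1, -4,   -1],
              [-1, 2, -3, -0.5],
              [ 0, 3, -2,  0.5],
              [ 1, 4, -1,    2]], dtype=float)

est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit(X)
Xt = est.transform(X)

manual = np.empty_like(Xt)
for j, edges in enumerate(est.bin_edges_):
    # Center of each bin = mean of its left and right edge.
    centers = (edges[:-1] + edges[1:]) / 2
    # Look up the center that belongs to each sample's bin index.
    manual[:, j] = centers[Xt[:, j].astype(int)]

print(manual)
print(np.allclose(manual, est.inverse_transform(Xt)))
```

    Because every value in a bin maps back to the same center, the round trip is lossy: the last two rows of the first three columns come back identical even though the original values differed.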
  • Original article: https://www.cnblogs.com/qiu-hua/p/14903384.html