zoukankan      html  css  js  c++  java
  • 机器学习sklearn(九): 特征工程(二)特征离散化(二)特征二值化

    特征二值化 是 将数值特征用阈值过滤得到布尔值 的过程。这对于下游的概率型模型是有用的,它们假设输入数据是多值 伯努利分布(Bernoulli distribution) 。例如这个示例 sklearn.neural_network.BernoulliRBM 。

    即使归一化计数(又名术语频率)和TF-IDF值特征在实践中表现稍好一些,文本处理团队也常常使用二值化特征值(这可能会简化概率估计)。

    相比于 Normalizer ,实用程序类 Binarizer 也被用于 sklearn.pipeline.Pipeline 的早期步骤中。因为每个样本被当做是独立于其他样本的,所以 fit 方法是无用的:

    >>> X = [[ 1., -1.,  2.],
    ...      [ 2.,  0.,  0.],
    ...      [ 0.,  1., -1.]]
    
    >>> binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
    >>> binarizer
    Binarizer(copy=True, threshold=0.0)
    
    >>> binarizer.transform(X)
    array([[ 1.,  0.,  1.],
     [ 1.,  0.,  0.],
     [ 0.,  1.,  0.]])

    也可以为二值化器赋一个阈值:

    >>> binarizer = preprocessing.Binarizer(threshold=1.1)
    >>> binarizer.transform(X)
    array([[ 0.,  0.,  1.],
     [ 1.,  0.,  0.],
     [ 0.,  0.,  0.]])

    相比于 StandardScaler 和 Normalizer 类的情况,预处理模块提供了一个相似的函数 binarize ,以便不需要转换接口时使用。

    稀疏输入

    binarize 以及 Binarizer 接收 来自scipy.sparse的密集类数组数据以及稀疏矩阵作为输入 。

    对于稀疏输入,数据被 转化为压缩的稀疏行形式 (参见 scipy.sparse.csr_matrix )。为了避免不必要的内存复制,推荐在上游选择CSR表示。

    class sklearn.preprocessing.Binarizer(*threshold=0.0copy=True)

    Binarize data (set feature values to 0 or 1) according to a threshold.

    Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

    Binarization is a common operation on text count data where the analyst can decide to only consider the presence or absence of a feature rather than a quantified number of occurrences for instance.

    It can also be used as a pre-processing step for estimators that consider boolean random variables (e.g. modelled using the Bernoulli distribution in a Bayesian setting).

    Read more in the User Guide.

    Parameters
    thresholdfloat, default=0.0

    Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.

    copybool, default=True

    set to False to perform inplace binarization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).

    Methods

    fit(X[, y])

    Do nothing and return the estimator unchanged.

    fit_transform(X[, y])

    Fit to data, then transform it.

    get_params([deep])

    Get parameters for this estimator.

    set_params(**params)

    Set the parameters of this estimator.

    transform(X[, copy])

    Binarize each element of X.

    Examples

    >>> from sklearn.preprocessing import Binarizer
    >>> X = [[ 1., -1.,  2.],
    ...      [ 2.,  0.,  0.],
    ...      [ 0.,  1., -1.]]
    >>> transformer = Binarizer().fit(X)  # fit does nothing.
    >>> transformer
    Binarizer()
    >>> transformer.transform(X)
    array([[1., 0., 1.],
           [1., 0., 0.],
           [0., 1., 0.]])
  • 相关阅读:
    (转)使用介质设备安装 AIX 以通过 HMC 安装分区
    (转)在 VMware 中安装 HMC
    (转)50-100台中小规模网站集群搭建实战项目(超实用企业集群)
    (转)awk数组详解及企业实战案例
    (转) IP子网划分
    教你如何迅速秒杀掉:99%的海量数据处理面试题(转)
    PHP对大文件的处理思路
    十道海量数据处理面试题与十个方法大总结
    mysql查询更新时的锁表机制分析
    mysql数据库问答
  • 原文地址:https://www.cnblogs.com/qiu-hua/p/14903434.html
Copyright © 2011-2022 走看看