zoukankan      html  css  js  c++  java
  • 机器学习sklearn(五): 数据处理(二)缺失值处理

    来源 https://www.cnblogs.com/B-Hanan/articles/12774433.html

    1 单变量缺失

    import numpy as np
    from sklearn.impute import SimpleImputer

    help(SimpleImputer):

    class SimpleImputer(_BaseImputer):Imputation transformer for completing missing values.

    Parameters(参数设置)

    missing_values(缺失值类型) : number, string, np.nan (default) or None

    The placeholder for the missing values. All occurrences of missing_values will be imputed.

    strategy : string, default='mean'

    The imputation strategy.

    • If "mean", then replace missing values using the mean along each column. Can only be used with numeric data.

    • If "median", then replace missing values using the median along each column. Can only be used with numeric data.

    • If "most_frequent", then replace missing using the most frequent value along each column. Can be used with strings or numeric data.

    • If "constant", then replace missing values with fill_value. Can be used with strings or numeric data.strategy="constant" for fixed value imputation.

    fill_value : string or numerical value, default=None

    When strategy == "constant", fill_value is used to replace all occurrences of missing_values.If left to the default, fill_value will be 0 when imputing numericaldata and "missing_value" for strings or object data types.

    imp=SimpleImputer(missing_values=np.nan,strategy='mean')
    imp.fit([[1,2],[np.nan,3],[7,6]])
    SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                  missing_values=nan, strategy='mean', verbose=0)
    ##SimpleImputer类支持稀疏矩阵
    import scipy.sparse as sp
    X=sp.csc_matrix([[1,2],[0,-1],[8,4]])
    imp=SimpleImputer(missing_values=-1,strategy='mean')
    imp.fit(X)
    SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                  missing_values=-1, strategy='mean', verbose=0)
    X_test=sp.csc_matrix([[-1,2],[6,-1],[7,6]])
    print(imp.transform(X_test))
      (0, 0)    3.0
      (1, 0)    6.0
      (2, 0)    7.0
      (0, 1)    2.0
      (1, 1)    3.0
      (2, 1)    6.0
    print(imp.transform(X_test).toarray())
    [[3. 2.]
     [6. 3.]
     [7. 6.]]
    
    import pandas as pd
    df=pd.DataFrame([['a','x'],
                    [np.nan,'y'],
                    ['a',np.nan],
                    ['b','y']],dtype='category')
    
    df
     01
    0 a x
    1 NaN y
    2 a NaN
    3 b y
    imp=SimpleImputer(strategy='most_frequent')
    print(imp.fit_transform(df))
    [['a' 'x']
     ['a' 'y']
     ['a' 'y']
     ['b' 'y']]

    2 多元特征估计

    使用IterativeImputer类,它将每一个特征的缺失值建模为其它特性的函数,并使用该估计值进行估计。
    工作模式:迭代循环
    在每一步,都指定一个功能列出作为输出

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer
    imp=IterativeImputer(max_iter=10,random_state=0)
    imp.fit([[1,2],[3,6],[4,8],[np.nan,3],[7,np.nan]])
    IterativeImputer(add_indicator=False, estimator=None,
                     imputation_order='ascending', initial_strategy='mean',
                     max_iter=10, max_value=None, min_value=None,
                     missing_values=nan, n_nearest_features=None, random_state=0,
                     sample_posterior=False, skip_complete=False, tol=0.001,
                     verbose=0)
    imp.transform([[1,2],[3,6],[4,8],[np.nan,3],[7,np.nan]])
    array([[ 1.        ,  2.        ],
           [ 3.        ,  6.        ],
           [ 4.        ,  8.        ],
           [ 1.50004509,  3.        ],
           [ 7.        , 14.00004135]])
    X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
    print(imp.transform(X_test))
    [[ 1.00007297  2.        ]
     [ 6.         12.00002754]
     [ 2.99996145  6.        ]]

    3 K-近邻法

    这个KNNImputer类提供了使用k-最近邻方法填充缺失值的估算。默认情况下,支持缺失值的欧氏距离度量,nan_euclidean_distances,用于查找最近的邻居。每个缺失的特性都使用n_neighbors具有该功能值的最近邻居。

    from sklearn.impute import KNNImputer
    

    help(KNNImputer):

    Imputation for completing missing values using k-Nearest Neighbors.

    (使用k近邻方法补全缺失值。)

    Each sample's missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.

    (每个样本的缺失值是使用在训练集中找到的最近邻居的‘n_neighbors’的平均值来推算的.如果两个都不缺少的要素都不接近,则两个样本是接近的。)

    Parameters:

    missing_values : number, string, np.nan or None, default=np.nan

    The placeholder for the missing values. All occurrences of missing_values will be imputed.

    n_neighbors : int, default=5 Number of neighboring samples to use for imputation.

    weights : {'uniform', 'distance'} or callable, default='uniform' Weight function used in prediction.

    import numpy as np
    from sklearn.impute import KNNImputer
    nan = np.nan
    X = [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
    print(X)
    imputer = KNNImputer(n_neighbors=2, weights="uniform")
    imputer.fit_transform(X)
    
    [[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]]
    
    
    
    
    
    array([[1. , 2. , 4. ],
           [3. , 4. , 3. ],
           [5.5, 6. , 5. ],
           [8. , 8. , 7. ]])

    4 标记推算值

    这个MissingIndicator转换器用于将数据集转换为相应的二进制矩阵,以指示数据集中是否存在缺失值。这种转换与计算相结合是很有用的。在使用估算时,保存有关哪些值丢失的信息可以提供信息。

    from sklearn.impute import MissingIndicator
    

    help(MissingIndicator):

    class MissingIndicator(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)

    Binary indicators for missing values(缺失值的二进制指示符).

    MissingIndicator(missing_values=nan, features='missing-only', sparse='auto', error_on_new=True)

    X = np.array([[-1, -1, 1, 3],
                  [4, -1, 0, -1],
                  [8, -1, 1, 0]])
    indicator = MissingIndicator(missing_values=-1)
    mask_missing_values_only = indicator.fit_transform(X)
    mask_missing_values_only
    
    array([[ True,  True, False],
           [False,  True,  True],
           [False,  True, False]])
    
    #只返回存在缺失值的列的索引
    indicator.features_
    
    array([0, 1, 2, 3])
    
    #这个features参数可以设置为'all'若要返回所有特征,无论它们是否包含缺失的值
    indicator = MissingIndicator(missing_values=-1, features="all")
    mask_all = indicator.fit_transform(X)
    mask_all
    
    array([[ True,  True, False, False],
           [False,  True, False,  True],
           [False,  True, False, False]])
    
    indicator.features_
    #特征所在的列索引
    
    array([0, 1, 2, 3])
  • 相关阅读:
    Tomcat详解系列(3)
    Tomcat详解系列(2)
    Tomcat详解系列(1)
    常用开发库
    单元测试
    [MongoDB知识体系] 一文全面总结MongoDB知识体系
    问题记录:net::ERR_CERT_AUTHORITY_INVALID
    CSS+DIV特色开关按钮
    Jquery的Ajax简易优化思路
    CSS+DIV简易灯泡案例
  • 原文地址:https://www.cnblogs.com/qiu-hua/p/14891589.html
Copyright © 2011-2022 走看看