  • Handling Missing Values

    1) A Simple Option: Drop Columns with Missing Values

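    If those columns contain useful information (in the positions where values are not missing), the model loses access to that information when the columns are dropped. In addition, if your test data has missing values in places where your training data does not, this will cause an error.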

    data_without_missing_values = original_data.dropna(axis=1)
    
    # Apply the same operation to both the training and test sets
    cols_with_missing = [col for col in original_data.columns 
                                     if original_data[col].isnull().any()]
    reduced_original_data = original_data.drop(cols_with_missing, axis=1)
    reduced_test_data = test_data.drop(cols_with_missing, axis=1)
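
    To illustrate the pitfall described above, here is a minimal sketch on made-up data (the frames `train` and `test` and all their values are hypothetical). The columns to drop are chosen from the training data only, so a column that is complete in training but has a gap in test survives the drop with its NaN intact.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' has a gap in train, 'c' has a gap only in test.
train = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                      "b": [4.0, np.nan, 6.0],
                      "c": [7.0, 8.0, 9.0]})
test = pd.DataFrame({"a": [1.5, 2.5],
                     "b": [4.5, 5.5],
                     "c": [np.nan, 8.5]})

# Choose the columns to drop by looking at the training data only.
cols_with_missing = [col for col in train.columns if train[col].isnull().any()]

reduced_train = train.drop(cols_with_missing, axis=1)
reduced_test = test.drop(cols_with_missing, axis=1)

print(cols_with_missing)            # columns removed from both frames
print(reduced_test.isnull().sum())  # remaining NaN count per test column
```

    The training frame is now clean, but `reduced_test` still holds a NaN in column 'c'. An estimator that does not accept NaN input would fail at prediction time, which is one reason imputation is usually the safer choice.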

    2) A Better Option: Imputation

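    By default, imputation fills each missing entry with the mean of its column. Statisticians have researched more complex strategies, but those strategies typically give no benefit once you plug the results into sophisticated machine learning models.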
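
    As a small worked example of the default behavior, the sketch below runs `SimpleImputer` on a hand-made array (the values are invented for illustration). Each NaN is replaced by the mean of the observed values in the same column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented 3x2 array with one gap in each column.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

imputer = SimpleImputer()  # strategy="mean" is the default
X_imputed = imputer.fit_transform(X)

# Column means learned during fit: (1 + 3) / 2 and (10 + 20) / 2.
print(imputer.statistics_)
print(X_imputed)
```

    Other built-in strategies, such as `strategy="median"` or `strategy="most_frequent"`, can be selected through the same parameter.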

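    One (of many) nice things about imputation is that it can be included in a scikit-learn Pipeline. Pipelines simplify model building, model validation, and model deployment.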

    from sklearn.impute import SimpleImputer
    my_imputer = SimpleImputer()
    data_with_imputed_values = my_imputer.fit_transform(original_data)
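
    The Pipeline point can be sketched as follows. This is an illustration on synthetic data, not the article's dataset: the arrays `X` and `y`, the share of missing entries, and the model settings are all assumptions. Putting the imputer inside the pipeline means the column means are re-learned on the training part of each cross-validation fold, so nothing leaks from the held-out part.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression data (assumption, for illustration only).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out roughly 10% of the entries to simulate missing values.
mask = rng.rand(*X.shape) < 0.1
X[mask] = np.nan

# Imputation and model in one estimator.
pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

# scikit-learn reports negated MAE, so flip the sign for readability.
scores = cross_val_score(pipeline, X, y, cv=5,
                         scoring="neg_mean_absolute_error")
print("Mean MAE across folds:", -scores.mean())
```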

    3) An Extension To Imputation

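    Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which were not collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.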

    # make copy to avoid changing original data (when Imputing)
    new_data = original_data.copy()
    
    # make new columns indicating what will be imputed
    cols_with_missing = [col for col in new_data.columns 
                                     if new_data[col].isnull().any()]
    for col in cols_with_missing:
        new_data[col + '_was_missing'] = new_data[col].isnull()
    
    # Imputation (keep the column names, including the new indicator columns)
    my_imputer = SimpleImputer()
    new_data = pd.DataFrame(my_imputer.fit_transform(new_data),
                            columns=new_data.columns,
                            index=new_data.index)
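
    As an alternative to building the indicator columns by hand, `SimpleImputer` accepts `add_indicator=True`, which appends one indicator column for each feature that had missing values during fit. The sketch below is a variant of the approach above, not the article's own code; the frame `df` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: 'rooms' and 'area' have gaps, 'age' is complete.
df = pd.DataFrame({"rooms": [2.0, np.nan, 4.0],
                   "area": [50.0, 80.0, np.nan],
                   "age": [10.0, 20.0, 30.0]})

# Impute with the column mean and append missing-value indicators.
imputer = SimpleImputer(add_indicator=True)
values = imputer.fit_transform(df)

# indicator_.features_ holds the indices of columns that had missing values.
missing_cols = df.columns[imputer.indicator_.features_]
new_columns = list(df.columns) + [col + "_was_missing" for col in missing_cols]

result = pd.DataFrame(values, columns=new_columns, index=df.index)
print(result)
```

    The indicator columns come back as 0.0/1.0 floats rather than booleans, which most estimators treat the same way.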

    Example (Comparing All Solutions)

    import pandas as pd
    
    # Load data
    melb_data = pd.read_csv('../input/melbourne-housing-snapshot/melb_data.csv')
    
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split
    
    melb_target = melb_data.Price
    melb_predictors = melb_data.drop(['Price'], axis=1)
    
    # For the sake of keeping the example simple, we'll use only numeric predictors. 
    melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])
    
    X_train, X_test, y_train, y_test = train_test_split(melb_numeric_predictors, 
                                                        melb_target,
                                                        train_size=0.7, 
                                                        test_size=0.3, 
                                                        random_state=0)
    
    def score_dataset(X_train, X_test, y_train, y_test):
        model = RandomForestRegressor()
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        return mean_absolute_error(y_test, preds)
    
    
    # Get model score from dropping columns with missing values
    cols_with_missing = [col for col in X_train.columns 
                                     if X_train[col].isnull().any()]
    reduced_X_train = X_train.drop(cols_with_missing, axis=1)
    reduced_X_test = X_test.drop(cols_with_missing, axis=1)
    print("Mean Absolute Error from dropping columns with Missing Values:")
    print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))
    
    # Get model score from imputation
    from sklearn.impute import SimpleImputer
    
    my_imputer = SimpleImputer()
    imputed_X_train = my_imputer.fit_transform(X_train)
    imputed_X_test = my_imputer.transform(X_test)
    print("Mean Absolute Error from Imputation:")
    print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))
    
    # Get score from imputation with extra columns showing what was imputed
    imputed_X_train_plus = X_train.copy()
    imputed_X_test_plus = X_test.copy()
    
    cols_with_missing = [col for col in X_train.columns 
                                     if X_train[col].isnull().any()]
    for col in cols_with_missing:
        imputed_X_train_plus[col + '_was_missing'] = imputed_X_train_plus[col].isnull()
        imputed_X_test_plus[col + '_was_missing'] = imputed_X_test_plus[col].isnull()
    
    # Imputation (fit on the training set, reuse the fitted imputer on the test set)
    my_imputer = SimpleImputer()
    imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
    imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)
    print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
    print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))
  • Original article: https://www.cnblogs.com/hotsnow/p/9477891.html