zoukankan      html  css  js  c++  java
  • [sklearn] 官方例程-Imputing missing values before building an estimator 随机填充缺失值

    官方链接:http://scikit-learn.org/dev/auto_examples/plot_missing_values.html#sphx-glr-auto-examples-plot-missing-values-py

    该例程是为了说明对缺失值的随即填充训练出的estimator表现优于直接删掉有缺失字段值的estimator

    例程代码及附加注释如下:

    ---------------------------------------------

    import numpy as np
    
    from sklearn.datasets import load_boston
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Imputer
    from sklearn.model_selection import cross_val_score
    
    # 设定随机数种子 rng = np.random.RandomState(0) # 载入数据 波士顿房价 dataset = load_boston() X_full, y_full = dataset.data, dataset.target n_samples = X_full.shape[0] n_features = X_full.shape[1] # Estimate the score on the entire dataset, with no missing values
    # 随机森林--回归 random_state-随机种子 n_estimator 森林里树的数目 estimator = RandomForestRegressor(random_state=0, n_estimators=100)
    # 交叉验证分类器的准确率 score = cross_val_score(estimator, X_full, y_full).mean() print("Score with the entire dataset = %.2f" % score) # Add missing values in 75% of the lines missing_rate = 0.75 n_missing_samples = int(np.floor(n_samples * missing_rate))
    # hstack 把两个数组拼接起来-行数需要一致 missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples, dtype=np.bool), np.ones(n_missing_samples, dtype=np.bool)))
    # 打乱随机数组顺序
    rng.shuffle(missing_samples) missing_features = rng.randint(0, n_features, n_missing_samples) # Estimate the score without the lines containing missing values X_filtered = X_full[~missing_samples, :] y_filtered = y_full[~missing_samples] estimator = RandomForestRegressor(random_state=0, n_estimators=100) score = cross_val_score(estimator, X_filtered, y_filtered).mean() print("Score without the samples containing missing values = %.2f" % score) # Estimate the score after imputation of the missing values X_missing = X_full.copy() X_missing[np.where(missing_samples)[0], missing_features] = 0 y_missing = y_full.copy() estimator = Pipeline([("imputer", Imputer(missing_values=0, strategy="mean", axis=0)), ("forest", RandomForestRegressor(random_state=0, n_estimators=100))]) score = cross_val_score(estimator, X_missing, y_missing).mean() print("Score after imputation of the missing values = %.2f" % score)

    ---------------------------------------------------
    补充:
    A. numpy.where()用法:

  • 相关阅读:
    AGC 044 A
    example
    python3遇到的问题
    构建开发环境
    pandas处理数据
    pandas.DataFrame对象解析
    pandas再次学习
    监督式学习
    机器学习的基础概念
    赖世雄老师的音标课,旋元佑老师的语法书
  • 原文地址:https://www.cnblogs.com/Arborday/p/8327426.html
Copyright © 2011-2022 走看看