
    Data Normalization


    Normalization maps all data onto the same scale.

    First of all, why do we need data normalization? Consider a simple example with two features on very different scales, say a measurement around 1 and a time span around 200 days. The distance between samples is then dominated by the time feature: when a sample such as [1, 200] is fed into the model, the 200 essentially drowns out the 1. We therefore normalize the data, for example by converting the days into a fraction of a year: 200/365 = 0.5479, 100/365 = 0.2740.

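    To make this concrete, here is a minimal sketch of the distance calculation. The second sample [5, 100] is made up just to pair with the [1, 200] example above:

    import numpy as np
    
    # two samples of the form [small-scale feature, days]; values are illustrative only
    a = np.array([1.0, 200.0])
    b = np.array([5.0, 100.0])
    
    # before scaling, the Euclidean distance is dominated by the "days" feature
    print(np.linalg.norm(a - b))                 # ~100.08
    
    # after converting days into a fraction of a year, both features contribute
    a_scaled = np.array([1.0, 200.0 / 365])
    b_scaled = np.array([5.0, 100.0 / 365])
    print(np.linalg.norm(a_scaled - b_scaled))   # ~4.01
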

    1. Min-Max Normalization

    Min-max normalization (normalization): maps all data into the range [0, 1]. It is suitable when the distribution has clear boundaries, and it is strongly affected by outliers.

    x_scale = (x - x_min) / (x_max - x_min)

    import numpy as np
    import matplotlib.pyplot as plt
    
    # 100 random integers in [0, 100)
    x = np.random.randint(0, 100, size=100)
    x
    

    Output:

    array([84, 18, 75, 75, 78, 30, 39, 33, 29, 30, 48, 77, 54, 30,  1, 32, 91,
           60, 73, 78, 89, 16, 71, 47, 87, 43, 24, 67, 70, 50, 58, 56, 69, 11,
           19, 97, 64, 53, 37, 18, 84, 77,  6,  3, 91, 48, 14,  6, 70, 36, 93,
           43, 78, 78, 73, 18, 96, 58, 77, 78, 29, 96, 75, 59, 58, 19, 65, 90,
           67, 73, 72,  1, 89, 70, 59, 96, 42, 73, 58,  8, 61, 65, 78, 86, 98,
           94, 52,  1, 59, 86, 44, 28, 87,  2, 91, 75, 19, 91, 46, 92])
    
    # min-max normalize the whole vector
    (x - np.min(x)) / (np.max(x) - np.min(x))
    

    Output:

    array([0.8556701 , 0.17525773, 0.7628866 , 0.7628866 , 0.79381443,
           0.29896907, 0.39175258, 0.32989691, 0.28865979, 0.29896907,
           0.48453608, 0.78350515, 0.54639175, 0.29896907, 0.        ,
           0.31958763, 0.92783505, 0.60824742, 0.74226804, 0.79381443,
           0.90721649, 0.15463918, 0.72164948, 0.4742268 , 0.88659794,
           0.43298969, 0.2371134 , 0.68041237, 0.71134021, 0.50515464,
           0.58762887, 0.56701031, 0.70103093, 0.10309278, 0.18556701,
           0.98969072, 0.64948454, 0.53608247, 0.37113402, 0.17525773,
           0.8556701 , 0.78350515, 0.05154639, 0.02061856, 0.92783505,
           0.48453608, 0.13402062, 0.05154639, 0.71134021, 0.36082474,
           0.94845361, 0.43298969, 0.79381443, 0.79381443, 0.74226804,
           0.17525773, 0.97938144, 0.58762887, 0.78350515, 0.79381443,
           0.28865979, 0.97938144, 0.7628866 , 0.59793814, 0.58762887,
           0.18556701, 0.65979381, 0.91752577, 0.68041237, 0.74226804,
           0.73195876, 0.        , 0.90721649, 0.71134021, 0.59793814,
           0.97938144, 0.42268041, 0.74226804, 0.58762887, 0.07216495,
           0.6185567 , 0.65979381, 0.79381443, 0.87628866, 1.        ,
           0.95876289, 0.5257732 , 0.        , 0.59793814, 0.87628866,
           0.44329897, 0.27835052, 0.88659794, 0.01030928, 0.92783505,
           0.7628866 , 0.18556701, 0.92783505, 0.46391753, 0.93814433])
    
    X = np.random.randint(0, 100, (50, 2))
    X[:10, :]
    X = np.array(X, dtype=float)
    # min-max normalize each column independently
    X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
    X[:, 0]
    X[:, 1] = (X[:, 1] - np.min(X[:, 1])) / (np.max(X[:, 1]) - np.min(X[:, 1]))
    X[:, 1]
    X[:10, :]
    plt.scatter(X[:, 0], X[:, 1])
    plt.show()
    # the mean and std of min-max scaled data are not fixed values
    np.mean(X[:, 0])
    np.std(X[:, 0])
    np.mean(X[:, 1])
    np.std(X[:, 1])
    

    2. Mean-Variance Normalization (Standardization)

    Mean-variance normalization (standardization): transforms all data so that each feature has mean 0 and variance 1. It is suitable when the distribution has no clear boundaries and may contain extreme values.

    x_scale = (x - x_mean) / s, where s is the standard deviation

    x2 = np.random.randint(0, 100, (50, 2))
    x2 = np.array(x2, dtype=float)
    # mean-variance normalize each column independently
    x2[:, 0] = (x2[:, 0] - np.mean(x2[:, 0])) / np.std(x2[:, 0])
    x2[:, 1] = (x2[:, 1] - np.mean(x2[:, 1])) / np.std(x2[:, 1])
    plt.scatter(x2[:, 0], x2[:, 1])
    plt.show()
    # each column now has mean close to 0 and standard deviation close to 1
    np.mean(x2[:, 0])
    np.std(x2[:, 0])
    np.mean(x2[:, 1])
    np.std(x2[:, 1])
    

    3. Should the training set and the test set both be normalized?

    Before training a model on a dataset, we first split it into a training set and a test set. If the data needs to be normalized, we can easily compute the mean and variance, or the maximum and minimum, from the training set. But what about the test set? How should it be normalized?

    Normally, the test set simulates the real environment, and in a real environment we may never see all of the test data in advance. So when a new sample needs to be predicted, we must normalize it with the training set's mean and variance (or minimum and maximum). scikit-learn wraps this in Scaler classes, which store key statistics of the training set such as its mean and variance.

    import numpy as np
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler	
    
    iris = datasets.load_iris()
    
    x = iris.data
    y = iris.target
    x[:10, :]
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=666)
    # fit the scaler on the training set only
    standarscaler = StandardScaler()
    standarscaler.fit(x_train)
    standarscaler.mean_
    standarscaler.scale_
    # transform both the training set and the test set with the training-set statistics
    x_train = standarscaler.transform(x_train)
    x_train
    x_test_standard = standarscaler.transform(x_test)
    x_test_standard
    

    Next, let's test the performance of KNN after normalization:

    from sklearn.neighbors import KNeighborsClassifier
    knn_clf = KNeighborsClassifier()
    # train on the normalized training set, then evaluate on the normalized test set
    knn_clf.fit(x_train, y_train)
    knn_clf.score(x_test_standard, y_test)
    

    Output: 1.0

    What if the training set is normalized but the test set is not?

    knn_clf.score(x_test, y_test)
    

    Output: 0.3333333333333333

    4. Writing a mean-variance scaler ourselves (object-oriented)

    from sklearn.preprocessing import StandardScaler  # the ready-made sklearn version
    
    import numpy as np
    
    
    class StandardScale(object):
    
        def __init__(self):
            self.mean_ = None
            self.scale_ = None
    
        def fit(self, x):
            "根据训练集x获得数据的均值和方差"
            assert x.ndim == 2, "the dimension of x must be 2"
    
            self.mean_ = np.array([np.mean(x[:, i]) for i in range(x.shape[1])])
            self.scale_ = np.array([np.std(x[:, i]) for i in range(x.shape[1])])
    
            return self
    
        def transform(self, x):
            "将x进行均值方差归一化"
            assert x.ndim == 2, "the dimension of x must be 2"
            assert self.mean_ is not None and self.scale_ is not None, 
            "must fit before transform"
            assert x.shape[1] == len(self.mean_), 
            "the feature number of x must be equal to mean_ and scale_"
    
            res_x = np.empty(shape=x.shape, dtype=float)
            for col in range(x.shape[1]):
                res_x[:, col] = (x[:, col] - self.mean_[col]) / self.scale_[col]
    
            return res_x
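
    A quick sanity check of the class above (a minimal sketch; the random data is generated purely for illustration): after fit and transform, each column should have mean close to 0 and standard deviation close to 1.

    x = np.random.randint(0, 100, (50, 2)).astype(float)
    
    my_scaler = StandardScale()
    my_scaler.fit(x)
    x_std = my_scaler.transform(x)
    
    print(np.mean(x_std[:, 0]), np.std(x_std[:, 0]))  # ~0.0 and ~1.0
    print(np.mean(x_std[:, 1]), np.std(x_std[:, 1]))  # ~0.0 and ~1.0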
    

    5. Writing a min-max scaler ourselves (object-oriented)

    from sklearn.preprocessing import MinMaxScaler  # the ready-made sklearn version
    
    import numpy as np
    
    class MinMaxScale(object):
    
        def __init__(self):
            self.min_ = None
            self.max_ = None

        def fit(self, x):
            "Compute the per-feature minimum and maximum of the training set x"
            assert x.ndim == 2, "the dimension of x must be 2"

            self.min_ = np.array([np.min(x[:, i]) for i in range(x.shape[1])])
            self.max_ = np.array([np.max(x[:, i]) for i in range(x.shape[1])])
    
            return self
    
        def transform(self, x):
            "将x进行均值方差归一化"
            assert x.ndim == 2, "the dimension of x must be 2"
            assert self.mean_ is not None and self.scale_ is not None, 
            "must fit before transform"
            assert x.shape[1] == len(self.mean_), 
            "the feature number of x must be equal to mean_ and scale_"
    
            res_x = np.empty(shape=x.shape, dtype=float)
            for col in range(x.shape[1]):
                res_x[:, col] = (x[:, col] - self.min_[col]) / (self.max_[col] - self.min_[col])
    
            return res_x
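
    Similarly, a minimal sanity check of MinMaxScale (random data for illustration only; it assumes each column has at least two distinct values): every transformed column should lie in [0, 1].

    x = np.random.randint(0, 100, (50, 2)).astype(float)
    
    mm_scaler = MinMaxScale()
    mm_scaler.fit(x)
    x_mm = mm_scaler.transform(x)
    
    print(np.min(x_mm, axis=0))  # [0. 0.]
    print(np.max(x_mm, axis=0))  # [1. 1.]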
    

    In fact, there are more ways to normalize data; this will be expanded on later.
