
Simple Linear Regression


The idea is simple, yet it underlies many powerful nonlinear models, offers excellent interpretability, and is backed by solid mathematical theory.

Its essence: find a straight line that best fits the relationship between the sample feature and the sample output label.


It is called *simple* linear regression because there is only one sample feature. Suppose we have found the best-fit line $y = ax + b$. For each sample point $x^{(i)}$, the line predicts $\hat{y}^{(i)} = a x^{(i)} + b$, while the true value is $y^{(i)}$. We want the gap between prediction and truth, $y^{(i)} - \hat{y}^{(i)}$, to be as small as possible; since the difference can be negative, we square it. The squared difference has the added advantage of being differentiable everywhere.

Goal: make $\sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^{2}$ as small as possible.

Goal: find $a$ and $b$ such that $\sum_{i=1}^{m} \left(y^{(i)} - a x^{(i)} - b\right)^{2}$ is as small as possible.

In machine learning, we call an objective like this a loss function (when we minimize it) or a utility function (when we maximize it). By analyzing the problem we determine the loss or utility function, and by optimizing that objective we finally obtain the machine learning model.

In fact, the objective above is a textbook least squares problem: minimizing the sum of squared errors.


I. The Least Squares Method

Goal: find $a$ and $b$ such that $J(a, b) = \sum_{i=1}^{m} \left(y^{(i)} - a x^{(i)} - b\right)^{2}$ is as small as possible.

$\dfrac{\partial J(a, b)}{\partial a} = 0, \qquad \dfrac{\partial J(a, b)}{\partial b} = 0$

Take the partial derivatives of $J(a, b)$ with respect to $a$ and $b$, and set each to zero. Start with $b$, whose computation is simpler.

$\dfrac{\partial J}{\partial b} = \sum_{i=1}^{m} 2\left(y^{(i)} - a x^{(i)} - b\right)(-1) = 0 \;\Rightarrow\; \sum_{i=1}^{m} y^{(i)} - a\sum_{i=1}^{m} x^{(i)} - mb = 0 \;\Rightarrow\; b = \bar{y} - a\bar{x}$

Next, take the partial derivative with respect to $a$.

$\dfrac{\partial J}{\partial a} = \sum_{i=1}^{m} 2\left(y^{(i)} - a x^{(i)} - b\right)\left(-x^{(i)}\right) = 0$

Substituting $b = \bar{y} - a\bar{x}$ and solving for $a$:

$a = \dfrac{\sum_{i=1}^{m} x^{(i)} y^{(i)} - m\,\bar{x}\,\bar{y}}{\sum_{i=1}^{m} \left(x^{(i)}\right)^{2} - m\,\bar{x}^{2}}$

At this point the derivation is essentially done: $a$ and $b$ each have a closed-form formula, and we could start coding right away. In practice, though, it is worth simplifying $a$ one step further.

$a = \dfrac{\sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^{m} \left(x^{(i)} - \bar{x}\right)^{2}}$
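
To see why the two forms of $a$ agree, expand the new numerator (the denominator works the same way), using $\sum_{i=1}^{m} x^{(i)} = m\bar{x}$ and $\sum_{i=1}^{m} y^{(i)} = m\bar{y}$:

$\sum_{i=1}^{m}\left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right) = \sum_{i=1}^{m} x^{(i)} y^{(i)} - \bar{x}\sum_{i=1}^{m} y^{(i)} - \bar{y}\sum_{i=1}^{m} x^{(i)} + m\bar{x}\bar{y} = \sum_{i=1}^{m} x^{(i)} y^{(i)} - m\,\bar{x}\,\bar{y}$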

Writing $a$ in this form lets us compute it with vectorized operations instead of an explicit loop, which cuts the running time substantially (the benchmark in Section IV shows roughly an 8x speedup).

II. Implementing Simple Linear Regression

Let's prepare some simple data.

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1., 2., 3., 4., 5.])
    y = np.array([1., 3., 2., 3., 5.])

    plt.scatter(x, y)
    plt.axis([0, 6, 0, 6])
    plt.show()

    x_mean = np.mean(x)
    y_mean = np.mean(y)

    # accumulate the numerator and denominator of the closed-form formula for a
    up = 0.0
    down = 0.0
    for x_i, y_i in zip(x, y):
        up += (x_i - x_mean) * (y_i - y_mean)
        down += (x_i - x_mean) ** 2
    a = up / down
    b = y_mean - a * x_mean
    

This gives us $a$ and $b$, and with them the line $y = ax + b$.

    y_hat = a * x + b
    plt.scatter(x, y)
    plt.plot(x, y_hat, 'r')
    plt.axis([0, 6, 0, 6])
    plt.show()
    # now test a new data point
    x_new = 6
    y_predict = a * x_new + b
    

Output: 5.2
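
As a quick cross-check (not part of the original post), NumPy's built-in least squares fit should recover the same line:

    # np.polyfit with degree 1 fits the same least squares line
    print(np.polyfit(x, y, 1))  # [0.8 0.4] -> the same a and b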

III. An Object-Oriented Implementation

    import numpy as np
    
    class SimpleLinearRegreesion1(object):
    
        def __init__(self):
            self.a_ = None
            self.b_ = None
    
        def fit(self, x_train, y_train):
            assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
            assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"
    
            x_mean = np.mean(x_train)
            y_mean = np.mean(y_train)
    
            up = 0
            down = 0
            for x_i, y_i in zip(x_train, y_train):
                up += (x_i - x_mean) * (y_i - y_mean)
                down += (x_i - x_mean) ** 2
    
            self.a_ = up / down
            self.b_ = y_mean - self.a_ * x_mean
            return self
    
        def predict(self, x_predict):
            """Predict for a vector of inputs, one prediction per element."""
            assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
            assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
            return np.array([self._predict(x) for x in x_predict])

        def _predict(self, x_single):
            """Predict a single value."""
            return self.a_ * x_single + self.b_
    
        def __repr__(self):
            return "SimpleLinearRegression"
    
    
    if __name__ == '__main__':
        import numpy as np
    
        x = np.array([1., 2., 3., 4., 5.])
        y = np.array([1., 3., 2., 3., 5.])
        x_new = 6
    
        reg1 = SimpleLinearRegreesion1()
        reg1.fit(x, y)
        y_predict = reg1.predict(np.array([x_new]))
        print(y_predict)
        print(reg1.a_)
        print(reg1.b_)
    

IV. Vectorized Computation

    import numpy as np
    
    class SimpleLinearRegreesion2(object):
    
        def __init__(self):
            self.a_ = None
            self.b_ = None
    
        def fit(self, x_train, y_train):
            assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
            assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"
    
            x_mean = np.mean(x_train)
            y_mean = np.mean(y_train)
    
            # vectorized numerator and denominator: two dot products instead of a Python loop
            up = (x_train - x_mean).dot(y_train - y_mean)
            down = (x_train - x_mean).dot(x_train - x_mean)
    
            self.a_ = up / down
            self.b_ = y_mean - self.a_ * x_mean
            return self
    
        def predict(self, x_predict):
            """Predict for a vector of inputs, one prediction per element."""
            assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
            assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
            return np.array([self._predict(x) for x in x_predict])

        def _predict(self, x_single):
            """Predict a single value."""
            return self.a_ * x_single + self.b_
    
        def __repr__(self):
            return "SimpleLinearRegression"
    

Next, a performance comparison:

    if __name__ == '__main__':
        import numpy as np
        import time
    
        m = 100000
        x = np.random.random(size=m)
        y = x * 2.0 + 3.0 + np.random.normal(0, 1, size=m)
    
        start1 = time.time()
        reg1 = SimpleLinearRegreesion1()
        reg1.fit(x, y)
        stop1 = time.time()
        print("运行时间:%s" % (stop1 - start1))
    
        start2 = time.time()
        reg2 = SimpleLinearRegreesion2()
        reg2.fit(x, y)
        stop2 = time.time()
        print("运行时间:%s" % (stop2 - start2))
    

Output:

    elapsed time: 0.07222509384155273
    elapsed time: 0.008754968643188477

The vectorized implementation is roughly eight times faster here: the dot products run in NumPy's optimized C code rather than in a Python-level loop.

V. Evaluating the Linear Regression Model

1. Mean Squared Error (MSE)

$MSE = \dfrac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^{2}$

2. Root Mean Squared Error (RMSE)

$RMSE = \sqrt{MSE} = \sqrt{\dfrac{1}{m}\sum_{i=1}^{m}\left(y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right)^{2}}$

Taking the square root gives the RMSE the same units as $y$.

3. Mean Absolute Error (MAE)

$MAE = \dfrac{1}{m}\sum_{i=1}^{m}\left|y_{test}^{(i)} - \hat{y}_{test}^{(i)}\right|$

We did not choose this as the training objective at the start because the absolute value is not differentiable everywhere, which makes it a poor loss function; as an evaluation metric, however, it works quite well.

Next, let's run simple linear regression on the Boston housing data and evaluate the model with the three metrics above.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets

    # note: load_boston was removed in scikit-learn 1.2, so this needs an older version
    boston = datasets.load_boston()
    print(boston.DESCR)
    boston.feature_names
    x = boston.data[:, 5]   # use only the number-of-rooms feature (RM)
    y = boston.target       # median house price, in units of $1000
    plt.scatter(x, y)
    plt.show()
    

(Scatter plot of the number of rooms against house price; a horizontal row of points sits at the very top.)

The visualization reveals that row of points pinned at the top. This is likely because the survey capped the values: every price above 50 (i.e., $50,000, since the target is in units of $1000) was recorded as 50. We need to account for this when processing the data, which is why preprocessing is so important.

    x = x[y < np.max(y)]    # drop the samples pinned at the cap
    y = y[y < np.max(y)]
    plt.scatter(x, y)
    plt.show()
    
    from sklearn.model_selection import train_test_split
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=666)
    x_train.shape
    x_test.shape
    %run D:/goodgoodstudy/linear_regression/simple_linear_regression.py
    reg = SimpleLinearRegreesion2()
    reg.fit(x_train, y_train)
    reg.a_
    reg.b_
    plt.scatter(x_train, y_train)
    plt.plot(x_train, reg.predict(x_train), 'r')
    plt.show()
    
    

Now let's try out the three evaluation metrics:

    y_predict = reg.predict(x_test)
    mse_test = np.sum((y_predict - y_test) ** 2) / len(y_test)
    mse_test
    
    

Output: 28.215949368640796

    from math import sqrt
    rmse_test = sqrt(mse_test)
    rmse_test
    
    

Output: 5.311868726600912

    mae_test = np.sum(np.absolute(y_predict - y_test)) / len(y_test)
    mae_test
    
    

Output: 3.9489046062737843

Let's wrap these three metrics into functions:

    import numpy as np
    from math import sqrt
    
    def mean_square_error(y_true, y_predict):
        assert len(y_true) == len(y_predict), "the size of y_true must be equal to the size of y_predict"
    
        return np.sum((y_predict - y_true)**2) / len(y_true)
    
    def root_mean_square_error(y_true, y_predict):
        return sqrt(mean_square_error(y_true, y_predict))
    
    def mean_absolute_error(y_true, y_predict):
        assert len(y_true) == len(y_predict), "the size of y_true must be equal to the size of y_predict"
    
        return np.sum(np.absolute(y_predict - y_true)) / len(y_true)
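
A quick usage check (a sketch, reusing the train/test split from above); the values should match the three outputs we computed by hand:

    print(mean_square_error(y_test, y_predict))       # 28.21...
    print(root_mean_square_error(y_test, y_predict))  # 5.31...
    print(mean_absolute_error(y_test, y_predict))     # 3.94...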
    
    

In fact, sklearn already ships these functions, so why write them ourselves? Because just calling libraries is not enough; we should also know how the underlying code works. Let's check that sklearn's functions produce the same output as ours.

    from sklearn.metrics import mean_absolute_error
    from sklearn.metrics import mean_squared_error
    
    mean_absolute_error(y_test, y_predict)
    mean_squared_error(y_test, y_predict)
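
The two calls should reproduce the values we computed by hand. One subtlety: sklearn's mean_absolute_error shadows our function of the same name after the import, so a minimal sanity check compares it against the raw formula instead:

    assert np.isclose(mean_squared_error(y_test, y_predict),
                      mean_square_error(y_test, y_predict))
    assert np.isclose(mean_absolute_error(y_test, y_predict),
                      np.sum(np.absolute(y_predict - y_test)) / len(y_test))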
    
    

VI. The Best Metric for Linear Regression: R Squared

We already have three metrics for evaluating linear regression, so why would we need a "best" one? Think about classification: accuracy is 1 when completely correct and 0 when completely wrong, so we immediately know how good a classifier is and can compare across problems. With the three metrics above, we only know there is some error, not how bad it is relative to the problem, because their scale depends on the scale of $y$. So we want a metric whose range is likewise normalized to roughly 0 to 1.

$R^{2} = 1 - \dfrac{SS_{residual}}{SS_{total}} = 1 - \dfrac{\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2}}{\sum_{i=1}^{m}\left(\bar{y} - y^{(i)}\right)^{2}}$

The numerator is the total squared error of our model; the denominator is the total squared error of always predicting the mean $\bar{y}$.

If the model's prediction error is 0 or very small, the second term is close to 0 and $R^{2}$ is close to 1: the model is very good. If the model's error is about as large as the error from predicting the mean, the second term is close to 1 and $R^{2}$ is close to 0: the model is poor. If the model's error is even larger than the mean predictor's, the model is worse than useless and $R^{2}$ becomes negative, **which very likely means the data has no linear relationship at all**. So $R^{2}$ never exceeds 1, and we call the mean predictor the Baseline Model: the bottom line, a model that needs no learning and can be obtained from a simple statistic.
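
A tiny sketch of the boundary case, reusing the toy data from Section II: the baseline model that always predicts the mean gets exactly $R^{2} = 0$.

    import numpy as np

    y = np.array([1., 3., 2., 3., 5.])
    baseline = np.full_like(y, y.mean())  # predict the mean for every sample
    r2 = 1 - np.sum((baseline - y) ** 2) / np.sum((y.mean() - y) ** 2)
    print(r2)  # 0.0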

$R^{2} = 1 - \dfrac{\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^{2} / m}{\sum_{i=1}^{m}\left(\bar{y} - y^{(i)}\right)^{2} / m} = 1 - \dfrac{MSE(\hat{y}, y)}{Var(y)}$

Dividing numerator and denominator by $m$ in the formula above shows that $R^{2}$ also carries real statistical meaning: one minus the ratio of our model's MSE to the variance of $y$. Next, let's implement it in code.

    def r2_score(y_true, y_predict):
        """R^2: one minus the ratio of our model's MSE to the variance of y_true."""
        return 1 - mean_square_error(y_true, y_predict) / np.var(y_true)
    
    

In sklearn:

    from sklearn.metrics import r2_score
    
    r2_score(y_test, y_predict)
    
    

Finally, let's add this metric to our encapsulated simple linear regression class as a score method.

    import numpy as np
    from .metrics import r2_score
    
    class SimpleLinearRegreesion1(object):
    
        def __init__(self):
            self.a_ = None
            self.b_ = None
    
        def fit(self, x_train, y_train):
            assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
            assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"
    
            x_mean = np.mean(x_train)
            y_mean = np.mean(y_train)
    
            up = 0
            down = 0
            for x_i, y_i in zip(x_train, y_train):
                up += (x_i - x_mean) * (y_i - y_mean)
                down += (x_i - x_mean) ** 2
    
            self.a_ = up / down
            self.b_ = y_mean - self.a_ * x_mean
            return self
    
        def predict(self, x_predict):
            """Predict for a vector of inputs, one prediction per element."""
            assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
            assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
            return np.array([self._predict(x) for x in x_predict])

        def _predict(self, x_single):
            """Predict a single value."""
            return self.a_ * x_single + self.b_
    
        def __repr__(self):
            return "SimpleLinearRegression"
    
    
    class SimpleLinearRegreesion2(object):
    
        def __init__(self):
            self.a_ = None
            self.b_ = None
    
        def fit(self, x_train, y_train):
            assert x_train.ndim == 1, "Simple linear regression can only solve single feature training data"
            assert len(x_train) == len(y_train), "the size of x_train must be equal to the size of y_train"
    
            x_mean = np.mean(x_train)
            y_mean = np.mean(y_train)
    
            # vectorized numerator and denominator via dot products
            up = (x_train - x_mean).dot(y_train - y_mean)
            down = (x_train - x_mean).dot(x_train - x_mean)
    
            self.a_ = up / down
            self.b_ = y_mean - self.a_ * x_mean
            return self
    
        def predict(self, x_predict):
            """Predict for a vector of inputs, one prediction per element."""
            assert x_predict.ndim == 1, "Simple linear regression can only solve single feature training data"
            assert self.a_ is not None and self.b_ is not None, "must fit before predict!"
            return np.array([self._predict(x) for x in x_predict])

        def _predict(self, x_single):
            """Predict a single value."""
            return self.a_ * x_single + self.b_
    
        def score(self, x_test, y_test):
    
            y_predict = self.predict(x_test)
            return r2_score(y_test, y_predict)
    
        def __repr__(self):
            return "SimpleLinearRegression"
    
    

And that's it for simple linear regression! The idea itself is simple, but the mathematical statistics behind it are well worth studying and thinking over.

Postscript


This post's recommendation: a website that gathers thousands of tools in one place.

    http://tool.uixsj.cn/


Keep it up! Today is another energy-filled day!
