09 Machine Learning in Action: Multiple Linear Regression

    Basic Concepts

    1. Difference from simple linear regression
              multiple independent variables (x)

    2. Multiple regression model
         y = β0 + β1x1 + β2x2 + ... + βpxp + ε
        where: β0, β1, β2, ..., βp are the parameters
               ε is the error term

    3. Multiple regression equation
         E(y) = β0 + β1x1 + β2x2 + ... + βpxp

    4. Estimated multiple regression equation:
         ŷ = b0 + b1x1 + b2x2 + ... + bpxp

        A sample is used to compute b0, b1, b2, ..., bp, the point estimates of β0, β1, β2, ..., βp
    5. Estimation procedure (similar to simple linear regression)
    6. Estimation method
            minimize the sum of squared residuals
    The computation is similar to simple linear regression, but involves linear algebra and matrix operations.

    Derivation

    Stack the m training samples into the design matrix X_b (the original X with a column of ones prepended), so X_b is an m×(1+n) matrix and θ = (θ0, θ1, ..., θn)^T is a (1+n)×1 vector. The sum of squares to minimize is

        J(θ) = (y − X_b·θ)^T (y − X_b·θ)
             = y^T·y − y^T·X_b·θ − θ^T·X_b^T·y + θ^T·X_b^T·X_b·θ

    Checking dimensions (X_b is m×(1+n), θ is (1+n)×1) shows that each of these terms is a scalar, a 1×1 matrix product; note also that X_b^T·X_b is a symmetric matrix.

    Differentiating the terms with respect to θ uses the standard vector-derivative identities:

        d(a^T·θ)/dθ = a        d(θ^T·a)/dθ = a        d(θ^T·A·θ)/dθ = (A + A^T)·θ = 2A·θ  (when A is symmetric)

    Setting the gradient to zero yields the normal equation:

        dJ/dθ = −2·X_b^T·y + 2·X_b^T·X_b·θ = 0   ⇒   θ = (X_b^T·X_b)^(−1)·X_b^T·y

    This closed-form solution is exactly what fit_normal implements below.
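    As a quick sanity check, the normal equation can be verified with NumPy on a small synthetic dataset whose true parameters are known (a minimal sketch; the data below is made up purely for illustration):

    import numpy as np

    # synthetic data with known parameters: y = 3 + 2*x1 - 1*x2 + small noise
    rng = np.random.RandomState(0)
    X = rng.rand(100, 2)
    y = 3 + 2 * X[:, 0] - 1 * X[:, 1] + 0.01 * rng.randn(100)

    # prepend a column of ones, then apply theta = (X_b^T X_b)^(-1) X_b^T y
    X_b = np.hstack([np.ones((len(X), 1)), X])
    theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
    print(theta)  # approximately [3., 2., -1.]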

     

    Code Implementation

    In [1]:
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
    In [12]:
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    
    In [13]:
    X.shape
    
    Out[13]:
    (506, 13)
    In [14]:
    X = X[y < 50.0]
    y = y[y < 50.0]
    
    In [15]:
    X.shape
    
    Out[15]:
    (490, 13)
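    Samples whose target equals 50.0 are dropped because prices in this dataset are capped at $50k: those points are censored values rather than true prices, and keeping them can distort the fit.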
    In [56]:
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    
    In [57]:
    from ml09linearRegression2 import LinearRegression
    
    reg = LinearRegression()
    
    In [58]:
    reg.fit_normal(X_train, y_train)
    
    Out[58]:
    LinearRegression()
    In [59]:
    reg.coef_
    
    Out[59]:
    array([-1.02165165e-01,  2.90834759e-02, -3.20513846e-02,  3.87701319e-01,
           -1.22357592e+01,  3.55691305e+00, -2.81445439e-02, -1.10019435e+00,
            2.37232297e-01, -1.35143455e-02, -8.66922512e-01,  5.86471407e-03,
           -3.67607741e-01])
    In [60]:
    reg.intercept_
    
    Out[60]:
    34.867232717346994
    In [61]:
    reg.score(X_test, y_test)
    
    Out[61]:
    0.7790737176672187
     
    # ml09linearRegression2.py
    import numpy as np
    from ml09metrics import r2_score


    class LinearRegression:

        def __init__(self):
            """Initialize the Linear Regression model"""
            self.coef_ = None
            self.intercept_ = None
            self._theta = None

        def fit_normal(self, X_train, y_train):
            """Train the Linear Regression model on the training set X_train, y_train"""
            assert X_train.shape[0] == y_train.shape[0], \
                "the size of X_train must be equal to the size of y_train"

            # prepend a column of ones, then solve the normal equation
            X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
            self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

            self.intercept_ = self._theta[0]
            self.coef_ = self._theta[1:]

            return self

        def predict(self, X_predict):
            """Given a data set X_predict, return the vector of predictions for X_predict"""
            assert self.intercept_ is not None and self.coef_ is not None, \
                "must fit before predict!"
            assert X_predict.shape[1] == len(self.coef_), \
                "the feature number of X_predict must be equal to X_train"

            X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
            return X_b.dot(self._theta)

        def score(self, X_test, y_test):
            """Evaluate the current model (R^2) on the test set X_test, y_test"""

            y_predict = self.predict(X_test)
            return r2_score(y_test, y_predict)

        def __repr__(self):
            return "LinearRegression()"
    # ml09metrics.py
    import numpy as np
    from math import sqrt


    def accuracy_score(y_true, y_predict):
        """Compute the accuracy between y_true and y_predict"""
        assert len(y_true) == len(y_predict), \
            "the size of y_true must be equal to the size of y_predict"

        return np.sum(y_true == y_predict) / len(y_true)


    def mean_squared_error(y_true, y_predict):
        """Compute the MSE between y_true and y_predict"""
        assert len(y_true) == len(y_predict), \
            "the size of y_true must be equal to the size of y_predict"

        return np.sum((y_true - y_predict) ** 2) / len(y_true)


    def root_mean_squared_error(y_true, y_predict):
        """Compute the RMSE between y_true and y_predict"""

        return sqrt(mean_squared_error(y_true, y_predict))


    def mean_absolute_error(y_true, y_predict):
        """Compute the MAE between y_true and y_predict"""
        assert len(y_true) == len(y_predict), \
            "the size of y_true must be equal to the size of y_predict"

        return np.sum(np.absolute(y_true - y_predict)) / len(y_true)


    def r2_score(y_true, y_predict):
        """Compute the R squared between y_true and y_predict"""

        return 1 - mean_squared_error(y_true, y_predict) / np.var(y_true)
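    One caveat about fit_normal: np.linalg.inv raises a LinAlgError when X_b.T.dot(X_b) is singular, for example when features are perfectly collinear. A more numerically robust way to solve the same least-squares problem is np.linalg.lstsq (a sketch; the helper name here is hypothetical):

    import numpy as np

    def solve_theta_lstsq(X_train, y_train):
        """Solve X_b @ theta ≈ y by least squares, without forming an explicit inverse."""
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        theta, residuals, rank, singular_values = np.linalg.lstsq(X_b, y_train, rcond=None)
        return theta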

    Multiple Linear Regression with scikit-learn

    In [32]:
    from sklearn import datasets
    
    In [33]:
    boston = datasets.load_boston()
    X = boston.data
    y = boston.target
    
    In [34]:
    X = X[y < 50.0]
    y = y[y < 50.0]
    
    In [35]:
    from sklearn.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=666)
    
    In [36]:
    from sklearn.linear_model import LinearRegression
    
    In [37]:
    lin_reg = LinearRegression()
    
    In [38]:
    lin_reg.fit(X_train, y_train)
    
    Out[38]:
    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
             normalize=False)
    In [39]:
    lin_reg.coef_
    
    Out[39]:
    array([-1.15625837e-01,  3.13179564e-02, -4.35662825e-02, -9.73281610e-02,
           -1.09500653e+01,  3.49898935e+00, -1.41780625e-02, -1.06249020e+00,
            2.46031503e-01, -1.23291876e-02, -8.79440522e-01,  8.31653623e-03,
           -3.98593455e-01])
    In [40]:
    lin_reg.intercept_
    
    Out[40]:
    32.59756158869987
    In [41]:
    lin_reg.score(X_test, y_test)
    
    Out[41]:
    0.8009390227581038
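    The coefficients and score differ a little from our own implementation above, mainly because train_test_split was re-run here with random_state=666, producing a different train/test split.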
     

    KNN Regressor

    In [42]:
    from sklearn.neighbors import KNeighborsRegressor
    
    knn_reg = KNeighborsRegressor()
    
    In [44]:
    knn_reg.fit(X_train, y_train)
    
    Out[44]:
    KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
              metric_params=None, n_jobs=None, n_neighbors=5, p=2,
              weights='uniform')
    In [45]:
    knn_reg.score(X_test, y_test)
    
    Out[45]:
    0.602674505080953
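    The untuned KNN regressor scores noticeably worse than linear regression here. KNN is distance-based and the Boston features are on very different scales, so standardizing them usually helps; a minimal sketch using a Pipeline with StandardScaler (not part of the original notebook):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsRegressor

    # scale every feature to zero mean and unit variance before the distance computation
    scaled_knn = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsRegressor()),
    ])
    scaled_knn.fit(X_train, y_train)
    scaled_knn.score(X_test, y_test)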
     

    Grid Search in sklearn

    In [46]:
    from sklearn.model_selection import GridSearchCV
    
    In [50]:
    para_grid = [
        {"weights": ["uniform"],
         "n_neighbors": [i for i in range(1, 11)]
         },
        {"weights": ["distance"],
         "n_neighbors": [i for i in range(1, 11)],
         "p": [i for i in range(1, 6)]
         }
    ]
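    This grid contains 10 candidates with "uniform" weights plus 10 × 5 = 50 with "distance" weights, 60 in total; with the default 3-fold cross-validation that gives the 180 fits reported below.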
    
    In [51]:
    knn_reg = KNeighborsRegressor()
    # n_jobs sets how many CPU cores to use; -1 means all of them
    grid_search = GridSearchCV(knn_reg, para_grid, n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)
    
     
    Fitting 3 folds for each of 60 candidates, totalling 180 fits
    
     
    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
    [Parallel(n_jobs=-1)]: Done  70 tasks      | elapsed:    3.1s
    [Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    3.7s finished
    C:\Users\Administrator\Envs\MachineLearning\lib\site-packages\sklearn\model_selection\_search.py:841: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
      DeprecationWarning)
    
    Out[51]:
    GridSearchCV(cv='warn', error_score='raise-deprecating',
           estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
              metric_params=None, n_jobs=None, n_neighbors=5, p=2,
              weights='uniform'),
           fit_params=None, iid='warn', n_jobs=-1,
           param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
           pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
           scoring=None, verbose=1)
    In [52]:
    grid_search.best_params_
    
    Out[52]:
    {'n_neighbors': 6, 'p': 1, 'weights': 'distance'}
    In [53]:
    grid_search.best_score_  # this score is computed with cross-validation
    
    Out[53]:
    0.6060528490355778
    In [54]:
    grid_search.best_estimator_.score(X_test, y_test)
    # the same scoring criterion (R^2) as linear regression
    
    Out[54]:
    0.7353138117643773
     

    More Thoughts on Multiple Linear Regression

    In [55]:
    import numpy as np
    
    np.argsort(lin_reg.coef_)
    
    Out[55]:
    array([ 4,  7, 10, 12,  0,  3,  2,  6,  9, 11,  1,  8,  5], dtype=int64)
    In [58]:
    boston.feature_names
    
    Out[58]:
    array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
           'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
    In [59]:
    boston.feature_names[np.argsort(lin_reg.coef_)]
    
    Out[59]:
    array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'CHAS', 'INDUS', 'AGE',
           'TAX', 'B', 'ZN', 'RAD', 'RM'], dtype='<U7')
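    np.argsort orders the coefficients from most negative to most positive, so this list runs from the feature most negatively associated with price (NOX, nitric oxide concentration) to the most positively associated one (RM, average number of rooms). The signs and ordering of the coefficients are what make the linear model interpretable.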
    In [60]:
    print(boston.DESCR)
    
     
    .. _boston_dataset:
    
    Boston house prices dataset
    ---------------------------
    
    **Data Set Characteristics:**  
    
        :Number of Instances: 506 
    
        :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    
        :Attribute Information (in order):
            - CRIM     per capita crime rate by town
            - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
            - INDUS    proportion of non-retail business acres per town
            - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
            - NOX      nitric oxides concentration (parts per 10 million)
            - RM       average number of rooms per dwelling
            - AGE      proportion of owner-occupied units built prior to 1940
            - DIS      weighted distances to five Boston employment centres
            - RAD      index of accessibility to radial highways
            - TAX      full-value property-tax rate per $10,000
            - PTRATIO  pupil-teacher ratio by town
            - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
            - LSTAT    % lower status of the population
            - MEDV     Median value of owner-occupied homes in $1000's
    
        :Missing Attribute Values: None
    
        :Creator: Harrison, D. and Rubinfeld, D.L.
    
    This is a copy of UCI ML housing dataset.
    https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
    
    
    This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
    
    The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
    prices and the demand for clean air', J. Environ. Economics & Management,
    vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
    ...', Wiley, 1980.   N.B. Various transformations are used in the table on
    pages 244-261 of the latter.
    
    The Boston house-price data has been used in many machine learning papers that address regression
    problems.   
         
    .. topic:: References
    
       - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
       - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann. 