

    Ridge Regression and Kernel Ridge Regression

    Reference:
    1. scikit-learn documentation: linear_model, Ridge regression
    2. Machine learning for quantum mechanics in a nutshell
    3. scikit-learn sample ridge-path plotting code by Fabian Pedregosa (fabian.pedregosa@inria.fr)

    Ridge regression###

    Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients. The ridge coefficients minimize a penalized residual sum of squares:
    $$\underset{w}{\min} \left\| Xw-y \right\|_2^{2} + \lambda\left\| w \right\|_2^{2}$$
    Here, \(\lambda \ge 0\) is a complexity parameter that controls the amount of shrinkage: the larger the value of \(\lambda\), the greater the amount of shrinkage, and thus the coefficients become more robust to collinearity. The figure below shows the relationship between \(\lambda\) and the oscillations of the weights.
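    As a minimal sketch of this shrinkage effect (assuming scikit-learn is installed; the Hilbert-matrix design and the grid of penalty values are illustrative choices in the spirit of the referenced ridge-path example, and scikit-learn names the penalty `alpha`):

```python
# Sketch: how the ridge coefficients shrink as the penalty grows.
import numpy as np
from sklearn.linear_model import Ridge

# An ill-conditioned 10x10 Hilbert matrix makes the effect of collinearity visible.
X = 1.0 / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)

lambdas = np.logspace(-10, -2, 50)   # penalty strengths; scikit-learn's parameter is named "alpha"
coefs = []
for lam in lambdas:
    model = Ridge(alpha=lam, fit_intercept=False)
    model.fit(X, y)
    coefs.append(model.coef_)

# A larger penalty yields a smaller coefficient norm (more shrinkage, fewer oscillations).
print(np.linalg.norm(coefs[0]), np.linalg.norm(coefs[-1]))
```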

    Ridge Regression Theory###

    In kernel ridge regression (KRR), ridge regression is carried out in the feature space induced by a kernel \(k\): the prediction for a test input \(\tilde{x}\) is a kernel expansion over the training data \(x_i\):
    $$f(\tilde{x}) = \sum_{i=1}^{n} \alpha_i \, k(\tilde{x}, x_i)$$
    Although the dimensionality of the Hilbert space can be high, the solution lives in the finite span of the projected training data, enabling a finite representation. The corresponding convex optimization problem is:
    $$\underset{\alpha \in \mathbb{R}^{n}}{\mathrm{argmin}} \; \sum_{i=1}^{n}\left(f(x_i) - y_i\right)^2 + \lambda \left\| f \right\|_{H}^{2}$$
    $$\Leftrightarrow \underset{\alpha \in \mathbb{R}^{n}}{\mathrm{argmin}} \; \left\langle K\alpha-y,\, K\alpha-y \right\rangle + \lambda\, \alpha^{T}K\alpha$$
    where \(\left\| f \right\|_{H}^{2}\) is the squared norm of \(f\) in the Hilbert space \(H\), measuring the complexity of the linear ridge regression model in feature space, and \(K \in \mathbb{R}^{n \times n}\) with \(K_{i,j}=k(x_i, x_j)\) is the kernel matrix between training samples. As before, setting the gradient to 0 yields an analytic solution for the regression coefficients:
    $$\nabla_{\alpha}\left(\alpha^{T}K^{2}\alpha - 2\,\alpha^{T}Ky + y^{T}y + \lambda\,\alpha^{T}K\alpha\right) = 0 \Leftrightarrow K^{2}\alpha + \lambda K\alpha = Ky$$
    $$\Leftrightarrow \alpha=(K+\lambda I)^{-1}y$$
    where \(\lambda\) is a hyperparameter determining the strength of regularization. The regularization term \(\alpha^{T}K\alpha = \left\| f \right\|_{H}^{2}\) is related to smoothness: a smaller norm corresponds to a smoother, simpler model.
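    A minimal NumPy sketch of this closed-form solution, assuming a Gaussian kernel \(k(x, x') = \exp(-\|x-x'\|^2/(2\sigma^2))\); the helper names and the toy data are illustrative, not part of the original post:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def krr_fit(X, y, lam, sigma):
    """Regression coefficients alpha = (K + lam*I)^{-1} y."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

# Toy 1-D example: learn cos(x) from a handful of samples.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.cos(X_train).ravel()
alpha = krr_fit(X_train, y_train, lam=1e-8, sigma=1.0)
print(alpha[:5])
```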

    The figure below gives an example of a KRR model with a Gaussian kernel that demonstrates the role of the length-scale hyperparameter \(\sigma\). Although \(\sigma\) is not directly part of the regularization term, it does control the smoothness of the predictor and therefore effectively regularizes.

    The figure above shows kernel ridge regression with a Gaussian kernel and different length scales. The target function is \(\cos(x)\), and the KRR models are shown as dashed lines. With the regularization constant \(\lambda\) set to \(10^{-14}\), a very small \(\sigma\) fits the training set well but has very large in-between (interpolation) error, while a too large \(\sigma\) results in a model that is too close to linear, with both high training error and high prediction error.
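    A sketch of this experiment using scikit-learn's `KernelRidge` (the sample locations and the particular \(\sigma\) values are guesses, not the figure's exact settings); note that `KernelRidge` parametrizes the RBF kernel by `gamma` \(= 1/(2\sigma^2)\):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Training samples of cos(x) and a dense grid to measure the in-between error.
rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(-5, 5, size=(20, 1)), axis=0)
y_train = np.cos(X_train).ravel()
X_test = np.linspace(-5, 5, 200)[:, None]
y_test = np.cos(X_test).ravel()

# Compare training error vs. in-between error across length scales.
# With lambda this tiny, the system can become numerically singular for large sigma;
# scikit-learn then warns and falls back to a least-squares solve.
for sigma in (0.1, 1.0, 10.0):   # too small / reasonable / too large length scale
    model = KernelRidge(alpha=1e-14, kernel="rbf", gamma=1.0 / (2.0 * sigma**2))
    model.fit(X_train, y_train)
    train_mse = np.mean((model.predict(X_train) - y_train) ** 2)
    test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"sigma={sigma:5.1f}  train MSE={train_mse:.2e}  in-between MSE={test_mse:.2e}")
```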

    From the above description we can see that all the information about the KRR model is contained in the matrix \(\mathbf K\) of kernel evaluations between the training data. Similarly, all the information required to predict new inputs \(\tilde{x}\) is contained in the kernel matrix of the training set versus the prediction data.

    Kernel Ridge Regression###

    The regression coefficients \(\alpha\) are obtained by solving the linear system of equations \((\mathbf K + \lambda \mathbf I)\alpha=\mathbf y\), where \((\mathbf K + \lambda \mathbf I)\) is symmetric and strictly positive definite. To solve this system we can use the Cholesky decomposition \(\mathbf K + \lambda \mathbf I = \mathbf{U}^{T}\mathbf{U}\), where \(\mathbf U\) is upper triangular. One then breaks the equation \(\mathbf{U}^{T}\mathbf{U}\alpha=\mathbf y\) into two equations: the first is \(\mathbf{U}^{T}\eta=\mathbf y\), and the other is \(\mathbf{U}\alpha=\eta\). Since \(\mathbf{U}^{T}\) is lower triangular and \(\mathbf{U}\) is upper triangular, this requires only two straightforward passes over the data, called forward and backward substitution, respectively. For \(\mathbf{U}^{T}\eta=\mathbf y\) this looks as follows:
    $$u_{1,1}\,\eta_1=y_1 \Leftrightarrow \eta_1=y_1/u_{1,1}$$
    $$u_{1,2}\,\eta_1 + u_{2,2}\,\eta_2=y_2 \Leftrightarrow \eta_2=(y_2 - u_{1,2}\eta_1)/u_{2,2}$$
    $$\ldots$$
    $$\sum_{j=1}^{i} u_{j,i}\,\eta_j=y_i \Leftrightarrow \eta_i=\left(y_i-\sum_{j=1}^{i-1}u_{j,i}\eta_j\right)/u_{i,i}$$
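    A sketch of this two-pass solve using SciPy (the toy kernel matrix is illustrative): `cho_factor`/`cho_solve` bundle the factorization and the two substitutions, and the explicit triangular solves spell out the forward and backward passes described above:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

# Illustrative symmetric positive definite system (K + lam*I) alpha = y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)
lam = 1e-3
y = rng.normal(size=50)
A = K + lam * np.eye(50)

# One-shot: factor A, then forward/backward substitution internally.
c, low = cho_factor(A)                 # returns the triangular factor and a "lower" flag
alpha = cho_solve((c, low), y)

# The same two passes spelled out explicitly with triangular solves.
U = np.linalg.cholesky(A).T            # numpy returns the lower factor L; U = L^T is upper triangular
eta = solve_triangular(U.T, y, lower=True)               # forward substitution:  U^T eta = y
alpha_explicit = solve_triangular(U, eta, lower=False)   # backward substitution: U alpha = eta

print(np.allclose(alpha, alpha_explicit))
```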
    Once the model is trained, predictions can be made: the prediction for a new input \(\mathbf{\tilde{x}}\) is the inner product between the vector of coefficients and the vector of corresponding kernel evaluations.

    For a test dataset \(\mathbf{\tilde{X}} \in \mathbb{R}^{n \times d}\) with rows \(\mathbf{\tilde{x}_1,\ldots,\tilde{x}_n}\), let \(\mathbf{L} \in \mathbb{R}^{n \times n}\) be the kernel matrix of training versus prediction inputs, \(\mathbf{L}_{i,j}=k(x_i, \tilde{x}_j)\); the predictions are then \(\mathbf{\tilde{y}} = \mathbf{L}^{T}\alpha\).
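    Putting the training and prediction steps together in one self-contained NumPy sketch (the Gaussian kernel, \(\sigma\), \(\lambda\), and the toy data are again illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.cos(X_train).ravel()
X_test = np.linspace(-3, 3, 5)[:, None]

# Train: alpha = (K + lam*I)^{-1} y, with K the kernel matrix of the training data.
lam = 1e-8
K = gaussian_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

# Predict: L[i, j] = k(x_i, x_tilde_j); each prediction is the inner product of alpha with a column of L.
L = gaussian_kernel(X_train, X_test)
y_pred = L.T @ alpha
print(np.round(y_pred, 3), np.round(np.cos(X_test).ravel(), 3))
```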
    This method has the same order of complexity as Ordinary Least Squares. In the next note I will go through the details of implementing KRR.

    Original post: https://www.cnblogs.com/wenwangt/p/5337596.html