PRML 2: Regression Models

    1. Linear Regression Model

      The Least Squares Method can be derived from maximum likelihood estimation under the assumption of Gaussian noise, and hence it can give rise to the problem of over-fitting, which is a general property of MLE.

      The Regularized Least Squares method, which is the generalized form of LSM, stems from a simple Bayesian approach called maximum a posteriori (MAP) estimation. Given a zero-mean isotropic Gaussian prior over the parameters, we can deduce the following closed-form solution to the linear regression problem:

        $\vec{w}_{MAP}=(\lambda I+X^T X)^{-1}X^T\cdot\vec{t}$,  where $X\in\mathbb{R}^{N\times m}$, $\vec{t}\in\mathbb{R}^{N\times 1}$, and hence $\vec{w}\in\mathbb{R}^{m\times 1}$.
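
      As a quick sketch of the derivation (with the bias absorbed into $\vec{w}$), this is just the stationary point of the regularized sum-of-squares error:

        $E(\vec{w})=\frac{1}{2}\sum_{n=1}^N(t_n-\vec{w}^T\vec{x}_n)^2+\frac{\lambda}{2}\vec{w}^T\vec{w}$,   $\nabla E(\vec{w})=X^T(X\vec{w}-\vec{t})+\lambda\vec{w}=0$   $\Rightarrow$   $\vec{w}_{MAP}=(\lambda I+X^T X)^{-1}X^T\vec{t}$.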

      function w = linRegress(X,t,lamb)
          % Closed-Form Solution of MAP Linear Regression
          % Precondition: X is a set of data columns,
          %       row vector t is the labels of X
          % Postcondition: w is the linear model parameter
          %       such that y = w'* x
          if (nargin<3)
              % MLE, no regularizer (penalty term)
              lamb = 0;
          end
          m = size(X,1); % m-1 features, one constant term
          w = (X*X'+lamb*eye(m))\(X*t'); % solve (X*X'+lamb*I)*w = X*t'
      end

      Since the problem is convex, a local minimum must be the global minimum, and thus we can also reach the solution with sequential methods such as batch gradient descent or, as in the listing below, the Newton-Raphson method:

      function w = linRegress(X,t,err)
          % Iterative Solution of Linear Regression
          %       by using the Newton-Raphson Method
          % Precondition: X is a set of data columns,
          %       row vector t is the labels of X
          % Postcondition: w is the linear model parameter
          %       such that y = w'* x
          if (nargin<3)
              err = 0.0001;
          end
          m = size(X,1);
          w = zeros(m,1);
          grad = calGrad(X,t,w);
          while (norm(grad)>=err)
              w = w-calHess(X)\grad; % Newton step: w <- w - H^{-1}*grad
              grad = calGrad(X,t,w);
          end
      end

      function grad = calGrad(X,t,w)
          % Gradient of the Sum-of-Squares Error Function
          [m,n] = size(X);
          grad = zeros(m,1);
          for i = 1:n
              grad = grad+(w'*X(:,i)-t(i))*X(:,i);
          end
      end

      function hess = calHess(X)
          % Hessian Matrix of the Error Function (equals X*X')
          m = size(X,1);
          hess = zeros(m);
          for i = 1:m
              for j = 1:m
                  hess(i,j) = X(i,:)*X(j,:)';
              end
          end
      end

      In the frequentist viewpoint of model complexity, the expected squared loss can be decomposed into a squared bias (the difference between the average prediction and the desired regression function), a variance term (sensitivity to the particular data set) and a constant noise term. A relatively rigid model (e.g. MAP with large $\lambda$) has lower variance but higher bias compared with a flexible model (e.g. MLE); this is the so-called bias-variance trade-off.
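
      Concretely, writing $h(\vec{x})=\mathbb{E}[t\text{ | }\vec{x}]$ for the optimal regression function and $y(\vec{x};D)$ for the prediction learned from a data set $D$, the standard decomposition reads:

        $\mathbb{E}[L]=(\text{bias})^2+\text{variance}+\text{noise}$,  where
        $(\text{bias})^2=\int(\mathbb{E}_D[y(\vec{x};D)]-h(\vec{x}))^2 p(\vec{x})d\vec{x}$,
        $\text{variance}=\int\mathbb{E}_D[(y(\vec{x};D)-\mathbb{E}_D[y(\vec{x};D)])^2]p(\vec{x})d\vec{x}$,
        $\text{noise}=\iint(h(\vec{x})-t)^2 p(\vec{x},t)\,d\vec{x}\,dt$.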

      Bayesian model comparison would choose model averaging instead, which is also known as the fully Bayesian treatment. Given the prior distribution of models (hyper-parameters) $p(M_i)$ and the marginal likelihood $p(D\text{ | }M_i)$ (a convolution of $p(D\text{ | }\vec{w},M_i)$ and $p(\vec{w}\text{ | }M_i)$), we can deduce the posterior distribution of models $p(M_i\text{ | }D)$. To make a prediction, we just marginalize with respect to both the parameters and the hyper-parameters, as spelled out below.
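
      In symbols:

        $p(M_i\text{ | }D)\propto p(M_i)\,p(D\text{ | }M_i)$,   $p(D\text{ | }M_i)=\int p(D\text{ | }\vec{w},M_i)\,p(\vec{w}\text{ | }M_i)\,d\vec{w}$,
        $p(t\text{ | }\vec{x},D)=\sum_i p(t\text{ | }\vec{x},M_i,D)\,p(M_i\text{ | }D)$.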

      However, the computations in the fully Bayesian treatment are usually intractable. To make an approximation, we can carry out model selection in light of the model posterior distribution (a MAP estimate over the hyper-parameters), and marginalize only over the parameters. Here we take linear regression as an example to illustrate how to implement such an evidence approximation once the optimal hyper-parameters have been found.

      First of all, given the prior $p(\vec{w})=Gauss(\vec{w}\text{ | }\vec{m}_0,S_0)$ and the noise precision $\beta$, we can infer the posterior distribution of the parameters:

        $p(\vec{w}\text{ | }X,\vec{t})=Gauss(\vec{w}\text{ | }\vec{m}_N,S_N)$,  where  $\vec{m}_N=S_N\cdot(S_0^{-1}\vec{m}_0+\beta X^T\vec{t})$,   $S_N^{-1}=S_0^{-1}+\beta X^T X$.

      Then we shall calculate the convolution of the likelihood of a new target $t$ and the posterior of $\vec{w}$ to get the predictive distribution for a new input $\vec{x}$:

        $p(t\text{ | }\vec{x},X,\vec{t})=Gauss(t\text{ | }\vec{m}_N^T\cdot\vec{x},\ \beta^{-1}+\vec{x}^T S_N\vec{x})$

      We can easily see that the predictive mean is a linear combination of the training-set target variables.
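
      A minimal MATLAB sketch of these two steps, following the data-column convention of the listings above (the function name bayesLinRegress is just illustrative), could look like this:

      function [mu,sigma2] = bayesLinRegress(X,t,beta,m0,S0,x)
          % Posterior and predictive distribution of Bayesian linear regression
          % Precondition: X is a set of data columns, row vector t is the labels of X,
          %       beta is the noise precision, m0/S0 are the prior mean and covariance,
          %       x is a data column for testing
          % Postcondition: mu and sigma2 are the predictive mean and variance at x
          SN = inv(inv(S0)+beta*(X*X'));   % posterior covariance S_N
          mN = SN*(S0\m0+beta*X*t');       % posterior mean m_N
          mu = mN'*x;                      % predictive mean
          sigma2 = 1/beta+x'*SN*x;         % predictive variance
      end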

    2. Logistic Regression Model

      Discriminative models are classification models in which we presume a certain parametric form for the posterior distribution $p(C_k\text{ | }\vec{x})$ (e.g. a softmax function), and then either (1) estimate the parameters by maximizing the likelihood in a frequentist way, or (2) iteratively obtain the posterior of the parameters in a Bayesian approach (for later marginalization).

      For example, in 2-class Logistic Regression, we presume the form of $p(C_1\text{ | }\vec{x})$ is $y(\vec{x})=\sigma(\vec{w}^T\vec{x}+b)$, where $\sigma(a)=(1+e^{-a})^{-1}$, and by taking the negative logarithm of the likelihood we get the cross-entropy error function:

         $E(\vec{w})=-\ln p(\vec{t}\text{ | }\vec{w})=-\sum_{n=1}^N\{t_n\cdot\ln y_n+(1-t_n)\cdot\ln(1-y_n)\}$

    where $t_n$ indicates whether $\vec{x}_n$ belongs to $C_1$, and $y_n=p(C_1\text{ | }\vec{x}_n)$. To minimize it, we can use the Newton-Raphson method to update the parameters iteratively:

         $\vec{w}^{(new)}=\vec{w}^{(old)}-(X^T R X)^{-1}\cdot X^T\cdot(\vec{y}-\vec{t})$,  where $R$ is a diagonal matrix with $R_{nn}=y_n(1-y_n)$.

      function y = logRegress(X,t,x)
          % Logistic Regression for 2-Class Classification
          % Precondition: X is a set of data columns for training,
          %       row vector t is the labels of X (+1 or -1)
          %       x is a data column for testing
          % Postcondition: y is the predicted label (+1 or -1) of data x
          m = size(X,1);
          options = optimset('GradObj','on','MaxIter',1000);
          w = fminunc(@logRegCost,rand(m,1),options);
          function [val,grad] = logRegCost(w)
              % Determine the value and the gradient
              %       of the (averaged) cross-entropy error function
              q = (t+1)/2;                  % map labels from {-1,+1} to {0,1}
              p = 1./(1+exp(-w'*X));        % predicted probabilities p(C1|x_n)
              n = size(q,2);
              val = (-q*log(p')-(1-q)*log(1-p'))/n;
              grad = X*(p-q)'/n;            % gradient consistent with val
          end
          y = 2*round(1./(1+exp(-w'*x)))-1; % threshold at 0.5, map back to {-1,+1}
      end
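
      Alternatively, instead of calling a generic optimizer, the Newton-Raphson (IRLS) update above can be coded directly. Here is a minimal sketch under the same data-column convention, assuming 0/1 labels in t (the function name is just illustrative):

      function w = logRegressIRLS(X,t,maxIter)
          % Iterative Reweighted Least Squares for 2-class logistic regression
          % Precondition: X is a set of data columns, row vector t holds 0/1 labels
          % Postcondition: w is the parameter vector such that p(C1|x) = sigma(w'*x)
          if (nargin<3)
              maxIter = 100;
          end
          m = size(X,1);
          w = zeros(m,1);
          for iter = 1:maxIter
              y = 1./(1+exp(-w'*X));   % current predictions (row vector)
              R = diag(y.*(1-y));      % weighting matrix with R_nn = y_n(1-y_n)
              H = X*R*X';              % Hessian of the error function
              grad = X*(y-t)';         % gradient of the error function
              w = w-H\grad;            % Newton-Raphson update
          end
      end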

       When it comes to the Bayesian method, we should first know the technique of Laplace Approximation, which constructs a Gaussian distribution to approximate a function $f(\vec{z})$ around a mode $\vec{z}_0$:

        $f(\vec{z})\approx f(\vec{z}_0)\exp\{-\frac{1}{2}(\vec{z}-\vec{z}_0)^T A(\vec{z}-\vec{z}_0)\}$,  i.e.  $q(\vec{z})=Gauss(\vec{z}\text{ | }\vec{z}_0,A^{-1})$,

    where $A$ is the Hessian matrix of the negative logarithm of $f(\vec{z})$ at the point $\vec{z}_0$.

      To tackle logistic regression in a Bayesian way, given a Gaussian prior $p(\vec{w})=Gauss(\vec{w}\text{ | }\vec{m}_0,S_0)$, we should first use the method above to approximate the posterior distribution $p(\vec{w}\text{ | }X,\vec{t})$ as $Gauss(\vec{w}\text{ | }\vec{w}_{MAP},S_N)$, and then approximate the predictive distribution (the convolution of $p(C_1\text{ | }\vec{x},\vec{w})$ and $p(\vec{w}\text{ | }X,\vec{t})$) as $\sigma(\kappa(\sigma_a^2)\mu_a)$, where $\kappa(\sigma^2_a)=(\sqrt{1+\pi\sigma_a^2/8})^{-1}$, $\sigma_a^2=\vec{x}^T S_N\vec{x}$, and $\mu_a=\vec{w}_{MAP}^T\cdot\vec{x}$.
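
      Once $\vec{w}_{MAP}$ and $S_N$ have been obtained (e.g. from the Laplace approximation above), the approximate predictive probability takes only a few lines; the function name below is again just for illustration:

      function p = bayesLogRegressPredict(wMAP,SN,x)
          % Approximate predictive probability p(C1 | x) for Bayesian
          %       logistic regression via the probit approximation
          % Precondition: wMAP is the MAP parameter vector, SN is the
          %       posterior covariance, x is a data column for testing
          mu_a = wMAP'*x;                  % mean of the activation a = w'*x
          sigma2_a = x'*SN*x;              % variance of the activation
          kappa = 1/sqrt(1+pi*sigma2_a/8); % scaling factor kappa(sigma_a^2)
          p = 1/(1+exp(-kappa*mu_a));      % approximate convolution of sigmoid and Gaussian
      end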

    References:

      1. Bishop, Christopher M. Pattern Recognition and Machine Learning [M]. Singapore: Springer, 2006
