  • Logistic Regression

    Motivation

    If y only takes values in a small discrete set such as {0,1}, then using Linear Regression does not make much sense: the prediction can fall outside the label range (\(\hat y>1\) or \(\hat y<0\)). Fortunately, we can modify Linear Regression so that it produces a value in [0,1].

    Details

    We choose the sigmoid/logistic function to map the linear combination \(\theta^Tx\) into (0,1):

    \[h_\theta(x)=g(\theta^Tx),\quad g(z)=\frac{1}{1+e^{-z}}\]

    (Figure: plot of the sigmoid function \(g(z)\), which saturates at 0 and 1.)
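
    As a quick illustration (a sketch of my own, not from the original post), the hypothesis can be computed in a few lines of NumPy; `theta` and `x` below are made-up 1-D arrays of the same length, with `x[0] = 1` as the intercept feature:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x)
    return sigmoid(np.dot(theta, x))

theta = np.array([-1.0, 2.0, 0.5])      # made-up parameters
x = np.array([1.0, 0.3, -0.2])          # x[0] = 1 is the intercept feature
print(hypothesis(theta, x))             # a probability in (0, 1), ~0.38 here
```
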
    We can assume that:

    \[h_\theta(x)=P(y=1|x;\theta)\\ 1-h_\theta(x)=P(y=0|x;\theta)\]

    Or more compactly:

    \[p(y|x;\theta)=[h_\theta(x)]^y[1-h_\theta(x)]^{1-y}\]
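
    A tiny sketch (hypothetical values, not from the post) showing that the compact form reduces to the two cases above:

```python
def bernoulli_pmf(h, y):
    # p(y | x; theta) = h^y * (1 - h)^(1 - y): equals h when y = 1 and 1 - h when y = 0
    return (h ** y) * ((1 - h) ** (1 - y))

h = 0.8                                          # pretend h_theta(x) = 0.8
print(bernoulli_pmf(h, 1), bernoulli_pmf(h, 0))  # 0.8 and (up to float rounding) 0.2
```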

    Now we will use maximum likelihood to fit the parameters \(\theta\). Assuming the n training examples are independent, the likelihood of the parameters is:

    \[L(\theta)=p(\vec y|X;\theta)=\prod_{i=1}^{n}p(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^{n}[h(x^{(i)})]^{y^{(i)}}[1-h(x^{(i)})]^{1-y^{(i)}}\]

    To make life easier, we use the log likelihood:

    \[l(\theta)=\log L(\theta)=\sum_{i=1}^{n}\left[y^{(i)}\log h(x^{(i)})+(1-y^{(i)})\log (1-h(x^{(i)}))\right]\]
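
    A minimal sketch of this log likelihood, assuming a design matrix `X` (one row per example) and a 0/1 label vector `y` (both toy data of my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data: 6 examples, 2 features (the first column is the intercept feature)
X = np.array([[1.0, -0.3], [1.0, 0.2], [1.0, 0.5],
              [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0])
print(log_likelihood(np.zeros(2), X, y))   # 6 * log(0.5) ~ -4.16 at theta = 0
```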

    Let's first take a single example \((x,y)\) and derive the stochastic gradient ascent rule:

    \[\frac{\partial}{\partial \theta_j}l(\theta)=\left[y\frac{1}{g(\theta^Tx)}-(1-y)\frac{1}{1-g(\theta^Tx)}\right]\frac{\partial}{\partial \theta_j}g(\theta^Tx) \\=\left[y\frac{1}{g(\theta^Tx)}-(1-y)\frac{1}{1-g(\theta^Tx)}\right]g(\theta^Tx)(1-g(\theta^Tx))\frac{\partial}{\partial \theta_j}\theta^Tx \\=\left[y(1-g(\theta^Tx))-(1-y)g(\theta^Tx)\right]x_j=(y-h_\theta(x))x_j\]
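
    As a sanity check (a sketch of my own with made-up numbers), the derived gradient \((y-h_\theta(x))x_j\) can be compared against a numerical derivative of the single-example log likelihood:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single made-up example and parameter vector
theta = np.array([0.3, -0.7])
x = np.array([1.0, 2.0])
y = 1

def log_lik(t):
    # y * log h(x) + (1 - y) * log(1 - h(x)) for this single example
    h = sigmoid(np.dot(t, x))
    return y * np.log(h) + (1 - y) * np.log(1 - h)

analytic = (y - sigmoid(np.dot(theta, x))) * x   # the derived gradient
eps = 1e-6
numeric = np.array([(log_lik(theta + eps * e) - log_lik(theta - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(analytic)
print(numeric)   # the two should agree to several decimal places
```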

    Then we can update the parameters:

    \[\theta_j=\theta_j+\alpha(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}\]
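
    A sketch of that stochastic update for one example (toy values of my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(theta, x_i, y_i, alpha=0.1):
    # theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij, for all j at once
    error = y_i - sigmoid(np.dot(theta, x_i))
    return theta + alpha * error * x_i

theta = np.zeros(2)
x_i = np.array([1.0, 2.0])     # intercept feature plus one real feature
y_i = 1
theta = sgd_step(theta, x_i, y_i)
print(theta)                   # [0.05, 0.1]: nudged toward predicting y = 1 here
```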

    Here we used maximum likelihood to obtain the update rule. In general, though, we prefer to minimize an objective function, so we can negate the log likelihood; the resulting objective is called the logistic loss. This gives another way to understand the same rule.

    The loss on a single sample can be formulated as follows:

    \[cost(h_{\theta}(x),y)=\left\{\begin{aligned} -\log(h_{\theta}(x))\quad &\text{if } y=1\\ -\log(1-h_{\theta}(x))\quad &\text{if } y=0 \end{aligned}\right.\]

    If \(y=1\) and the prediction is 1, the loss is 0; if \(y=1\) and the prediction is 0, the loss is \(+\infty\), a huge penalty for a completely wrong prediction. The case \(y=0\) is symmetric.

    We can unify the two cases into a single expression; averaging it over the whole training set gives the cost function \(J(\theta)\):

    \[cost(h_{\theta}(x),y)=-y\log(h_{\theta}(x))-(1-y)\log(1-h_{\theta}(x))\\ J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]\]
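
    A small numeric illustration (values of my own) of the per-example cost and how it blows up for a confident wrong prediction:

```python
import numpy as np

def cost_single(h, y):
    # -log(h) if y == 1, -log(1 - h) if y == 0
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

print(cost_single(0.99, 1))   # confident and correct: ~0.01
print(cost_single(1e-8, 1))   # confident and wrong:   ~18.4, growing toward +inf as h -> 0
print(cost_single(0.01, 0))   # the y = 0 case is symmetric: ~0.01
```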

    The reason we don't use the MSE loss as in Linear Regression is that the resulting \(J(\theta)\) would be non-convex and hard to optimize to the global optimum, whereas the logistic loss is convex.

    To make life easier again, we can write the cost function in vectorized form:

    \[h = g(X\theta),\quad J(\theta) = \frac{1}{m}\cdot\left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right)\]
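
    A vectorized sketch of \(J(\theta)\), again on toy data of my own (the first column of `X` is the intercept feature):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = (1/m) * ( -y^T log(h) - (1 - y)^T log(1 - h) ),  h = g(X theta)
    m = len(y)
    h = sigmoid(X @ theta)
    return (-(y @ np.log(h)) - ((1 - y) @ np.log(1 - h))) / m

X = np.array([[1.0, -0.3], [1.0, 0.2], [1.0, 0.5],
              [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0])
print(cost(np.zeros(2), X, y))   # log(2) ~ 0.693 at theta = 0
```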

    Our goal is then to minimize \(J(\theta)\), obtain appropriate parameters \(\theta\), and use \(h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}\) to make predictions.

    Since there is no simple closed-form solution, we again use Gradient Descent to minimize the loss numerically. The update rule is the same as the one above:

    \[\theta_j=\theta_j+\alpha\frac{1}{m}\sum_{i=1}^{m}(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}\]

    Note that all \(\theta_j\) must be updated simultaneously when you implement this. Again, the vectorized version:

    \[\theta=\theta-\frac{\alpha}{m}X^T[g(X\theta)-y]\]
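
    A minimal batch gradient descent sketch using this vectorized update (the hyperparameters `alpha` and `iters` are arbitrary choices of mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    # Repeated vectorized update: theta := theta - (alpha / m) * X^T (g(X theta) - y)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        theta -= (alpha / m) * (X.T @ (sigmoid(X @ theta) - y))
    return theta

X = np.array([[1.0, -0.3], [1.0, 0.2], [1.0, 0.5],
              [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(sigmoid(X @ theta))    # fitted probabilities for the six training examples
```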

    It is the same formula as in Linear Regression, except that \(h_\theta(x)\) is different.

    Newton's Method

    Besides gradient ascent, Newton's method can also be used to maximize \(l(\theta)\).

    Most people first encounter Newton's method when finding a zero of an equation \(f(\theta)=0\); its update rule is:

    \[\theta=\theta-\frac{f(\theta)}{f'(\theta)}\]

    This rule can be understood as repeatedly approximating \(f\) with a linear function (its tangent line) and taking the zero of that linear approximation as the next iterate \(\theta\):
    (Figure: one Newton step, intersecting the tangent line with the horizontal axis.)
    Combined with a bit of high-school calculus: the first derivative of \(l(\theta)\) is 0 at its maximum, so we only need to solve \(l'(\theta)=0\) to find the corresponding \(\theta\):

    \[\theta=\theta-\frac{l'(\theta)}{l''(\theta)}\]
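
    A toy scalar illustration (my own example, not from the post): maximizing \(l(\theta)=\log\theta-\theta\), whose maximum is at \(\theta=1\), with exactly this update:

```python
# Maximize l(theta) = log(theta) - theta (maximum at theta = 1)
# with the Newton update theta := theta - l'(theta) / l''(theta).
theta = 0.5
for i in range(5):
    l_prime = 1.0 / theta - 1.0          # l'(theta)
    l_double_prime = -1.0 / theta ** 2   # l''(theta)
    theta = theta - l_prime / l_double_prime
    print(i, theta)                      # 0.75, 0.9375, ... -> converges to 1 quickly
```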

    Since \(\theta\) in logistic regression is a vector rather than a scalar, the update rule needs a small modification:

    \[\theta=\theta-H^{-1}\nabla_{\theta}l(\theta)\]

    where the entries of the Hessian matrix are \(H_{ij}=\frac{\partial^2 l(\theta)}{\partial \theta_i\partial \theta_j}\).

    Newton's method usually converges much faster than gradient ascent because it exploits the second-order information of \(l(\theta)\), but storing the Hessian and solving systems with \(H^{-1}\) can be expensive.
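
    A sketch of Newton's method for logistic regression using the standard closed form \(H=-X^TSX\) with \(S=\mathrm{diag}(h^{(i)}(1-h^{(i)}))\) (this expression for \(H\) is not derived in the post, but follows from differentiating the gradient once more); the toy data is my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=10):
    # Newton's method on the log likelihood l(theta):
    #   gradient      = X^T (y - h)
    #   Hessian  H    = -X^T S X, with S = diag(h * (1 - h))
    #   update  theta := theta - H^{-1} * gradient
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        H = -(X.T * (h * (1 - h))) @ X
        theta = theta - np.linalg.solve(H, grad)
    return theta

X = np.array([[1.0, -0.3], [1.0, 0.2], [1.0, 0.5],
              [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 0.0, 1.0, 1.0])
print(sigmoid(X @ newton_logistic(X, y)))   # typically matches gradient descent in far fewer steps
```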

  • Original post: https://www.cnblogs.com/EIMadrigal/p/12130859.html