  • Machine Learning--week3 Logistic Regression (Classification), Decision Boundary, Logistic Regression Cost Function, Multiclass Classification, and Regularization (for Logistic and Linear Regression)

    Classification

    It's not a good idea to use linear regression for a classification problem.

    We can use the logistic regression algorithm, which is a classification algorithm.

    To ensure \(0 \le h_\theta(x) \le 1\), we just need to use the sigmoid function (also called the logistic function):

    \[\large h_\theta(x) = g(\theta^Tx), \quad \text{where}\; g(z) = \frac{1}{1+e^{-z}} \]

    The meaning of \(h_\theta(x)\): \(h_\theta(x)\) = estimated probability that \(y = 1\) on input \(x\)

    Note: when \(z=0\), \(g(z)\) equals exactly 0.5
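
    A minimal Octave sketch of this function (the helper name sigmoid is my own choice and is assumed by later snippets; it is not defined in the original notes):

    function g = sigmoid(z)
        % element-wise logistic function g(z) = 1 / (1 + e^(-z));
        % works on scalars, vectors and matrices
        g = 1 ./ (1 + exp(-z));
    end

    For example, sigmoid(0) returns 0.5, matching the note above.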

    Decision Boundary

    \(h_\theta(x) = P(y=1|x;\theta)\) (where \(P\) denotes the predicted probability)

    In the lecture example, if \(h_\theta(x) \ge 0.5\) we predict \(y=1\), else \(y=0\).

    Suppose \(\theta = \begin{bmatrix}-3\\ 1\\ 1 \end{bmatrix}\), so that \(h_\theta(x)=g(-3+x_1+x_2)\)

    Since "\(y=1\)" == "\(h_\theta(x) \ge 0.5\)" == "\(\theta^Tx \ge 0\)" == "\(-3+x_1+x_2 \ge 0\)",

    we obtain "\(y=1\)" == "\(x_1+x_2 \ge 3\)"

    The relation between \(x_1+x_2\) and \(3\) determines the value of \(y\); this is the decision boundary.

    Extending to a non-linear decision boundary:

    We can also have: Predict "\(y=1\)" if \(-1+x_1^2+x_2^2 \ge 0\) (with \(\theta = \begin{bmatrix}-1\\ 0\\ 0 \\ 1\\ 1 \end{bmatrix},\; x = \begin{bmatrix}x_0\\ x_1\\ x_2\\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix}1\\ x_1\\ x_2\\ x_1^2 \\ x_2^2 \end{bmatrix}\))

    Different choices of \(\theta\) and different constructions of \(x\) yield decision boundaries of various shapes.

    The decision boundary is determined by the choice of the parameters \(\theta\), not by the training set itself.

    We use the training set to fit the parameters \(\theta\).

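    As a small Octave illustration of this decision rule (the names theta and X are my own; X is assumed to be an m x 3 design matrix whose first column is all ones):

    theta = [-3; 1; 1];              % the example parameters above
    predictions = (X * theta) >= 0;  % logical vector: 1 exactly where x1 + x2 >= 3, i.e. predict y = 1
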
    Cost Function

    \[J(\theta) =\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)}) \]

    In the earlier linear regression, the cost term used was \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\)

    But that is not universally applicable: when the hypothesis function \(h_\theta(x)\) is no longer a linear function, using \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x)-y)^2\) would give a \(J(\theta)\) with many local optima, rather than the convex function we want.

    Logistic Regression Cost Function

    \[Cost(h_\theta(x),y) = \begin{cases} {-\log(h_\theta(x))} &\quad \text{if } y = 1 \\ {-\log(1-h_\theta(x))} &\quad \text{if } y = 0 \end{cases} \]

    When \(h_\theta(x)=y\), \(Cost(h_\theta(x),y)=0\);

    when \(y=1\) and \(h_\theta(x)\rightarrow 0\), \(Cost\rightarrow\infty\), in which case \(\theta^Tx\rightarrow-\infty\);

    when \(y=0\) and \(h_\theta(x)\rightarrow 1\), \(Cost\rightarrow\infty\), in which case \(\theta^Tx\rightarrow\infty\).

    This guarantees that adjusting \(\theta\) pushes \(h_\theta(x)\) toward \(y\), i.e. the predictions agree better with the actual labels.

    The \(Cost\) function above can also be written as:

    \[Cost(h_\theta(x),y) = -y\cdot \log(h_\theta(x))-(1-y)\cdot \log(1-h_\theta(x)) \]

    This is equivalent to the case-by-case form above.

    Therefore:

    \[\begin{align} J(\theta) &=\frac{1}{m}\sum_{i=1}^{m}Cost(h_\theta(x^{(i)}),y^{(i)})\\ &= -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\cdot \log(h_\theta(x^{(i)}))+(1-y^{(i)})\cdot \log(1-h_\theta(x^{(i)}))] \end{align} \]

    The general form of the gradient descent algorithm is still the same as for linear regression (of course, once \(h_\theta(x)\) is expanded they are no longer the same):

    \[\begin{align}&\text{Repeat}\{ \\ &\qquad \theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\\ &\} \end{align} \]
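
    A vectorized Octave sketch of this loop (a minimal illustration, assuming X is the m x (n+1) design matrix with a leading column of ones, y is the m x 1 label vector, and sigmoid is defined as above; the values of alpha and num_iters are arbitrary):

    m = length(y);
    theta = zeros(size(X, 2), 1);           % initial parameters
    alpha = 0.01;                           % learning rate (illustrative)
    num_iters = 400;                        % number of iterations (illustrative)
    for iter = 1:num_iters
        h = sigmoid(X * theta);             % m x 1 vector of predictions
        grad = (1/m) * (X' * (h - y));      % gradient of J(theta)
        theta = theta - alpha * grad;       % simultaneous update of all theta_j
    end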

    Other Optimization Algorithms

    • Conjugate Gradient
    • BFGS (Broyden–Fletcher–Goldfarb–Shanno algorithm)
    • L-BFGS (Limited-memory BFGS)

    Advantages:

    • no need to manually pick \(\alpha\)
    • Often faster than gradient descent

    Disadvantage:

    • More complex

    It is not recommended to write these yourself, but... you can simply call a library:

    %{
    % template for the function definition: return the cost J(theta) in 'jVal' and the partial derivatives in 'gradient'
    function [jVal, gradient] = costFunction(theta)
        jVal = [code to compute J(theta)];
        gradient = zeros(n+1,1);
        gradient(1) = [code to compute ∂[J(theta)]/∂[theta(0)]];
        gradient(2) = [code to compute ∂[J(theta)]/∂[theta(1)]];
        ...
        gradient(n+1) = [code to compute ∂[J(theta)]/∂[theta(n)]];    % indexing in Octave starts from 1
    %}
    
    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(2,1);
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
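
    A concrete (unregularized) logistic-regression version of that template might look like the following sketch; X, y and the sigmoid helper are assumed to exist and are not part of the original template, and an anonymous function is used to pass them into fminunc:

    function [jVal, gradient] = costFunction(theta, X, y)
        m = length(y);
        h = sigmoid(X * theta);                                   % m x 1 predicted probabilities
        jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));    % J(theta) for logistic regression
        gradient = (1/m) * (X' * (h - y));                        % (n+1) x 1 partial derivatives
    end
    
    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(size(X, 2), 1);
    [optTheta, jValOpt, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);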
    

    Multiclass Classification:

    Use the one-vs-all (one-vs-rest) idea.

    For each class, split the data into "this class" versus "the set of all remaining classes", and then use the classification method from the earlier lectures to fit a classifier for that class.

    (a classifier is just a hypothesis)

    In the end we obtain \(n\) classifiers, where \(n\) is the total number of classes and \(y\) is the class label:

    \[h_\theta^{(i)}(x) = P(y=i|x;\theta)\qquad (i=1,2,3,\dots,n) \]

    That is, given \(x\) and \(\theta\), \(h_\theta^{(i)}(x)\) computes the probability that the class is \(i\).

    Then, when a new input \(x\) arrives, the prediction is the class \(i\) that maximizes \(h_\theta^{(i)}(x)\), i.e. \(\max_i h_\theta^{(i)}(x)\) (see the sketch below).
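
    A minimal Octave sketch of that prediction step (assuming all_theta is an n_classes x (n+1) matrix whose i-th row holds the fitted \(\theta\) of classifier \(i\), X has a leading column of ones, and sigmoid is as above; these names are my own):

    probs = sigmoid(X * all_theta');              % m x n_classes matrix, column i = h_theta^(i)(x)
    [maxProb, predictions] = max(probs, [], 2);   % for each example, pick the class with the largest probability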

    Regularization

    Regularization addresses the problem of overfitting; another term for this problem is high variance.

    It is caused by having too many features combined with too little training data.

    If we have too many features, the learned hypothesis may fit the training set very well (\(J(\theta) \approx 0\)) yet fail to generalize to new examples.

    generalize:  how well a hypothesis applies even to new examples

    Options to address overfitting:

    • Reduce the number of features:
      • Manually select which features to keep
      • Model selection algorithm
    • Regularization:
      • Keep all the features, but reduce the magnitude/values of the parameters \(\theta_j\)
      • Works well when we have a lot of features, each of which contributes a bit to predicting \(y\)

    Regularized Linear Regression

    The idea behind regularization:

    Small values for the parameters \(\theta_0, \theta_1,\dots,\theta_n\):

    • "Simpler" hypothesis
    • Less prone to overfitting

    In other words, parameters \(\theta_j\) with too much influence are made very small, e.g. \(\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4 \approx \theta_0 + \theta_1x + \theta_2x^2\)

    Gradient Descent

    However, this regularization is not carried out inside \(h_\theta(x)\); it is carried out in the cost function \(J(\theta)\):

    \[\large J(\theta) =\frac{1}{2m} [\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2 ] \]

    Note that the added term (called the regularization term) starts its sum at \(j=1\), so it shrinks every parameter except \(\theta_0\). \(\lambda\) is called the regularization parameter and controls the trade-off between two different goals.

    The two \(\sum\) terms in this cost function represent two different goals:

    • 使假设更好地拟合数据(fit the training data well)
    • 保持参数值较小(keep the parameters small)

    Smaller parameter values yield a simpler hypothesis, which helps avoid overfitting.

    Note: \(\lambda\) must not be too large, otherwise \(\theta_1,\dots,\theta_n \approx 0\), and the model fails to fit even the training set: too high bias, i.e. underfitting.

    \[\begin{align} &\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\ &\qquad \theta_{0} := \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j} := \theta_{j} - \alpha[\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1,2,\dots,n)\\ &\} \end{align} \]

    which is equivalent to

    \[\begin{align} &\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\ &\qquad \theta_{0} := \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j} := \theta_{j}(1-\alpha\frac{\lambda}{m}) - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)}\qquad (j = 1,2,\dots,n)\\ &\} \end{align} \]
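
    A vectorized Octave sketch of one such regularized update step (my own illustration; X, y, theta, alpha and lambda are assumed to be defined, with X containing a leading column of ones):

    m = size(X, 1);
    h = X * theta;                             % linear regression hypothesis
    grad = (1/m) * (X' * (h - y));             % unregularized gradient
    reg = (lambda/m) * theta;
    reg(1) = 0;                                % do not shrink theta_0 (index 1 in Octave)
    theta = theta - alpha * (grad + reg);      % one gradient descent step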

    Normal Equation

    Review: the earlier normal equation was \(\theta = (X^TX)^{-1}X^Ty\)

    It becomes \(\theta = (X^TX+\lambda \small{\begin{bmatrix}0 \\&1 \\ &&1\\&&&\ddots\\&&&&1 \end{bmatrix}})^{-1}X^Ty,\quad \large\text{if }\lambda \gt 0\)

    As for the non-invertible/degenerate matrix issue, Octave's pinv() can still be used to take the pseudo-inverse.

    But as long as \(\lambda\) is strictly greater than 0, it can be shown that the matrix sum inside the parentheses is invertible.
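
    In Octave, this regularized normal equation can be sketched as follows (X, y and lambda are assumed; the column of ones is included in X):

    n = size(X, 2) - 1;              % number of features (excluding the bias column)
    L = eye(n + 1);
    L(1, 1) = 0;                     % zero out the entry corresponding to theta_0
    theta = pinv(X' * X + lambda * L) * X' * y;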

    Regularized Logistic Regression

    Review: \(J(\theta) = -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\, \log\,h_\theta(x^{(i)})+(1-y^{(i)})\, \log\,(1-h_\theta(x^{(i)}))]\)

    The treatment is the same as for linear regression: add a regularization term \(\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\) at the end of the expression

    \[J(\theta) = -\frac{1}{m}[\sum_{i=1}^{m}y^{(i)}\, \log\,h_\theta(x^{(i)})+(1-y^{(i)})\, \log\,(1-h_\theta(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \]

    Gradient descent (the general form is the same as for linear regression; the only difference, again, is \(h_\theta(x^{(i)})\)):

    \[\begin{align} &\text{repeat until convergence}\{\qquad\qquad\qquad\qquad\qquad\\ &\qquad \theta_{0} := \theta_{0} - \alpha\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_0^{(i)} \\ &\qquad \theta_{j} := \theta_{j} - \alpha[\frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\theta_j] \qquad (j = 1,2,\dots,n)\\ &\} \end{align} \]

    In Octave, the earlier code template still works; just be careful to add the partial derivative of the regularization term when computing \(\frac{\partial J(\theta)}{\partial \theta_j}\;(j=1,2,\dots,n)\):

    %{
    % template for the function definition: return the cost J(theta) in 'jVal' and the partial derivatives in 'gradient'
    function [jVal, gradient] = costFunction(theta)
        jVal = [code to compute J(theta)];
        gradient = zeros(n+1,1);
        gradient(1) = [code to compute ∂[J(theta)]/∂[theta(0)]];
        gradient(2) = [code to compute ∂[J(theta)]/∂[theta(1)]];
        ...
        gradient(n+1) = [code to compute ∂[J(theta)]/∂[theta(n)]];    % indexing in Octave starts from 1
    %}
    
    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(2,1);
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
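
    For the regularized case, the cost and gradient in that template can be filled in like this sketch (the name costFunctionReg is my own; X, y, lambda and sigmoid are assumed to be available):

    function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
        m = length(y);
        h = sigmoid(X * theta);
        reg = (lambda/(2*m)) * sum(theta(2:end).^2);                     % regularization term, skipping theta_0
        jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) + reg;
        gradient = (1/m) * (X' * (h - y));
        gradient(2:end) = gradient(2:end) + (lambda/m) * theta(2:end);   % regularize the partial derivatives for j >= 1
    end

    This plugs into fminunc via the same anonymous-function pattern as before, e.g. fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options).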
    