Classification
It's not a good idea to use linear regression for a classification problem.
Instead, we can use the logistic regression algorithm, which is a classification algorithm.
To keep \(0 \le h_\theta(x) \le 1\), we just use the sigmoid function (also called the logistic function): \(g(z) = \frac{1}{1+e^{-z}}\), so \(h_\theta(x) = g(\theta^Tx)\).
The meaning of \(h_\theta(x)\): \(h_\theta(x)\) = estimated probability that \(y = 1\) on input \(x\).
Note: when \(z = 0\), \(g(z)\) is exactly 0.5.
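A minimal Octave sketch of the sigmoid function (the function name sigmoid is just this sketch's choice):

% Sigmoid (logistic) function: maps any real z into the interval (0, 1)
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % element-wise, so z can be a scalar, vector, or matrix
end

With it, the hypothesis for a single example x (including x_0 = 1) is computed as sigmoid(theta' * x).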
Decision Boundary
\(h_\theta(x) = P(y = 1 \mid x;\, \theta)\) (\(P\) is the predicted probability).
In the example from the lecture: predict \(y = 1\) if \(h_\theta(x) \ge 0.5\), else \(y = 0\).
Suppose \(\theta = \begin{bmatrix}-3 \\ 1 \\ 1\end{bmatrix}\); then \(h_\theta(x) = g(-3 + x_1 + x_2)\).
Since "\(y=1\)" == "\(h_\theta(x) \ge 0.5\)" == "\(\theta^Tx \ge 0\)" == "\(-3 + x_1 + x_2 \ge 0\)",
we get "\(y=1\)" == "\(x_1 + x_2 \ge 3\)".
The relationship between \(x_1 + x_2\) and \(3\) determines the value of \(y\); this is the decision boundary.
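A small Octave sketch of the prediction rule in this example (the sample point is made up for illustration):

theta = [-3; 1; 1];                  % theta from the example above
x = [1; 1; 2.5];                     % [x0; x1; x2] with x0 = 1, so x1 + x2 = 3.5
h = 1 / (1 + exp(-(theta' * x)));    % h_theta(x) = g(theta' * x)
prediction = (h >= 0.5);             % 1 here, since x1 + x2 = 3.5 >= 3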
Extending to a non-linear decision boundary:
We can also have: predict "\(y=1\)" if \(-1 + x_1^2 + x_2^2 \ge 0\) (with \(\theta = \begin{bmatrix}-1 \\ 0 \\ 0 \\ 1 \\ 1\end{bmatrix},\; x = \begin{bmatrix}x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_4\end{bmatrix} = \begin{bmatrix}1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_2^2\end{bmatrix}\)).
Different choices of \(\theta\) and different constructions of \(x\) give decision boundaries of various shapes.
The decision boundary is determined by the choice of the parameters \(\theta\), not directly by the training set.
We use the training set to fit the parameters \(\theta\).
Cost Function
In the earlier linear regression, the cost used was \(Cost(h_\theta(x), y) = \frac{1}{2}(h_\theta(x) - y)^2\).
But that does not carry over: when the hypothesis \(h_\theta(x)\) is no longer linear, keeping \(Cost(h_\theta(x), y) = \frac{1}{2}(h_\theta(x) - y)^2\) makes \(J(\theta)\) have many local optima instead of being the convex function we want.
Logistic Regression Cost Function
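For a single training example, the cost is:

$Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$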
When \(h_\theta(x) = y\), \(Cost(h_\theta(x), y) = 0\);
when \(y = 1\) and \(h_\theta(x) \rightarrow 0\), \(Cost \rightarrow \infty\) (at this point \(\theta^Tx \rightarrow -\infty\));
when \(y = 0\) and \(h_\theta(x) \rightarrow 1\), \(Cost \rightarrow \infty\) (at this point \(\theta^Tx \rightarrow \infty\)).
This guarantees that adjusting \(\theta\) pushes \(h_\theta(x)\) toward \(y\), i.e., the predictions agree better with the actual labels.
The \(Cost\) function above can also be written as:
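$Cost(h_\theta(x), y) = -y\,\log(h_\theta(x)) - (1 - y)\,\log(1 - h_\theta(x))$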
This is equivalent to the case-by-case form above.
So:
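$J(\theta) = \frac{1}{m}\sum_{i=1}^{m} Cost(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)}\,\log\,h_\theta(x^{(i)}) + (1 - y^{(i)})\,\log\,(1 - h_\theta(x^{(i)}))\right]$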
The general form of the Gradient Descent algorithm is still the same as for linear regression (though of course it differs once \(h_\theta(x)\) is expanded):
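Repeat until convergence:

$\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

(simultaneously updating all \(\theta_j\); here \(h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\))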
Other Optimization Algorithms
- Conjugate Gradient
- BFGS (Broyden–Fletcher–Goldfarb–Shanno algorithm)
- L-BFGS (Limited-memory BFGS)
Advantages:
- No need to manually pick \(\alpha\)
- Often faster than gradient descent
Disadvantage:
- More complex
It is not recommended to implement these yourself, but... you can simply call a library.
%{
% Template for a cost function definition: return J(theta) in 'jVal'
% and the vector of partial derivatives in 'gradient'.
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient = zeros(n+1, 1);
  gradient(1) = [code to compute ∂J(theta)/∂theta_0];
  gradient(2) = [code to compute ∂J(theta)/∂theta_1];
  ...
  gradient(n+1) = [code to compute ∂J(theta)/∂theta_n];   % indices in Octave start from 1, so theta_j lives in theta(j+1)
end
%}
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
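As a concrete illustration, here is one way the template might be filled in for (unregularized) logistic regression; X, y, and the vectorized sigmoid expression are assumptions of this sketch, not part of the original template:

function [jVal, gradient] = costFunction(theta, X, y)
  % X: m x (n+1) design matrix with a leading column of ones; y: m x 1 labels in {0,1}
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                          % h_theta for every example
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));     % J(theta)
  gradient = (1/m) * (X' * (h - y));                         % all partial derivatives at once
end

% Usage, wrapping the extra arguments so fminunc sees a function of theta only:
% [optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);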
Multiclass Classification:
Use the one-vs-all (one-vs-rest) idea.
For each class, split the data into two groups, "this class" and "the set of all remaining classes", then use the classification method from the earlier lectures to fit a classifier for this class.
(A classifier here is just a hypothesis.)
In the end we get \(n\) classifiers, where \(n\) is the total number of classes and \(y\) is the class label: \(h_\theta^{(i)}(x) = P(y = i \mid x;\, \theta), \quad i = 1, \dots, n\)
In other words, given \(x\) and \(\theta\), \(h_\theta^{(i)}(x)\) gives the probability that the class is \(i\).
Then, when a new input \(x\) comes in, the prediction is the class \(i\) that maximizes \(h_\theta^{(i)}(x)\), i.e., \(\max\limits_i\, h_\theta^{(i)}(x)\).
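A sketch of the one-vs-all prediction step in Octave (storing the fitted classifiers as rows of allTheta is just this sketch's convention):

% allTheta: one row of parameters per class; x: feature vector with x0 = 1
probs = 1 ./ (1 + exp(-(allTheta * x)));   % h_theta^(i)(x) for every class i
[~, predictedClass] = max(probs);          % pick the class with the highest probability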
Regularization
Regularization addresses the problem of overfitting; another term for this problem is high variance.
It is caused by too many features combined with too little training data.
If we have too many features, the learned hypothesis may fit the training set very well (\(J(\theta) \approx 0\)) but fail to generalize to new examples.
Generalize: how well a hypothesis applies even to new examples.
Options to address overfitting:
- Reduce the number of features:
    - Manually select which features to keep
    - Model selection algorithm
- Regularization:
    - Keep all the features, but reduce the magnitude/values of the parameters \(\theta_j\)
    - Works well when we have a lot of features, each of which contributes a bit to predicting \(y\)
Regularized Linear Regression
The idea of regularization:
Small values for the parameters \(\theta_0, \theta_1, \dots, \theta_n\) give:
- A "simpler" hypothesis
- Less prone to overfitting
That is, shrink the \(\theta_j\) whose influence is too large, e.g. \(\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4 \approx \theta_0 + \theta_1x + \theta_2x^2\) when \(\theta_3\) and \(\theta_4\) are close to 0.
Gradient Descent
However, this regularization is not applied inside \(h_\theta(x)\); it is applied in the cost function \(J(\theta)\):
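$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$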
Note that the added term (called the regularization term) starts at \(j = 1\); it shrinks every parameter except \(\theta_0\). \(\lambda\) is called the regularization parameter and controls the balance between two different goals.
In this cost function, the two summation terms represent those two goals:
- Fit the training data well
- Keep the parameters small
Smaller parameter values give a simpler hypothesis and thus help avoid overfitting.
Note: \(\lambda\) must not be too large, or \(\theta_1, \dots, \theta_n \approx 0\) and the model fails to fit even the training set: too high bias, i.e., underfitting.
In other words, the hypothesis degenerates to roughly \(h_\theta(x) \approx \theta_0\), a flat line that underfits the data.
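With this cost function, the gradient descent update becomes (note that \(\theta_0\) is not shrunk), repeating until convergence:

$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$

$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad (j = 1, \dots, n)$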
Normal Equation
Review: the Normal Equation from before was \(\theta = (X^TX)^{-1}X^Ty\).
With regularization it becomes \(\theta = \left(X^TX + \lambda \begin{bmatrix}0 \\ & 1 \\ && 1 \\ &&& \ddots \\ &&&& 1\end{bmatrix}\right)^{-1}X^Ty, \quad \text{if } \lambda \gt 0\)
As for the problem of a non-invertible/degenerate matrix, you can still use pinv() in Octave
to take the pseudo-inverse;
but as long as \(\lambda\) is strictly greater than 0, it can be shown that the sum of the two matrices inside the parentheses is invertible.
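A minimal Octave sketch of this regularized normal equation (X, y, n, and lambda are assumed to already exist):

% L is the (n+1) x (n+1) identity with the (1,1) entry zeroed out,
% so theta_0 (stored in theta(1)) is not regularized
L = eye(n + 1);
L(1, 1) = 0;
theta = pinv(X' * X + lambda * L) * X' * y;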
Regularized Logistic Regression
Review: $J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)}\,\log\,h_\theta(x^{(i)}) + (1 - y^{(i)})\,\log\,(1 - h_\theta(x^{(i)}))\right]$
The treatment is the same as for Linear Regression: append a regularization term \(\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\) to the end of the expression.
Gradient Descent: the general form is the same as for regularized Linear Regression above; the only difference is again \(h_\theta(x^{(i)})\).
In Octave, the earlier code template still works; just remember to add the partial derivative of the regularization term when computing \(\frac{\partial J(\theta)}{\partial \theta_j}\) (\(j = 1, 2, \dots, n\)), as in the sketch after the template below.
%{
% Template for a cost function definition: return J(theta) in 'jVal'
% and the vector of partial derivatives in 'gradient'.
function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(theta)];
  gradient = zeros(n+1, 1);
  gradient(1) = [code to compute ∂J(theta)/∂theta_0];
  gradient(2) = [code to compute ∂J(theta)/∂theta_1];
  ...
  gradient(n+1) = [code to compute ∂J(theta)/∂theta_n];   % indices in Octave start from 1, so theta_j lives in theta(j+1)
end
%}
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
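As an illustration of that note, a regularized version of the earlier sketch might look like this (the name costFunctionReg, plus X, y, and lambda, are assumptions of the sketch):

function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  % X: m x (n+1) design matrix with a leading column of ones; y: m x 1 labels in {0,1}
  m = length(y);
  h = 1 ./ (1 + exp(-(X * theta)));                                 % sigmoid hypothesis
  reg = (lambda / (2*m)) * sum(theta(2:end) .^ 2);                  % regularization term, skipping theta(1) (= theta_0)
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h)) + reg;
  gradient = (1/m) * (X' * (h - y));
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);  % add the regularization term's derivative
end

% [optTheta, functionVal, exitFlag] = fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options);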