Logistic Regression
Email: Spam / Not Spam?
Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign?
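\(y \in \{0, 1\}\): the variable \(y\) to be predicted takes one of two values, \(0\) or \(1\).
\(0\): "Negative Class" (e.g., benign tumor)
\(1\): "Positive Class" (e.g., malignant tumor)
If \(h_\theta(x) \geq 0.5\), predict "\(y = 1\)"
If \(h_\theta(x) < 0.5\), predict "\(y = 0\)"
In a classification problem \(y\) can only be \(0\) or \(1\), yet a linear regression hypothesis \(h_\theta(x)\) can output values \(> 1\) or \(< 0\).
--> Logistic regression: \(0 \leq h_\theta(x) \leq 1\) (despite the word "regression" in its name, it is a classification algorithm).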
Hypothesis Representation
Logistic Regression Model
Want \(0 \leq h_\theta(x) \leq 1\)
--> Let \(h_\theta(x) = g(\theta^T x)\), where \(g(z) = \frac{1}{1 + e^{-z}}\) is called the sigmoid function or logistic function (which is where the name "logistic regression" comes from).
--> \(h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}\)
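The hypothesis above can be sketched in a few lines of Python using only the standard library. The function names `sigmoid` and `hypothesis` and the sample values of \(\theta\) and \(x\) are illustrative assumptions, not part of the course material.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For very negative z, exp(-z) would overflow, so use the equivalent form.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must already include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical parameters and a single feature (x_0 = 1, x_1 = 2.0).
theta = [-1.0, 0.5]
x = [1.0, 2.0]
print(hypothesis(theta, x))  # theta^T x = 0, so this prints 0.5
```

Splitting the sigmoid into two branches is a design choice made here for numerical safety; the single formula \(\frac{1}{1 + e^{-z}}\) is mathematically identical.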
Interpretation of Hypothesis Output
\(h_\theta(x)\) = estimated probability that \(y = 1\) on input \(x\).
Example: If \(x = \left[ \begin{matrix} x_0 \\ x_1 \end{matrix} \right] = \left[ \begin{matrix} 1 \\ \text{tumorSize} \end{matrix} \right]\) and \(h_\theta(x) = 0.7\), tell the patient that there is a 70% chance of the tumor being malignant. In other words, for a patient with features \(x\), the probability that \(y = 1\) is 0.7.
\(h_\theta(x) = P(y = 1|x; \theta)\): "probability that \(y = 1\), given \(x\), parameterized by \(\theta\)". In the example above, \(x\) holds the patient's features (the tumor size).
\(P(y = 0|x; \theta) + P(y = 1|x; \theta) = 1\), so \(P(y = 0|x; \theta) = 1 - P(y = 1|x; \theta) = 1 - h_\theta(x)\).
Decision Boundary
Logistic regression
\(h_\theta(x) = g(\theta^T x)\), \(g(z) = \frac{1}{1 + e^{-z}}\).
Suppose we predict "\(y = 1\)" if \(h_\theta(x) \geq 0.5\) and predict "\(y = 0\)" if \(h_\theta(x) < 0.5\).
\(\because g(z) \geq 0.5\) when \(z \geq 0\)
\(\therefore h_\theta(x) = g(\theta^T x) \geq 0.5\) when \(\theta^T x \geq 0\).
\(\rightarrow\) Predict \(y = 1\) when \(\theta^T x \geq 0\); predict \(y = 0\) when \(\theta^T x < 0\).
Decision Boundary
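Suppose we have the example shown in the figure below, with hypothesis \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)\). Consider where it predicts "y = 1" when \(\theta = \left[ \begin{matrix} -3 \\ 1 \\ 1 \end{matrix} \right]\). It predicts \(y = 1\) when \(-3 + x_1 + x_2 \geq 0\), i.e. \(x_1 + x_2 \geq 3\), so the decision boundary is the straight line \(x_1 + x_2 = 3\).
Specifically, the points on this line are exactly those where \(h_\theta(x) = 0.5\), and the line splits the plane into two regions: one where the hypothesis predicts \(y = 1\) and one where it predicts \(y = 0\).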
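Note: the decision boundary is a property of the hypothesis, including its parameters \(\theta_0\), \(\theta_1\), \(\theta_2\); it is not a property of the data set.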
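As a quick check of the linear example, the sketch below classifies a few points with \(\theta = [-3, 1, 1]\) by testing the sign of \(\theta^T x\), which is equivalent to comparing \(h_\theta(x)\) with 0.5. The helper name `predict` and the test points are assumptions chosen for illustration.

```python
def predict(theta, x1, x2):
    """Predict y = 1 when theta^T x >= 0, which is equivalent to h_theta(x) >= 0.5."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]  # parameters from the example above

# Hypothetical test points on either side of the line x1 + x2 = 3.
for x1, x2 in [(1.0, 1.0), (2.0, 2.0), (1.5, 1.5)]:
    print((x1, x2), "->", predict(theta, x1, x2))
```

The point (1, 1) lies below the line and is labeled 0, (2, 2) lies above it and is labeled 1, and (1.5, 1.5) sits exactly on the boundary, where the \(\geq\) rule assigns 1.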
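Non-linear decision boundaries
For the example in the figure below, let the hypothesis be \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)\) and consider where it predicts y = 1 when \(\theta = \left[ \begin{matrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{matrix} \right]\). It predicts \(y = 1\) when \(-1 + x_1^2 + x_2^2 \geq 0\), i.e. \(x_1^2 + x_2^2 \geq 1\), so the decision boundary is the unit circle \(x_1^2 + x_2^2 = 1\).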
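The non-linear case works the same way once the squared terms are treated as extra features. The sketch below is one possible illustration; the helper name `predict_circle` and the two test points are assumptions.

```python
def predict_circle(theta, x1, x2):
    """Predict with features [1, x1, x2, x1^2, x2^2]; y = 1 when theta^T f >= 0."""
    features = [1.0, x1, x2, x1 ** 2, x2 ** 2]
    z = sum(t * f for t, f in zip(theta, features))
    return 1 if z >= 0 else 0

theta = [-1.0, 0.0, 0.0, 1.0, 1.0]  # parameters from the example above

print(predict_circle(theta, 0.0, 0.0))  # inside the unit circle
print(predict_circle(theta, 1.0, 1.0))  # outside the unit circle
```

The origin is inside the circle \(x_1^2 + x_2^2 = 1\) and gets \(y = 0\), while (1, 1) is outside and gets \(y = 1\).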
Cost Function
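Linear regression: \(J(\theta) = \frac{1}{m}\sum_{i=1}^m\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2\)
Let \(Cost(h_\theta(x), y) = \frac{1}{2}(h_\theta(x)-y)^2\)
Because the logistic \(h_\theta(x)\) is non-linear, \(J(\theta)\) built from this squared-error cost is non-convex (left plot below). Gradient descent is only guaranteed to reach the global minimum when \(J(\theta)\) is convex (right plot below).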
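Logistic regression cost function
\(Cost(h_\theta(x), y) = -\log(h_\theta(x))\) if \(y = 1\), and \(Cost(h_\theta(x), y) = -\log(1 - h_\theta(x))\) if \(y = 0\).
In detail, for \(y = 1\): when the hypothesis outputs \(1\) and the true label is \(1\), the cost is \(0\); but as the hypothesis output approaches \(0\) while the true label is \(1\), the cost tends to \(\infty\). The case \(y = 0\) works the same way, with the curve mirrored.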
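To make this behavior concrete, here is a small sketch of the per-example cost, assuming the usual piecewise logistic cost (\(-\log(h)\) for \(y = 1\), \(-\log(1 - h)\) for \(y = 0\)). The function name `cost` and the sample values of \(h\) are illustrative.

```python
import math

def cost(h, y):
    """Per-example cost: -log(h) when y = 1, -log(1 - h) when y = 0."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# Hypothetical hypothesis outputs for an example whose true label is y = 1.
for h in (0.99, 0.5, 0.01):
    print(f"h = {h}: cost = {cost(h, 1):.4f}")
```

The cost is close to zero when the hypothesis is confident and correct, and it grows without bound as a confident prediction turns out to be wrong.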
Simplified Cost Function and Gradient Descent
\(Cost(h_\theta(x), y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))\)
\(\rightarrow\) \(J(\theta) = \frac{1}{m}\sum_{i=1}^m Cost(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\)
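The formula for \(J(\theta)\) translates almost directly into code. The following is a minimal sketch rather than a reference implementation: the toy data set, the two parameter vectors, and the names `sigmoid` and `cost_function` are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_function(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)) over the training set."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m

# Hypothetical toy data; each row starts with x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]

print(cost_function([0.0, 0.0], X, y))   # h = 0.5 everywhere, so J = log(2)
print(cost_function([-4.5, 2.0], X, y))  # a better fit gives a smaller J
```

With \(\theta = 0\) every prediction is 0.5, so each example contributes \(\log 2\); parameters that separate the two groups reduce the cost.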
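The next step is to fit parameters \(\theta\) to the training set so that \(J(\theta)\) is as small as possible. The method used here to minimize \(J(\theta)\) is gradient descent.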
Gradient descent
Want \(\min_\theta J(\theta)\):
Repeat {
\(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)\)
}
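Want \(\min_\theta J(\theta)\):
Repeat {
\(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\) (simultaneously update all \(\theta_j\))
}
The update has the same form as the one for linear regression; only the hypothesis differs:
- Linear regression: \(h_\theta(x) = \theta^T x\)
- Logistic regression: \(h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}\)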
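A plain-Python sketch of this update, using the gradient \(\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\) and updating every \(\theta_j\) simultaneously. The toy data, the learning rate of 0.1, the iteration count, and the name `gradient_step` are assumptions picked for illustration, not recommended settings.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j using the (1/m)-scaled gradient."""
    m = len(X)
    errors = [sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
              for x_i, y_i in zip(X, y)]
    grad = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
            for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grad)]

# Hypothetical, non-separable toy data; each row starts with x_0 = 1.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 0, 1]

theta = [0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)

print(theta)
```

All errors are computed from the old \(\theta\) before any component changes, which is what makes the update simultaneous. The data is deliberately not linearly separable so that the parameters settle at finite values.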
Advanced Optimization
Optimization algorithm
Given \(\theta\), we have code that can compute \(J(\theta)\) and \(\frac{\partial}{\partial\theta_j}J(\theta)\) (for \(j = 0, 1, \ldots, n\))
Optimization algorithms:
- Gradient descent
- Conjugate gradient
  - No need to manually pick \(\alpha\)
  - Often faster than gradient descent
  - More complex (the underlying theory is hard to follow)
\(\theta = \left[ \begin{matrix} \theta_1 \\ \theta_2 \end{matrix} \right]\)
\(J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2\)
\(\frac{\partial}{\partial\theta_1}J(\theta) = 2(\theta_1 - 5)\)
\(\frac{\partial}{\partial\theta_2}J(\theta) = 2(\theta_2 - 5)\)
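function [jVal, gradient] = costFunction(theta)
  % Cost J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  % Partial derivatives of J with respect to theta_1 and theta_2
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end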
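@costFunction is a pointer to the costFunction function, which is passed to fminunc. fminunc chooses the learning rate \(\alpha\) automatically and then applies these advanced optimization algorithms, much like a souped-up gradient descent, to find the optimal \(\theta\) for you.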
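% Tell fminunc that costFunction also returns the gradient; cap at 100 iterations
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
% optTheta should come out close to [5; 5]; exitFlag reports the convergence status
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)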
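Before handing a cost function to an optimizer, it can help to check the hand-derived gradient against a finite-difference estimate. The Python sketch below mirrors the example cost \(J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2\). The names `cost_function` and `numerical_gradient` are hypothetical, and the plain gradient-descent loop with a hand-picked step size is only a stand-in for fminunc, which is not reproduced here.

```python
def cost_function(theta):
    """Return J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2 and its gradient."""
    j_val = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    gradient = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    return j_val, gradient

def numerical_gradient(theta, eps=1e-6):
    """Central-difference estimate of the gradient, for checking only."""
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += eps
        minus[j] -= eps
        grad.append((cost_function(plus)[0] - cost_function(minus)[0]) / (2 * eps))
    return grad

theta = [0.0, 0.0]
print(cost_function(theta)[1], numerical_gradient(theta))

# Plain gradient descent with a hand-picked step size (an assumption).
for _ in range(200):
    _, grad = cost_function(theta)
    theta = [t - 0.1 * g for t, g in zip(theta, grad)]
print(theta)  # approaches the minimizer [5, 5]
```

If the analytic and numerical gradients disagree noticeably, the bug is almost always in the analytic gradient, and it is cheaper to find it here than inside an optimizer run.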
Multiclass Classification: One-vs-all
Multiclass classification
Email foldering/tagging: Work, Friends, Family, Hobby.
Medical diagnosis: Not ill, Cold, Flu.
Weather: Sunny, Cloudy, Rain, Snow.
One-vs-all
Take the triangles as an example: treat them as the positive class and the other two classes as the negative class, which gives a new training set. Then fit a suitable classifier on it, denoted \(h_\theta^{(1)}(x)\).
Next, treat the squares as the positive class and the other two classes as negative, and so on, to obtain \(h_\theta^{(2)}(x)\) and \(h_\theta^{(3)}(x)\).
Train a logistic regression classifier \(h_\theta^{(i)}(x)\) for each class \(i\) to predict the probability that \(y = i\).
On a new input \(x\), to make a prediction, feed \(x\) to every trained classifier and pick the class \(i\) that maximizes \(h_\theta^{(i)}(x)\).
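The prediction step can be sketched as follows, assuming the three classifiers have already been trained. The parameter values, the two-feature input, and the name `predict_one_vs_all` are hypothetical; only the pick-the-largest-\(h\) rule comes from the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Return the class whose classifier h^(i)(x) is largest, plus all scores."""
    scores = {label: sigmoid(sum(t * v for t, v in zip(theta, x)))
              for label, theta in thetas.items()}
    return max(scores, key=scores.get), scores

# Hypothetical parameters for three already-trained classifiers (x_0 = 1 first).
thetas = {
    1: [2.0, -1.0, -1.0],
    2: [-3.0, 1.5, -0.5],
    3: [-3.0, -0.5, 1.5],
}

label, scores = predict_one_vs_all(thetas, [1.0, 4.0, 0.5])
print(label, scores)
```

The scores need not sum to one, because each classifier is trained independently; only their ordering matters for the prediction.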
Suppose that you have trained a logistic regression classifier, and it outputs on a new example \(x\) a prediction \(h_\theta(x) = 0.7\). This means (check all that apply):
- [x] Our estimate for \(P(y=1|x; \theta)\) is 0.7.
- [x] Our estimate for \(P(y=0|x; \theta)\) is 0.3.
- [ ] Our estimate for \(P(y=0|x; \theta)\) is 0.7.
- [ ] Our estimate for \(P(y=1|x; \theta)\) is 0.3.
Suppose you have the following training set, and fit a logistic regression classifier \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)\). Which of the following are true? Check all that apply.
- [x] Adding polynomial features (e.g., instead using \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)\)) could increase how well we can fit the training data. A linear decision boundary does not suit this data; adding suitable polynomial features can improve the fit.
- [x] At the optimal value of \(\theta\) (e.g., found by fminunc), we will have \(J(\theta) \geq 0\).
- [ ] Adding polynomial features (e.g., instead using \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)\)) would increase \(J(\theta)\) because we are now summing over more terms. In fact, the extra features let the optimal \(J(\theta)\) decrease, or at worst stay the same.
- [ ] If we train gradient descent for enough iterations, for some examples \(x^{(i)}\) in the training set it is possible to obtain \(h_\theta(x^{(i)}) > 1\). In fact, \(0 < h_\theta(x^{(i)}) < 1\) always holds.
- [x] \(J(\theta)\) will be a convex function, so gradient descent should converge to the global minimum.
- [ ] The positive and negative examples cannot be separated using a straight line. So, gradient descent will fail to converge.
- [ ] Because the positive and negative examples cannot be separated using a straight line, linear regression will perform as well as logistic regression on this data.
For logistic regression, the gradient is given by \(\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\). Which of these is a correct gradient descent update for logistic regression with a learning rate of \(\alpha\)? Check all that apply.
- [x] \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum^m_{i=1}\left(\frac{1}{1+e^{-\theta^T x^{(i)}}}-y^{(i)}\right)x_j^{(i)}\) (simultaneously update for all \(j\)).
- [x] \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\) (simultaneously update for all \(j\)).
- [ ] \(\theta := \theta - \alpha\frac{1}{m}\sum^m_{i=1}(\theta^T x - y^{(i)})x^{(i)}\). This is the linear regression update.
- [ ] \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\) (simultaneously update for all \(j\)).
Which of the following statements are true? Check all that apply.
- [x] The one-vs-all technique allows you to use logistic regression for problems in which each \(y^{(i)}\) comes from a fixed, discrete set of values. The class being classified is treated as positive and all others as negative.
- [x] The cost function \(J(\theta)\) for logistic regression trained with \(m \geq 1\) examples is always greater than or equal to zero. \(J(\theta) \geq 0\).
- [ ] For logistic regression, sometimes gradient descent will converge to a local minimum (and fail to find the global minimum). This is the reason we prefer more advanced optimization algorithms such as fminunc (conjugate gradient/BFGS/L-BFGS/etc). The logistic regression cost is convex, so there are no local minima to get stuck in; the actual advantages of the advanced algorithms are that there is no learning rate \(\alpha\) to pick and that they usually run faster.
- [ ] Since we train one classifier when there are two classes, we train two classifiers when there are three classes (and we do one-vs-all classification). With 3 classes we train 3 classifiers.
- [x] The sigmoid function \(g(z)=\frac{1}{1+e^{-z}}\) is never greater than one (\(>1\)). The sigmoid function takes values in \((0, 1)\).
- [ ] Linear regression always works well for classification if you classify by using a threshold on the prediction made by linear regression. Thresholding a linear regression output does not always work well: its predictions can fall outside \([0, 1]\), and the fitted line, and with it the threshold crossing, can be shifted by outlying examples.
Suppose you train a logistic classifier \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)\). Suppose \(\theta_0 = 6\), \(\theta_1 = -1\), \(\theta_2 = 0\). Which of the following figures represents the decision boundary found by your classifier?
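- [x] The figure whose decision boundary is the vertical line \(x_1 = 6\), with \(y = 1\) predicted for \(x_1 \leq 6\) (since \(6 - x_1 \geq 0\) there) and \(y = 0\) predicted for \(x_1 > 6\).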