Logistic Regression
Classification
Examples
Email: Spam / Not Spam?
Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign?
\(y \in \{0, 1\}\): the variable to predict, \(y\), can only take the values \(0\) and \(1\).
\(0\): "Negative Class" (e.g., benign tumor)
\(1\): "Positive Class" (e.g., malignant tumor)
Using linear regression for a classification problem
If \(h_\theta(x) \geq 0.5\), predict "\(y = 1\)"
If \(h_\theta(x) < 0.5\), predict "\(y = 0\)"
The problem
In a classification problem the variable to predict, \(y\), can only be \(0\) or \(1\), yet \(h_\theta(x)\) can be \(> 1\) or \(< 0\).
\(\rightarrow\) Logistic regression: \(0 \leq h_\theta(x) \leq 1\) (despite the word "regression" in its name, it is actually a classification algorithm).
Hypothesis Representation
What function do we use to represent our hypothesis in a classification problem?
Logistic Regression Model
Want \(0 \leq h_\theta(x) \leq 1\)
\(\rightarrow\) Let \(h_\theta(x) = g(\theta^T x)\), where \(g(z) = \frac{1}{1 + e^{-z}}\) is called the sigmoid function (or logistic function); this is where the name "logistic regression" comes from.
\(\rightarrow\) \(h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}\)
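A minimal Octave sketch of this hypothesis, assuming the design matrix X already contains the intercept column of ones; the function names sigmoid and hypothesis are illustrative, not taken from the course code:

```matlab
% Sigmoid (logistic) function: maps any real number into (0, 1)
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

% Hypothesis h_theta(x), vectorized over all examples:
% X is an m-by-(n+1) matrix (first column all ones), theta is (n+1)-by-1
function h = hypothesis(theta, X)
  h = sigmoid(X * theta);
end
```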
Interpretation of Hypothesis Output
- \(h_\theta(x)\) = estimated probability that \(y = 1\) on input \(x\).
- Example: If \(x = \left[ \begin{matrix} x_0 \\ x_1 \end{matrix} \right] = \left[ \begin{matrix} 1 \\ \text{tumorSize} \end{matrix} \right]\) and \(h_\theta(x) = 0.7\), we tell the patient there is a 70% chance of the tumor being malignant; that is, for a patient with features \(x\), the probability that \(y = 1\) is 0.7.
- \(h_\theta(x) = P(y = 1 \mid x; \theta)\), read as "the probability that \(y = 1\), given \(x\), parameterized by \(\theta\)". In the example above, \(x\) is the patient's feature (the tumor size).
- \(P(y = 0 \mid x; \theta) + P(y = 1 \mid x; \theta) = 1\), so \(P(y = 0 \mid x; \theta) = 1 - P(y = 1 \mid x; \theta)\).
Decision Boundary
Logistic regression
\(h_\theta(x) = g(\theta^T x)\), where \(g(z) = \frac{1}{1 + e^{-z}}\).
Suppose we predict "\(y = 1\)" if \(h_\theta(x) \geq 0.5\), and "\(y = 0\)" if \(h_\theta(x) < 0.5\).
\(\because g(z) \geq 0.5\) when \(z \geq 0\)
\(\therefore h_\theta(x) = g(\theta^T x) \geq 0.5\) when \(\theta^T x \geq 0\).
\(\rightarrow\) Predict \(y = 1\) when \(\theta^T x \geq 0\); predict \(y = 0\) when \(\theta^T x < 0\).
Decision Boundary
Suppose we have the training set shown in the figure below, with hypothesis \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)\). Predict when "\(y = 1\)" given \(\theta = \left[ \begin{matrix} -3 \\ 1 \\ 1 \end{matrix} \right]\).
With these parameters we predict \(y = 1\) whenever \(-3 + x_1 + x_2 \geq 0\), i.e., \(x_1 + x_2 \geq 3\). Visualizing the hypothesis gives the straight line shown below; this line is called the decision boundary.
Specifically, the points on this line satisfy \(h_\theta(x) = 0.5\), and the line divides the plane into two regions: the region where the hypothesis predicts \(y = 1\) and the region where it predicts \(y = 0\).
Note: the decision boundary is a property of the hypothesis and its parameters \(\theta_0, \theta_1, \theta_2\); it does not depend on the data set.
Non-linear decision boundaries
Given the example in the figure below, let the hypothesis be \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)\), and predict when \(y = 1\) given \(\theta = \left[ \begin{matrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{matrix} \right]\).
Here we predict \(y = 1\) whenever \(-1 + x_1^2 + x_2^2 \geq 0\), i.e., \(x_1^2 + x_2^2 \geq 1\), so the decision boundary is the circle of radius 1 centered at the origin. Visualizing the decision boundary:
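A hedged Octave sketch of such a visualization (not from the original notes): evaluate the hypothesis on a grid and draw its 0.5 level set, which is exactly the decision boundary.

```matlab
% Plot the non-linear decision boundary for theta = [-1; 0; 0; 1; 1]
theta = [-1; 0; 0; 1; 1];
[x1, x2] = meshgrid(linspace(-2, 2, 200));
z = theta(1) + theta(2)*x1 + theta(3)*x2 + theta(4)*x1.^2 + theta(5)*x2.^2;
h = 1 ./ (1 + exp(-z));              % hypothesis value at every grid point
contour(x1, x2, h, [0.5 0.5]);       % the 0.5 contour is the unit circle
xlabel('x_1'); ylabel('x_2');
```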
Cost Function
A first attempt at a cost function
Linear regression: \(J(\theta) = \frac{1}{m}\sum_{i=1}^m \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2\)
Let \(\mathrm{Cost}(h_\theta(x), y) = \frac{1}{2}(h_\theta(x) - y)^2\)
Because \(h_\theta(x)\) is non-linear, this choice makes \(J(\theta)\) non-convex (left figure below), but gradient descent is only guaranteed to reach the global minimum when \(J(\theta)\) is convex (right figure below).
So we need a new cost function.
Logistic regression cost function
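Written piecewise, the per-example cost (whose combined one-line form appears in the next subsection) is:

\[ \mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases} \]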
Detailed explanation for the \(y = 1\) case: when the hypothesis outputs \(1\) and the label is \(1\), the cost is \(0\); but when the hypothesis outputs \(0\) while the label is \(1\), the cost is \(\infty\). The \(y = 0\) case works the same way, with the plot mirrored.
Simplified Cost Function and Gradient Descent
Find a simpler way to write the cost function, and work out how to apply gradient descent to fit the parameters of logistic regression.
An equivalent form of the logistic regression cost function
\(\mathrm{Cost}(h_\theta(x), y) = -y\log(h_\theta(x)) - (1 - y)\log(1 - h_\theta(x))\)
\(\rightarrow\) \(J(\theta) = \frac{1}{m}\sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log(h_\theta(x^{(i)})) + (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\right]\)
This expression comes from the maximum likelihood principle in statistics, which gives an efficient way to find parameters for different models. It also has a very useful property: it is convex.
What we do next is fit a parameter vector \(\theta\) to the training set so that \(J(\theta)\) is minimized, and the way to minimize \(J(\theta)\) is gradient descent.
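A minimal vectorized Octave sketch of this cost function, assuming X is the m-by-(n+1) design matrix with a leading column of ones and y is the m-by-1 vector of 0/1 labels (the function name is illustrative):

```matlab
% Logistic regression cost J(theta), vectorized over the whole training set
function J = logisticCost(theta, X, y)
  m = length(y);                   % number of training examples
  h = 1 ./ (1 + exp(-X * theta));  % h_theta(x) for every example
  J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));
end
```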
Gradient descent
- Original form:
  Want \(\min_\theta J(\theta)\):
  Repeat {
  \(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)\)
  }
- Substituting the simplified cost function above (see the Octave sketch after this list):
  Want \(\min_\theta J(\theta)\):
  Repeat {
  \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)
  }
- The only difference from linear regression is the hypothesis function:
  - Linear regression: \(h_\theta(x) = \theta^T x\)
  - Logistic regression: \(h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}\)
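A hedged Octave sketch of this update rule (batch gradient descent); alpha and num_iters are illustrative hyperparameters, not values given in the notes:

```matlab
% Batch gradient descent for logistic regression
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    h = 1 ./ (1 + exp(-X * theta));    % current predictions
    grad = (1 / m) * (X' * (h - y));   % partial derivatives for every theta_j
    theta = theta - alpha * grad;      % simultaneous update of all theta_j
  end
end
```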
Advanced Optimization
Advanced optimization algorithms and ideas can speed up logistic regression considerably and make it better suited to large machine learning problems.
Optimization algorithms
Given \(\theta\), we have code that can compute \(J(\theta)\) and \(\frac{\partial}{\partial\theta_j}J(\theta)\) (for \(j = 0, 1, \ldots, n\)).
Optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
Advantages of the last three algorithms:
- No need to manually pick (alpha)
- Often faster than gradient descent
Disadvantages of the last three algorithms:
- More complex (their inner workings are hard to understand)
Advice on using these functions
If a software library already provides an implementation, call it rather than writing your own.
Example
\(\theta = \left[ \begin{matrix} \theta_1 \\ \theta_2 \end{matrix} \right]\)
\(J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2\)
\(\frac{\partial}{\partial\theta_1}J(\theta) = 2(\theta_1 - 5)\)
\(\frac{\partial}{\partial\theta_2}J(\theta) = 2(\theta_2 - 5)\)
Write the cost function:
```matlab
function [jVal, gradient] = costFunction(theta)
  % Value of the cost function J(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  % Gradient: the two partial derivatives computed above
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end
```
fminunc is Octave's built-in advanced optimization function for unconstrained minimization. Usage: first create an options structure holding the options you want. Setting 'GradObj' to 'on' turns the gradient objective parameter on, which tells the algorithm that you will supply a gradient; 'MaxIter' sets the maximum number of iterations (100 in the example below). The @ symbol creates a function handle (pointer) to the costFunction just defined. fminunc then picks a learning rate \(\alpha\) automatically and applies these advanced optimization algorithms, like a souped-up version of gradient descent, to find the best \(\theta\) for you.
Run the code:
```matlab
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)
```
functionVal is the final value of the cost function, and exitFlag indicates whether the algorithm converged.
For more details, run help fminunc.
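To apply fminunc to logistic regression itself, the cost function must return both \(J(\theta)\) and its gradient. A hedged sketch under the same X/y assumptions as above; the function name and the anonymous-function wrapper are illustrative:

```matlab
% Cost and gradient for logistic regression, in the form fminunc expects
function [jVal, gradient] = logisticCostFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                            % sigmoid hypothesis
  jVal = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));   % cost J(theta)
  gradient = (1 / m) * (X' * (h - y));                       % partial derivatives
end

% Example call, passing X and y through an anonymous function:
% options = optimset('GradObj', 'on', 'MaxIter', 400);
% optTheta = fminunc(@(t) logisticCostFunction(t, X, y), initialTheta, options);
```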
Multiclass Classification: One-vs-all
Multiclass classification
Examples of multiclass classification problems:
- Email foldering/tagging: Work, Friends, Family, Hobby. Suppose you need a learning algorithm that automatically files email into folders or tags it.
- Medical diagrams: Not ill, Cold, Flu. If a patient comes to you with a stuffy nose, they may not be ill at all, or they may have a cold or the flu.
- Weather: Sunny, Cloudy, Rain, Snow. You are building a machine learning classifier for the weather, distinguishing sunny, cloudy, rainy, and snowy days.
One-vs-all
Illustration of a multiclass problem:
Approach: build new "pseudo" training sets.
Taking the triangles as an example, label them as the positive class and the other two classes as negative, creating a new training set; then fit a classifier to it, denoted \(h_\theta^{(1)}(x)\).
Next, label the squares as the positive class and the other two as negative, and so on, obtaining \(h_\theta^{(2)}(x)\) and \(h_\theta^{(3)}(x)\).
Train a logistic regression classifier \(h_\theta^{(i)}(x)\) for each class \(i\) to predict the probability that \(y = i\).
On a new input \(x\), to make a prediction, run \(x\) through each classifier and pick the class \(i\) that maximizes \(h_\theta^{(i)}(x)\).
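A hedged Octave sketch of the one-vs-all prediction step, assuming all_theta is a K-by-(n+1) matrix whose i-th row holds the parameters of classifier \(h_\theta^{(i)}\) and X is m-by-(n+1) with a leading column of ones (names are illustrative):

```matlab
% Pick, for each example, the class whose classifier outputs the highest probability
function p = predictOneVsAll(all_theta, X)
  probs = 1 ./ (1 + exp(-X * all_theta'));  % m-by-K matrix of h_theta^(i)(x)
  [~, p] = max(probs, [], 2);               % index of the most confident classifier
end
```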
Review
Quiz
- Suppose that you have trained a logistic regression classifier, and it outputs on a new example \(x\) a prediction \(h_\theta(x) = 0.7\). This means (check all that apply):
  - [x] Our estimate for \(P(y=1 \mid x; \theta)\) is 0.7.
  - [x] Our estimate for \(P(y=0 \mid x; \theta)\) is 0.3.
  - [ ] Our estimate for \(P(y=0 \mid x; \theta)\) is 0.7.
  - [ ] Our estimate for \(P(y=1 \mid x; \theta)\) is 0.3.
- Suppose you have the following training set, and fit a logistic regression classifier \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)\). Which of the following are true? Check all that apply.
  - [x] Adding polynomial features (e.g., instead using \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)\)) could increase how well we can fit the training data. (A purely linear hypothesis does not fit this data well; adding suitable polynomial features improves the fit.)
  - [x] At the optimal value of \(\theta\) (e.g., found by fminunc), we will have \(J(\theta) \geq 0\).
  - [ ] Adding polynomial features (e.g., instead using \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)\)) would increase \(J(\theta)\) because we are now summing over more terms. (It would decrease \(J(\theta)\).)
  - [ ] If we train gradient descent for enough iterations, for some examples \(x^{(i)}\) in the training set it is possible to obtain \(h_\theta(x^{(i)}) > 1\). (Always \(0 < h_\theta(x^{(i)}) < 1\).)
  - [x] \(J(\theta)\) will be a convex function, so gradient descent should converge to the global minimum.
  - [ ] The positive and negative examples cannot be separated using a straight line. So, gradient descent will fail to converge.
  - [ ] Because the positive and negative examples cannot be separated using a straight line, linear regression will perform as well as logistic regression on this data.
- For logistic regression, the gradient is given by \(\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\). Which of these is a correct gradient descent update for logistic regression with a learning rate of \(\alpha\)? Check all that apply.
  - [x] \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m\left(\frac{1}{1+e^{-\theta^T x^{(i)}}} - y^{(i)}\right)x_j^{(i)}\) (simultaneously update for all \(j\)).
  - [x] \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\) (simultaneously update for all \(j\)).
  - [ ] \(\theta := \theta - \alpha\frac{1}{m}\sum_{i=1}^m(\theta^T x - y^{(i)})x^{(i)}\). (This is the linear regression update.)
  - [ ] \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\) (simultaneously update for all \(j\)).
- Which of the following statements are true? Check all that apply.
  - [x] The one-vs-all technique allows you to use logistic regression for problems in which each \(y^{(i)}\) comes from a fixed, discrete set of values. (Label the class being separated as positive and all the others as negative.)
  - [x] The cost function \(J(\theta)\) for logistic regression trained with \(m \geq 1\) examples is always greater than or equal to zero. (\(J(\theta) \geq 0\).)
  - [ ] For logistic regression, sometimes gradient descent will converge to a local minimum (and fail to find the global minimum). This is the reason we prefer more advanced optimization algorithms such as fminunc (conjugate gradient/BFGS/L-BFGS/etc). (\(J(\theta)\) is convex, so gradient descent does not get stuck in local minima; the advanced algorithms are preferred because no learning rate \(\alpha\) has to be chosen and they are often faster.)
  - [ ] Since we train one classifier when there are two classes, we train two classifiers when there are three classes (and we do one-vs-all classification). (With 3 classes we train 3 classifiers.)
  - [x] The sigmoid function \(g(z) = \frac{1}{1+e^{-z}}\) is never greater than one. (Its range is \((0, 1)\).)
  - [ ] Linear regression always works well for classification if you classify by using a threshold on the prediction made by linear regression. (Thresholding the output of linear regression often performs poorly on classification problems.)
- Suppose you train a logistic classifier \(h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)\). Suppose \(\theta_0 = 6, \theta_1 = -1, \theta_2 = 0\). Which of the following figures represents the decision boundary found by your classifier?
  (The answer options are figures that are not reproduced here. The correct figure shows the vertical line \(x_1 = 6\): since \(h_\theta(x) = g(6 - x_1)\), the classifier predicts \(y = 1\) when \(6 - x_1 \geq 0\), i.e., \(x_1 \leq 6\), so the region to the left of the line is the positive region.)