8. Machine Learning, Week 3 (3)

Logistic Regression Model

1. Cost Function

We cannot reuse the squared-error cost function from linear regression: once the hypothesis is passed through the logistic (sigmoid) function, that cost becomes a wavy, non-convex function of θ with many local optima, so gradient descent is not guaranteed to reach the global minimum.

Instead, our cost function for logistic regression looks like:
$$
\begin{align}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \\ & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \\ & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0}\end{align}
$$

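When y = 1, the plot of J(θ) against $$h_\theta(x)$$ (image not reproduced here) is 0 at $$h_\theta(x) = 1$$ and grows without bound as $$h_\theta(x) \rightarrow 0$$.

Similarly, when y = 0, the plot of J(θ) against $$h_\theta(x)$$ is 0 at $$h_\theta(x) = 0$$ and grows without bound as $$h_\theta(x) \rightarrow 1$$. In symbols: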

$$
\begin{align}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \\ & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \\ & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \end{align}
$$
If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.

If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.

Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.
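
The limiting behaviour described above can be checked numerically. The following is a minimal Python sketch; the function name `cost` and the sample probabilities are illustrative assumptions, not part of the course material.

```python
import math

def cost(h, y):
    """Per-example logistic cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)

# A confident, correct prediction costs almost nothing ...
print(cost(0.99, 1))  # about 0.01
# ... while a confident, wrong prediction is penalised heavily.
print(cost(0.01, 1))  # about 4.6
print(cost(0.99, 0))  # about 4.6, by symmetry
```

Pushing `h` closer to the wrong extreme makes the cost grow without bound, which matches the limits stated above.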

2. Simplified Cost Function and Gradient Descent

We can compress our cost function's two conditional cases into one case:

$$
\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))
$$
Notice that when y is equal to 1, the second term $$(1-y)\log(1-h_\theta(x))$$ will be zero and will not affect the result. If y is equal to 0, the first term $$-y \log(h_\theta(x))$$ will be zero and will not affect the result.

We can fully write out our entire cost function as follows:
$$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$$

A vectorized implementation is:
$$\begin{align} & h = g(X\theta) \\ & J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) \end{align}$$
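
As a rough illustration of the formula above, here is a Python sketch that uses only the standard library, so plain lists stand in for the matrix operations. The names `sigmoid` and `cost_function` and the toy data are assumptions made for this example.

```python
import math

def sigmoid(z):
    """The logistic function g(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def cost_function(theta, X, y):
    """J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h)), with h = g(X theta)."""
    m = len(y)
    # h = g(X theta): one dot product per row of X
    h = [sigmoid(sum(t * x_j for t, x_j in zip(theta, row))) for row in X]
    # -y^T log(h) and -(1 - y)^T log(1 - h), written as explicit sums
    pos = -sum(y_i * math.log(h_i) for y_i, h_i in zip(y, h))
    neg = -sum((1 - y_i) * math.log(1 - h_i) for y_i, h_i in zip(y, h))
    return (pos + neg) / m

# Toy data: the first column of X is the intercept term
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
print(cost_function([0.0, 0.0], X, y))  # about 0.693 (= log 2), since every h is 0.5
```

With θ = 0 every prediction is 0.5, so the cost is log 2 regardless of the labels, which is a convenient sanity check for an implementation.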
Gradient Descent

Remember that the general form of gradient descent is:
$$\begin{align}& \text{Repeat} \; \lbrace \\ & \; \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \\ & \rbrace\end{align}$$

We can work out the derivative part using calculus to get:
$$\begin{align} & \text{Repeat} \; \lbrace \\ & \; \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \\ & \rbrace \end{align}$$

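Notice that this algorithm is identical in form to the one we used in linear regression; the only difference is that $$h_\theta(x)$$ is now the sigmoid $$g(\theta^T x)$$ rather than $$\theta^T x$$. We still have to simultaneously update all values in theta.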
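
The calculus step is not spelled out in these notes. One way to fill it in, assuming $$h_\theta(x) = g(\theta^T x)$$ with the sigmoid $$g(z) = \frac{1}{1+e^{-z}}$$ (so that $$g'(z) = g(z)(1-g(z))$$), is the following sketch for a single training example:

$$
\begin{align}
\frac{\partial}{\partial \theta_j} h_\theta(x) &= h_\theta(x)\,(1 - h_\theta(x))\, x_j \\
\frac{\partial}{\partial \theta_j} \mathrm{Cost}(h_\theta(x), y) &= -\frac{y}{h_\theta(x)} \frac{\partial h_\theta(x)}{\partial \theta_j} + \frac{1-y}{1-h_\theta(x)} \frac{\partial h_\theta(x)}{\partial \theta_j} \\
&= \left(-y\,(1-h_\theta(x)) + (1-y)\,h_\theta(x)\right) x_j \\
&= (h_\theta(x) - y)\, x_j
\end{align}
$$

Averaging this over the m training examples gives the $$\frac{1}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$ term in the update rule.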

A vectorized implementation is:

$$\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})$$
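
The update rule can be sketched in plain Python as follows. This is an illustrative example under stated assumptions: the helper names, the toy data, the learning rate `alpha=0.1`, and the iteration count are all choices made here, not values from the course.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, row):
    """h_theta(x) = g(theta^T x) for one example."""
    return sigmoid(sum(t * x_j for t, x_j in zip(theta, row)))

def cost(theta, X, y):
    total = 0.0
    for row, y_i in zip(X, y):
        h = predict_proba(theta, row)
        total += -y_i * math.log(h) - (1 - y_i) * math.log(1 - h)
    return total / len(y)

def gradient_step(theta, X, y, alpha):
    m = len(y)
    errors = [predict_proba(theta, row) - y_i for row, y_i in zip(X, y)]
    # X^T (g(X theta) - y): one entry per parameter theta_j
    grad = [sum(e * row[j] for e, row in zip(errors, X)) for j in range(len(theta))]
    # Build a new list so every theta_j is updated from the same old theta
    return [t - alpha / m * g for t, g in zip(theta, grad)]

# Toy data: the first column of X is the intercept term
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)
for _ in range(2000):
    theta = gradient_step(theta, X, y, alpha=0.1)
final_cost = cost(theta, X, y)
predictions = [1 if predict_proba(theta, row) >= 0.5 else 0 for row in X]
print(initial_cost, final_cost, predictions)
```

Returning a fresh list from `gradient_step` is what makes the update simultaneous: no θ_j is overwritten before the others have been computed from the old values.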

3. Advanced Optimization

"Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them.

We first need to provide a function that evaluates the following two functions for a given input value θ:

$$\begin{align} & J(\theta) \\ & \dfrac{\partial}{\partial \theta_j}J(\theta)\end{align}$$

We can write a single function that returns both of these:
```
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end
```
Then we can use Octave's "fminunc()" optimization algorithm along with the "optimset()" function, which creates an object containing the options we want to send to "fminunc()". (Note: the value for MaxIter should be an integer, not a character string; this is an erratum in the video at 7:30.)
```
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
We pass "fminunc()" our cost function, our initial vector of theta values, and the "options" object that we created beforehand.
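
The same calling pattern, one function that returns both the cost and its gradient and is handed to an optimizer, can be sketched in Python. Everything here is an assumption made for illustration: the quadratic cost is a stand-in example, and `minimize` is a hypothetical helper that runs plain gradient descent, not BFGS and not a port of "fminunc()".

```python
def cost_function(theta):
    """Return (j_val, gradient) for J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2."""
    j_val = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    gradient = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    return j_val, gradient

def minimize(fun, initial_theta, max_iter=100, alpha=0.1):
    """Hypothetical stand-in optimizer: fixed-step gradient descent."""
    theta = list(initial_theta)
    for _ in range(max_iter):
        _, grad = fun(theta)  # the optimizer only needs the gradient to step
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    j_val, _ = fun(theta)
    return theta, j_val

opt_theta, function_val = minimize(cost_function, [0.0, 0.0], max_iter=100)
print(opt_theta, function_val)
```

The point of the sketch is the interface: the optimizer never needs to know how the cost is defined, only how to call a function that returns the value and the gradient together.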

4. Multiclass Classification: One-vs-all

Now we will approach the classification of data when we have more than two categories. Instead of $$y \in \lbrace 0, 1 \rbrace$$, we expand our definition so that $$y \in \lbrace 0, 1, \dots, n \rbrace$$.

Since $$y \in \lbrace 0, 1, \dots, n \rbrace$$, we divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that 'y' is a member of one of our classes.
$$\begin{align}& y \in \lbrace 0, 1, \dots, n \rbrace \\ & h_\theta^{(0)}(x) = P(y = 0 | x ; \theta) \\ & h_\theta^{(1)}(x) = P(y = 1 | x ; \theta) \\ & \cdots \\ & h_\theta^{(n)}(x) = P(y = n | x ; \theta) \\ & \mathrm{prediction} = \max_i( h_\theta^{(i)}(x) ) \end{align}$$

We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.

The original post showed an image here illustrating how one could classify 3 classes (not reproduced).

To summarize:

Train a logistic regression classifier $$h_\theta^{(i)}(x)$$ for each class i to predict the probability that y = i.

To make a prediction on a new x, pick the class i that maximizes $$h_\theta^{(i)}(x)$$.
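
The two steps in this summary can be sketched in Python using only the standard library. The function names, the toy three-cluster data, and the training settings (`alpha=0.1`, 3000 iterations) are assumptions chosen for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, row):
    return sigmoid(sum(t * x_j for t, x_j in zip(theta, row)))

def train_binary(X, labels, alpha=0.1, iterations=3000):
    """Fit one binary logistic regression classifier by batch gradient descent."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        errors = [hypothesis(theta, row) - y_i for row, y_i in zip(X, labels)]
        grad = [sum(e * row[j] for e, row in zip(errors, X)) for j in range(n)]
        theta = [t - alpha / m * g for t, g in zip(theta, grad)]
    return theta

def train_one_vs_all(X, y, classes):
    # For class c, relabel the data: 1 if the example belongs to c, else 0
    return {c: train_binary(X, [1 if y_i == c else 0 for y_i in y]) for c in classes}

def predict(thetas, row):
    # Pick the class whose classifier reports the highest probability
    return max(thetas, key=lambda c: hypothesis(thetas[c], row))

# Toy data: intercept term plus two features, three well-separated clusters
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0],
     [1, 5, 0], [1, 6, 0], [1, 5, 1],
     [1, 0, 5], [1, 0, 6], [1, 1, 5]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
thetas = train_one_vs_all(X, y, classes=[0, 1, 2])
print([predict(thetas, row) for row in X])
```

Each classifier only ever sees a binary problem; the multiclass decision is made entirely in `predict`, by comparing the probabilities the individual classifiers return.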

Original post: https://www.cnblogs.com/flysevenwu/p/6197640.html
