  • Softmax Regression

    This model generalizes logistic regression to classification problems where the class label y can take on more than two possible values.

    Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

    With logistic regression, we were in the binary classification setting, so the labels were y^{(i)} \in \{0,1\}. Our hypothesis took the form:

\begin{align}
h_\theta(x) = \frac{1}{1+\exp(-\theta^T x)},
\end{align}

    and the model parameters θ were trained to minimize the cost function

    
\begin{align}
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right]
\end{align}

    In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label y can take on k different values, rather than only two.

    Given a test input x, we want our hypothesis to estimate the probability p(y = j | x) for each value of j = 1, \ldots, k. I.e., we want to estimate the probability of the class label taking on each of the k different possible values. Thus, our hypothesis will output a k-dimensional vector (whose elements sum to 1) giving us our k estimated probabilities. Concretely, our hypothesis h_\theta(x) takes the form:

    
\begin{align}
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 | x^{(i)}; \theta) \\
p(y^{(i)} = 2 | x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k | x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} }
\begin{bmatrix}
e^{ \theta_1^T x^{(i)} } \\
e^{ \theta_2^T x^{(i)} } \\
\vdots \\
e^{ \theta_k^T x^{(i)} } \\
\end{bmatrix}
\end{align}

    Here \theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1} are the parameters of our model. Notice that the term \frac{1}{ \sum_{j=1}^{k}{e^{ \theta_j^T x^{(i)} }} } normalizes the distribution, so that it sums to one.

    For convenience, we will also write θ to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent θ as a k-by-(n + 1) matrix obtained by stacking up \theta_1, \theta_2, \ldots, \theta_k in rows, so that

    
\theta = \begin{bmatrix}
\mbox{---} \theta_1^T \mbox{---} \\
\mbox{---} \theta_2^T \mbox{---} \\
\vdots \\
\mbox{---} \theta_k^T \mbox{---} \\
\end{bmatrix}
    (That is, θ is a k×(n+1) matrix.)
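
    To make the shapes concrete, here is a minimal NumPy sketch (not from the original tutorial; names such as softmax_probs, Theta, and X are illustrative) that evaluates the hypothesis for a batch of inputs, with θ stored as the k-by-(n + 1) matrix just described:

```python
# Minimal sketch, assuming Theta is the k-by-(n+1) parameter matrix and X holds one
# example per row with the intercept term already appended, so X has shape (m, n+1).
import numpy as np

def softmax_probs(Theta, X):
    """Return an (m, k) matrix whose i-th row is h_theta(x^{(i)})."""
    scores = X @ Theta.T                          # entry (i, j) is theta_j^T x^{(i)}
    scores -= scores.max(axis=1, keepdims=True)   # shift scores for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)  # each row sums to 1
```

    Subtracting the row-wise maximum before exponentiating does not change the output (this is the invariance to subtracting a fixed vector ψ discussed below), but it avoids overflow in the exponentials.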

    Cost Function

    Our cost function will be:

    
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
\end{align}

    Notice that this generalizes the logistic regression cost function, which could also have been written:

    
\begin{align}
J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\
&= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right]
\end{align}

    1\{\cdot\} is the indicator function, so that 1\{a true statement\} = 1, and 1\{a false statement\} = 0.

    The softmax cost function is similar, except that we now sum over the k different possible values of the class label. Note also that in softmax regression, we have that p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} }.
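
    As a hedged illustration (not the tutorial's reference implementation), this cost can be computed by reusing the hypothetical softmax_probs helper from the earlier sketch; the indicator 1\{y^{(i)} = j\} simply picks out the log-probability of each example's own label. Here y is assumed to hold integer labels in {0, ..., k-1} rather than {1, ..., k}:

```python
def softmax_cost(Theta, X, y):
    """Unregularized softmax cost J(theta); y is an (m,) array of labels in {0, ..., k-1}."""
    m = X.shape[0]
    probs = softmax_probs(Theta, X)                  # (m, k), rows are p(y = j | x; theta)
    log_likelihood = np.log(probs[np.arange(m), y])  # log-probability of each example's label
    return -log_likelihood.mean()
```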

    Properties of softmax regression parameterization

    Softmax regression has an unusual property: it has a "redundant" set of parameters. To explain what this means, suppose we take each of our parameter vectors θj, and subtract some fixed vector ψ from it, so that every θj is now replaced with θj − ψ (for every j = 1, \ldots, k). Our hypothesis now estimates the class label probabilities as

    
\begin{align}
p(y^{(i)} = j | x^{(i)} ; \theta)
&= \frac{e^{(\theta_j-\psi)^T x^{(i)}}}{\sum_{l=1}^k e^{ (\theta_l-\psi)^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^k e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} \\
&= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}}}.
\end{align}

    In other words, subtracting ψ from every θj does not affect our hypothesis' predictions at all! This shows that softmax regression's parameters are "redundant." More formally, we say that our softmax model is overparameterized, meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function hθ mapping from inputs x to the predictions.
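
    A quick numerical check (illustrative only, reusing the hypothetical softmax_probs sketch from above) makes this invariance concrete:

```python
rng = np.random.default_rng(0)
k, n, m = 4, 5, 3
Theta = rng.normal(size=(k, n + 1))
X = rng.normal(size=(m, n + 1))
psi = rng.normal(size=n + 1)

p_original = softmax_probs(Theta, X)
p_shifted = softmax_probs(Theta - psi, X)   # every theta_j replaced by theta_j - psi
print(np.allclose(p_original, p_shifted))   # True: the predictions are unchanged
```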

    Further, if the cost function J(θ) is minimized by some setting of the parameters (\theta_1, \theta_2, \ldots, \theta_k), then it is also minimized by (\theta_1 - \psi, \theta_2 - \psi, \ldots, \theta_k - \psi) for any value of ψ. Thus, the minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)

    Notice also that by setting ψ = θ1, one can always replace θ1 with \theta_1 - \psi = \vec{0} (the vector of all 0's), without affecting the hypothesis. Thus, one could "eliminate" the vector of parameters θ1 (or any other θj, for any single value of j), without harming the representational power of our hypothesis. Indeed, rather than optimizing over the k(n + 1) parameters (\theta_1, \theta_2, \ldots, \theta_k) (where \theta_j \in \Re^{n+1}), one could instead set \theta_1 = \vec{0} and optimize only with respect to the (k − 1)(n + 1) remaining parameters, and this would work fine. (This reduces the parameterization by one dimension.)

    In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters (\theta_1, \theta_2, \ldots, \theta_k), without arbitrarily setting one of them to zero. But we will make one change to the cost function: adding weight decay. This will take care of the numerical problems associated with softmax regression's overparameterized representation.

    Weight Decay

    We will modify the cost function by adding a weight decay term \textstyle \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^{n} \theta_{ij}^2 which penalizes large values of the parameters. Our cost function is now

    
\begin{align}
J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\left\{y^{(i)} = j\right\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right]
              + \frac{\lambda}{2} \sum_{i=1}^k \sum_{j=0}^n \theta_{ij}^2
\end{align}

    With this weight decay term (for any λ > 0), the cost function J(θ) is now strictly convex, and is guaranteed to have a unique solution. The Hessian is now invertible, and because J(θ) is convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.

    To apply an optimization algorithm, we also need the derivative of this new definition of J(θ). One can show that the derivative is:
\begin{align}
\nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] } + \lambda \theta_j
\end{align}
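
    A minimal sketch of the weight-decayed cost together with this gradient, under the same assumptions as the earlier snippets (Theta is k-by-(n + 1), y holds integer labels in {0, ..., k-1}, and lam stands for λ), might look like this; the returned gradient has the same shape as Theta, with row j equal to \nabla_{\theta_j} J(\theta):

```python
def softmax_cost_and_grad(Theta, X, y, lam):
    """Weight-decayed cost J(theta) and its gradient, one row of grad per theta_j."""
    m, k = X.shape[0], Theta.shape[0]
    probs = softmax_probs(Theta, X)                        # (m, k)
    indicator = np.zeros((m, k))
    indicator[np.arange(m), y] = 1.0                       # 1{y^{(i)} = j}
    cost = (-np.log(probs[np.arange(m), y]).mean()
            + 0.5 * lam * np.sum(Theta ** 2))
    grad = -(indicator - probs).T @ X / m + lam * Theta    # (k, n+1)
    return cost, grad
```

    The (cost, grad) pair can then be handed to gradient descent or an off-the-shelf L-BFGS routine, flattening and reshaping Theta as the optimizer requires.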

    Relationship to Logistic Regression

    In the special case where k = 2, one can show that softmax regression reduces to logistic regression. This shows that softmax regression is a generalization of logistic regression. Concretely, when k = 2, the softmax regression hypothesis outputs

    
\begin{align}
h_\theta(x) &=
\frac{1}{ e^{\theta_1^T x} + e^{ \theta_2^T x } }
\begin{bmatrix}
e^{ \theta_1^T x } \\
e^{ \theta_2^T x }
\end{bmatrix}
\end{align}

    Taking advantage of the fact that this hypothesis is overparameterized and setting ψ = θ1, we can subtract θ1 from each of the two parameters, giving us

    
\begin{align}
h(x) &=
\frac{1}{ e^{\vec{0}^T x} + e^{ (\theta_2-\theta_1)^T x } }
\begin{bmatrix}
e^{ \vec{0}^T x } \\
e^{ (\theta_2-\theta_1)^T x }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\frac{e^{ (\theta_2-\theta_1)^T x }}{ 1 + e^{ (\theta_2-\theta_1)^T x } }
\end{bmatrix} \\
&=
\begin{bmatrix}
\frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
1 - \frac{1}{ 1 + e^{ (\theta_2-\theta_1)^T x } } \\
\end{bmatrix}
\end{align}

    Thus, replacing θ2 − θ1 with a single parameter vector θ', we find that softmax regression predicts the probability of one of the classes as \frac{1}{ 1 + e^{ \theta'^T x } }, and that of the other class as 1 - \frac{1}{ 1 + e^{ \theta'^T x } }, same as logistic regression.
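
    The reduction can also be checked numerically with the hypothetical softmax_probs sketch from earlier; with \theta' = \theta_2 - \theta_1, the second output equals the logistic sigmoid of \theta'^T x and the first equals one minus it:

```python
rng = np.random.default_rng(1)
theta1, theta2 = rng.normal(size=6), rng.normal(size=6)
x = rng.normal(size=6)

probs = softmax_probs(np.vstack([theta1, theta2]), x[None, :])[0]  # k = 2 hypothesis
theta_prime = theta2 - theta1
sigmoid = 1.0 / (1.0 + np.exp(-theta_prime @ x))                   # logistic regression output
print(np.allclose(probs, [1.0 - sigmoid, sigmoid]))                # True
```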

    Softmax Regression vs. k Binary Classifiers

    Suppose you are working on a music classification application, and there are k types of music that you are trying to recognize. Should you use a softmax classifier, or should you build k separate binary classifiers using logistic regression?

    This will depend on whether your classes are mutually exclusive. For example, if your four classes are classical, country, rock, and jazz, then assuming each of your training examples is labeled with exactly one of these four class labels, you should build a softmax classifier with k = 4. (If there are also some examples that are none of the above four classes, then you can set k = 5 in softmax regression, and also have a fifth, "none of the above," class.) (Mutually exclusive classes: use softmax.)

    If however your categories are has_vocals, dance, soundtrack, pop, then the classes are not mutually exclusive; for example, there can be a piece of pop music that comes from a soundtrack and in addition has vocals. In this case, it would be more appropriate to build 4 binary logistic regression classifiers. This way, for each new musical piece, your algorithm can separately decide whether it falls into each of the four categories. (Classes that are not mutually exclusive, i.e., they overlap: use k binary logistic regression classifiers.)

    Now, consider a computer vision example, where you're trying to classify images into three different classes. (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene. Would you use softmax regression or three logistic regression classifiers? (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people. Would you use softmax regression or multiple logistic regression classifiers?

    In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate. In the second case, it would be more appropriate to build three separate logistic regression classifiers.
