1. Variable definitions
\(m\)
: number of training examples

\(n\)
: number of features

\(X\)
: the design matrix; each row of \(X\) is a training example and each column is a feature, so with the bias column \(x^{(i)}_0 = 1\) prepended, \(X\) is \(m \times (n+1)\)
\[X =
\begin{pmatrix}
1 & x^{(1)}_1 & ... & x^{(1)}_n \\
1 & x^{(2)}_1 & ... & x^{(2)}_n \\
... & ... & ... & ... \\
1 & x^{(m)}_1 & ... & x^{(m)}_n \\
\end{pmatrix}\]
\[\theta =
\begin{pmatrix}
\theta_0 \\
\theta_1 \\
... \\
\theta_n \\
\end{pmatrix}\]
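In Octave, the bias column of ones is typically prepended to the raw features before anything else. A minimal sketch, assuming the raw \(m \times n\) feature matrix is stored in a hypothetical variable named features:

[m, n] = size(features);     % m training examples, n features (names assumed)
X = [ones(m, 1), features];  % prepend the bias column x_0 = 1; X is m x (n+1)
theta = zeros(n + 1, 1);     % start gradient descent from all-zero parameters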
2. Hypothesis
\[x =
\begin{pmatrix}
x_0 \\
x_1 \\
... \\
x_n \\
\end{pmatrix}
\]
\[h_\theta(x) = g(\theta^T x) = g(x_0\theta_0 + x_1\theta_1 + ... + x_n\theta_n) = \frac{1}{1 + e^{-\theta^T x}}
\]
where \(g\) is the sigmoid function
\[g(z) = \frac{1}{1 + e^{-z}}
\]
In Octave (element-wise, so z may be a scalar, vector, or matrix):
g = 1 ./ (1 + exp(-z));  % sigmoid, applied element-wise
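A quick sanity check, assuming the line above is wrapped in a function sigmoid(z) as used below:

sigmoid(0)         % ans = 0.50000
sigmoid(100)       % ans = 1.00000 (saturates for large positive z)
sigmoid([-2 0 2])  % element-wise on vectors: ans = 0.11920  0.50000  0.88080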
3. Cost function
\[J(\theta) = \frac{1}{m}\sum_{i=1}^m[-y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log(1 - h_\theta(x^{(i)}))]
\]
Vectorized version in Octave:
h = sigmoid(X * theta);                                % m x 1 vector of predictions
J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)); % inner products replace the sum over i
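As a quick check: with \(\theta = 0\), every prediction is \(g(0) = 0.5\), so the cost is \(-\log(0.5) = \log 2 \approx 0.693\) no matter what the labels are:

theta = zeros(size(X, 2), 1);
h = sigmoid(X * theta);                               % every entry is 0.5
J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) % prints J = 0.6931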
4. Goal
Find \(\theta\) that minimizes \(J(\theta)\); note that \(\theta\) is a vector here.
4.1 Gradient descent
\[\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j
\]
repeat until convergence {
\(\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j\)
} (update all \(\theta_j\) simultaneously; see the loop sketch below)
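Written as explicit loops in Octave, with a buffer vector to keep the update simultaneous (the learning rate alpha is an assumed tuning choice; the vectorized form derived next is what you would use in practice):

theta_new = theta;  % buffer, so every theta_j is computed from the old theta
for j = 1:(n + 1)
  s = 0;
  for i = 1:m
    s = s + (sigmoid(X(i, :) * theta) - y(i)) * X(i, j);
  end
  theta_new(j) = theta(j) - (alpha / m) * s;
end
theta = theta_new;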
Vectorization: define the \(1 \times (n+1)\) row vector
\(S\)
\[=
\begin{pmatrix}
h_\theta(x^{(1)})-y^{(1)} & h_\theta(x^{(2)})-y^{(2)} & ... & h_\theta(x^{(m)})-y^{(m)}
\end{pmatrix}
\begin{pmatrix}
x^{(1)}_0 & x^{(1)}_1 & ... & x^{(1)}_n \\
x^{(2)}_0 & x^{(2)}_1 & ... & x^{(2)}_n \\
... & ... & ... & ... \\
x^{(m)}_0 & x^{(m)}_1 & ... & x^{(m)}_n \\
\end{pmatrix}
\]
\[=
\begin{pmatrix}
\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_0 &
\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_1 &
... &
\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_n
\end{pmatrix}
\]
\[\theta := \theta - \frac{\alpha}{m} S^T
\]
\[h_\theta(X) = g(X\theta) = \frac{1}{1 + e^{-X\theta}}
\]
\(X\theta\) is \(m \times 1\) and \(y\) is \(m \times 1\), so
\(\frac{1}{1+e^{-X\theta}} - y\) is \(m \times 1\):
\[\frac{1}{1 + e^{-X\theta}} - y =
\begin{pmatrix}
h_\theta(x^{(1)})-y^{(1)} \\
h_\theta(x^{(2)})-y^{(2)} \\
... \\
h_\theta(x^{(m)})-y^{(m)} \\
\end{pmatrix}
\]
Transposing \(S = (g(X\theta) - y)^T X\) gives \(S^T = X^T(g(X\theta) - y)\), so the update becomes
\[\theta := \theta - \frac{\alpha}{m} X^T\left(\frac{1}{1 + e^{-X\theta}} - y\right)
\]
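Putting this together, one pass of vectorized gradient descent is a single line per iteration. A minimal sketch, where alpha and num_iters are assumed tuning choices:

alpha = 0.01;      % learning rate (assumed; tune for your data)
num_iters = 1000;  % iteration count (assumed)
for iter = 1:num_iters
  grad = (1 / m) * X' * (sigmoid(X * theta) - y);  % (n+1) x 1 gradient
  theta = theta - alpha * grad;                    % simultaneous update
end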
5. Regularized logistic regression
Regularization penalizes large parameter values to avoid overfitting (too large a \(\lambda\), however, can cause underfitting).
Cost function
\[J(\theta) = \frac{1}{m}\sum_{i=1}^m[-y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log(1 - h_\theta(x^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^n \theta^2_j
\]
Note that the penalty sum starts at \(j = 1\): the bias parameter \(\theta_0\) is not regularized.
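A vectorized Octave sketch of the regularized cost, assuming a regularization strength lambda is already defined:

h = sigmoid(X * theta);
J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) ...
    + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);  % theta(1), i.e. theta_0, is not penalized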
Gradient descent
\[\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_0
\]
\[\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j + \frac{\lambda}{m}\theta_j \quad (j \ge 1)
\]
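The corresponding vectorized gradient in Octave, again leaving \(\theta_0\) unpenalized (same assumptions as above):

grad = (1 / m) * X' * (sigmoid(X * theta) - y);           % unregularized gradient, (n+1) x 1
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % add the penalty term for j >= 1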