Neural Networks
The ‘one learning algorithm’ hypothesis
- Neuron-rewiring experiments
Model Representation
Define
- Sigmoid (logistic) activation function
- bias unit
- input layer
- output layer
- hidden layer
- \(a_i^{(j)}\): ‘activation’ of unit \(i\) in layer \(j\)
- \(\Theta^{(j)}\): matrix of weights controlling the function mapping from layer \(j\) to layer \(j + 1\).
Calculate
\[a^{(j)} = g(z^{(j)})\]
\[g(x) = \frac{1}{1 + e^{-x}}\]
\[z^{(j + 1)} = \Theta^{(j)} a^{(j)}\]
\[h_\Theta(x) = a^{(j + 1)} = g(z^{(j + 1)})\]
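A minimal Octave sketch of these forward-propagation equations, assuming a 3-layer network whose weight matrices Theta1 and Theta2 are already defined and x is a single training example stored as a column vector:
% forward propagation for one example x (column vector), 3-layer network
g = @(z) 1 ./ (1 + exp(-z));   % sigmoid activation
a1 = [1; x];                   % input layer with bias unit
z2 = Theta1 * a1;
a2 = [1; g(z2)];               % hidden layer with bias unit
z3 = Theta2 * a2;
h  = g(z3);                    % h_Theta(x) = a3, the output layer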
Cost Function
\[
J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[ y^{(i)}_k \log\big((h_\Theta (x^{(i)}))_k\big) + (1 - y^{(i)}_k)\log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (\Theta_{j,i}^{(l)})^2
\]
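A hedged sketch of this cost in Octave for a 3-layer network, assuming h is the m x K matrix of hypothesis outputs and Y the m x K matrix of one-hot labels (both names are illustrative):
% unregularized part: sum over all m examples and K output units
J = -(1 / m) * sum(sum(Y .* log(h) + (1 - Y) .* log(1 - h)));
% regularization: squared weights of every layer, excluding the bias columns
reg = (lambda / (2 * m)) * (sum(sum(Theta1(:, 2:end) .^ 2)) + ...
                            sum(sum(Theta2(:, 2:end) .^ 2)));
J = J + reg;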
Back-propagation Algorithm
Algorithm
- Suppose we have already calculated all the \(a^{(l)}\) and \(z^{(l)}\) via forward propagation (see the sketch after these steps)
- set \(\Delta^{(l)}_{i, j} := 0\) for all \(l, i, j\)
- using \(y^{(t)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(t)}\), where \(y^{(t)}_{k} \in \{0, 1\}\) indicates whether the current training example belongs to class \(k\) (\(y^{(t)}_{k} = 1\)) or to a different class (\(y^{(t)}_{k} = 0\));
- For the hidden layers \(l = L - 1\) down to \(l = 2\), set
\[
\delta^{(l)} = (\Theta^{(l)})^T\delta^{(l + 1)} .* g'(z^{(l)})
\]
- remember to remove \(\delta_0^{(l)}\), e.g. with delta(2:end)
\[
\Delta^{(l)} = \Delta^{(l)} + \delta^{(l + 1)}(a^{(l)})^T
\]
- gradient:
\[
\frac{\partial}{\partial\Theta^{(l)}_{i,j}}J(\Theta) = D^{(l)}_{i,j} = \frac{1}{m}\Delta^{(l)}_{i,j} +
\begin{cases} \frac{\lambda}{m}\Theta^{(l)}_{i, j}, & \text{if } j \geq 1 \\ 0, & \text{if } j = 0 \end{cases}
\]
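A minimal sketch of these steps in Octave for a 3-layer network, assuming the quantities from the forward-propagation sketch above (a1, a2, a3, z2) plus the accumulators Delta1, Delta2 (initialized to zero) and a one-hot label vector yvec for the current example:
% back-propagation for one training example
g = @(z) 1 ./ (1 + exp(-z));                       % sigmoid, as above
delta3 = a3 - yvec;                                % output-layer error
% hidden-layer error; using Theta2(:, 2:end) drops the bias column, which is
% equivalent to computing the full vector and then taking delta(2:end)
delta2 = (Theta2(:, 2:end)' * delta3) .* (g(z2) .* (1 - g(z2)));
Delta1 = Delta1 + delta2 * a1';                    % accumulate gradients
Delta2 = Delta2 + delta3 * a2';
% after the loop over all m examples, form D with regularization (bias column unregularized)
Theta1_grad = (1 / m) * Delta1 + (lambda / m) * [zeros(size(Theta1, 1), 1), Theta1(:, 2:end)];
Theta2_grad = (1 / m) * Delta2 + (lambda / m) * [zeros(size(Theta2, 1), 1), Theta2(:, 2:end)];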
Gradient Checking
- \[
\frac{d}{d\Theta}J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}
\]
- A small value for \(\epsilon\) such as \(\epsilon = 10^{-4}\)
- check that gradApprox \(\approx\) deltaVector (the unrolled gradient from back-propagation); the two should agree to several decimal places
epsilon = 1e-4;
n = length(theta);             % theta is the unrolled parameter vector
gradApprox = zeros(n, 1);
for i = 1 : n
  thetaPlus = theta;
  thetaPlus(i) += epsilon;     % perturb the i-th parameter upwards
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;    % perturb the i-th parameter downwards
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);  % two-sided estimate
end;
Rolling and Unrolling
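Advanced optimizers expect a single parameter vector, so in practice the weight matrices are "unrolled" into one long vector and reshaped back when needed. A brief Octave sketch, using illustrative sizes of 25 x 401 for Theta1 and 10 x 26 for Theta2:
% unroll the weight matrices (and gradients) into single vectors
thetaVec = [Theta1(:); Theta2(:)];
DVec = [D1(:); D2(:)];
% roll back into matrices when a cost function needs them
Theta1 = reshape(thetaVec(1:10025), 25, 401);      % 25 * 401 = 10025 elements
Theta2 = reshape(thetaVec(10026:end), 10, 26);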
Random Initialization
Theta = rand(n, m) * (2 * INIT_EPSILON) - INIT_EPSILON;   % each entry in [-INIT_EPSILON, INIT_EPSILON]
- initialize \(\Theta^{(l)}_{ij} \in [-\epsilon, \epsilon]\)
- otherwise, if we initialize all theta weights to zero, all nodes will update to the same value repeatedly when we back-propagate.
- One effective strategy for choosing \(\epsilon_{init}\) is to base it on the number of units in the network. A good choice of \(\epsilon_{init}\) is \(\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}\)
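For example, with illustrative layer sizes \(L_{in} = 400\) and \(L_{out} = 25\), this gives \(\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{400 + 25}} \approx 0.12\).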
Training a Neural Network
- Randomly initialize weights
Theta = rand(n, m) * (2 * epsilon) - epsilon;
- Implement forward propagation to get \(h_\Theta(x^{(i)})\) for any \(x^{(i)}\)
- Implement code to compute the cost function \(J(\Theta)\)
- Implement back-propagation to compute the partial derivatives \(\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)\)
- \(g'(z) = \frac{d}{dz}g(z) = g(z)(1 - g(z))\)
- \(\mathrm{sigmoid}(z) = g(z) = \frac{1}{1 + e^{-z}}\)
- Use gradient checking to compare \(\frac{\partial}{\partial \Theta_{jk}^{(l)}} J(\Theta)\) computed using back-propagation vs. a numerical estimate of the gradient of \(J(\Theta)\). Then disable the gradient checking code.
- Use gradient descent or an advanced optimization method with back-propagation to try to minimize \(J(\Theta)\) as a function of the parameters \(\Theta\)
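As a closing sketch, a hedged Octave example of that last step, assuming a user-supplied nnCostFunction that returns both \(J(\Theta)\) and the unrolled gradient for an unrolled parameter vector:
% minimize J(Theta) with an advanced optimizer instead of plain gradient descent
options = optimset('GradObj', 'on', 'MaxIter', 50);
initialThetaVec = [Theta1(:); Theta2(:)];                  % unrolled initial weights
costFunc = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                               num_labels, X, y, lambda);  % nnCostFunction is user-defined
[thetaVec, cost] = fminunc(costFunc, initialThetaVec, options);
% reshape thetaVec back into Theta1 and Theta2 before making predictions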