1. advantage: when the number of features is very large, the previous algorithms (e.g., logistic regression with many polynomial features) become impractical; neural networks are a better way to learn complex nonlinear hypotheses.
2. representation
a_i^(j): "activation" of unit i in layer j
θ^(j): matrix of weights controlling the function mapping from layer j to layer j+1
3. sample
for a network with 3 input units, one hidden layer of 3 units, and 1 output unit, the activations are:
    a_1^(2) = g(θ_10^(1) x_0 + θ_11^(1) x_1 + θ_12^(1) x_2 + θ_13^(1) x_3)
    a_2^(2) = g(θ_20^(1) x_0 + θ_21^(1) x_1 + θ_22^(1) x_2 + θ_23^(1) x_3)
    a_3^(2) = g(θ_30^(1) x_0 + θ_31^(1) x_1 + θ_32^(1) x_2 + θ_33^(1) x_3)
    h_θ(x) = a_1^(3) = g(θ_10^(2) a_0^(2) + θ_11^(2) a_1^(2) + θ_12^(2) a_2^(2) + θ_13^(2) a_3^(2))
if the network has s_j units in layer j and s_{j+1} units in layer j+1, then θ^(j) will be of dimension s_{j+1} x (s_j + 1); the +1 accounts for the bias unit. e.g., with s_j = 2 and s_{j+1} = 4, θ^(j) is 4 x 3.
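a quick NumPy check of the dimension rule (variable names are illustrative):

    import numpy as np

    s_j, s_j1 = 2, 4                   # units in layer j and layer j+1
    theta = np.zeros((s_j1, s_j + 1))  # θ^(j) has dimension s_{j+1} x (s_j + 1)
    a_j = np.ones(s_j + 1)             # layer-j activations with bias a_0 = 1
    z_j1 = theta @ a_j                 # produces the s_{j+1} = 4 inputs to layer j+1
    print(theta.shape)                 # (4, 3)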
4. forward propagation:
    a^(1) = x;  z^(j+1) = θ^(j) a^(j);  a^(j+1) = g(z^(j+1))
add a bias unit a_0^(j) = 1 to each layer before applying θ^(j); repeat until the output layer, where h_θ(x) = a^(L).
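a minimal vectorized sketch of this loop in NumPy (function names are my own, not from the lecture):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_prop(x, thetas):
        """thetas[j] is the weight matrix mapping layer j+1 to layer j+2."""
        a = x                          # a^(1) = x
        for theta in thetas:
            a = np.insert(a, 0, 1.0)   # add bias unit a_0 = 1
            a = sigmoid(theta @ a)     # a^(j+1) = g(θ^(j) a^(j))
        return a                       # h_θ(x) = a^(L)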
5. cost function
L: total no. of layers in the network
s_l: no. of units (not counting the bias unit) in layer l
K: no. of output units
for K-class classification, the regularized cost is
    J(θ) = -(1/m) Σ_{i=1..m} Σ_{k=1..K} [ y_k^(i) log((h_θ(x^(i)))_k) + (1 - y_k^(i)) log(1 - (h_θ(x^(i)))_k) ] + (λ/2m) Σ_{l=1..L-1} Σ_{i=1..s_l} Σ_{j=1..s_{l+1}} (θ_ji^(l))^2
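a NumPy sketch of this cost, assuming H holds the forward-prop outputs for all m examples and Y the one-hot labels (both names are illustrative):

    import numpy as np

    def nn_cost(H, Y, thetas, lam):
        """H, Y: (m, K) arrays; thetas: list of weight matrices."""
        m = Y.shape[0]
        # cross-entropy term, summed over examples and output units
        J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
        # regularization term: squared weights, excluding the bias columns
        J += lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
        return J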
6. gradient computation
need code to compute:
    J(θ) and the partial derivatives ∂J(θ)/∂θ_ij^(l)
backpropagation algorithm:
    δ_j^(l): "error" of node j in layer l
    output layer: δ^(L) = a^(L) - y
    hidden layers: δ^(l) = (θ^(l))^T δ^(l+1) .* g'(z^(l)), with g'(z^(l)) = a^(l) .* (1 - a^(l))
    accumulate Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T over all examples; then D^(l) = (1/m) Δ^(l) (plus (λ/m) θ^(l) for the non-bias weights) gives ∂J/∂θ^(l)
sample network: (walkthrough of these steps on a small example network)
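a compact per-example backprop sketch in NumPy, assuming sigmoid activations throughout (structure and names are my own):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_single(x, y, thetas):
        """Gradient contribution of one (x, y) pair, unregularized."""
        # forward pass, storing each layer's bias-augmented activations
        acts = []
        a = x
        for theta in thetas:
            a = np.insert(a, 0, 1.0)
            acts.append(a)
            a = sigmoid(theta @ a)
        delta = a - y                             # δ^(L) = a^(L) - y
        grads = [None] * len(thetas)
        for l in range(len(thetas) - 1, -1, -1):
            grads[l] = np.outer(delta, acts[l])   # δ^(l+1) (a^(l))^T
            if l > 0:
                # δ^(l) = (θ^(l))^T δ^(l+1) .* a^(l) .* (1 - a^(l))
                delta = (thetas[l].T @ delta) * acts[l] * (1 - acts[l])
                delta = delta[1:]                 # drop the bias entry
        return grads

averaging these over the training set (and adding (λ/m) θ^(l) to the non-bias entries) yields D^(l).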
7. gradient checking
numerically approximate each derivative, ∂J/∂θ_i ≈ (J(θ + ε e_i) - J(θ - ε e_i)) / (2ε) with ε ≈ 10^-4, and check it against the backprop gradient; disable the check before training, since it is very slow.
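a sketch of the two-sided finite-difference check (cost_fn stands in for your own cost implementation):

    import numpy as np

    def numerical_gradient(cost_fn, theta, eps=1e-4):
        """Approximate dJ/dθ_i by central differences."""
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e.flat[i] = eps
            grad.flat[i] = (cost_fn(theta + e) - cost_fn(theta - e)) / (2 * eps)
        return grad

    # usage: np.allclose(numerical_gradient(J, theta), backprop_grad)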
8. random initialization
initializing every θ to zero does not work: all units in a layer then compute the same function and stay identical after each update (symmetry). instead, initialize each θ_ij^(l) to a random value in [-ε_init, ε_init].
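one way to do this in NumPy (eps_init = 0.12 is a common default, not a requirement):

    import numpy as np

    def random_init(s_out, s_in, eps_init=0.12):
        """Uniform weights in [-eps_init, eps_init]; +1 column for the bias."""
        return np.random.rand(s_out, s_in + 1) * 2 * eps_init - eps_init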
9. summary: pick a network architecture; randomly initialize the weights; implement forward propagation to get h_θ(x); implement the cost function J(θ); implement backpropagation to get the partial derivatives; use gradient checking to verify them, then disable it; finally minimize J(θ) with gradient descent or an advanced optimizer.