Suppose we want to optimize a parameterized function \(J(\theta)\), where \(\theta \in \mathbb{R}^d\); for example, \(\theta\) could be the parameters of a neural network.
More specifically, we want to minimize \(J(\theta; \mathcal{D})\) on a dataset \(\mathcal{D}\), where each point in \(\mathcal{D}\) is a pair \((x_i, y_i)\).
There are different ways to apply gradient descent.
Let \(\eta\) be the learning rate.
- Vanilla batch update
\(\theta \gets \theta - \eta \nabla J(\theta; \mathcal{D})\)
Note that \(\nabla J(\theta; \mathcal{D})\) computes the gradient over the whole dataset \(\mathcal{D}\).
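Assuming, as is typical (though not stated above), that the loss decomposes as an average over examples, the full-batch gradient is just the mean of the per-example gradients:

\[ \nabla J(\theta; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{(x_i, y_i) \in \mathcal{D}} \nabla J(\theta; x_i, y_i). \]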
```python
for epoch in range(n_epochs):
    # One update per epoch, using the gradient over the entire dataset D
    gradient = compute_gradient(J, theta, D)
    theta = theta - eta * gradient
    eta = eta * 0.95  # decay the learning rate after each epoch
```
When \(\mathcal{D}\) is large, computing the gradient over the entire dataset for every single update becomes infeasible.
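For concreteness, here is a minimal sketch of one way `compute_gradient` could be implemented. The least-squares loss, the linear model, and the NumPy implementation are all assumptions of this sketch, not part of the notes:

```python
import numpy as np

def compute_gradient(J, theta, D):
    # Sketch only: ignores the symbolic J and hard-codes the gradient of a
    # mean-squared-error loss (1/n) * sum_i (theta @ x_i - y_i)^2 for a
    # linear model. D is any iterable of (x_i, y_i) pairs of NumPy arrays.
    D = list(D)
    grad = np.zeros_like(theta)
    for x_i, y_i in D:
        grad += 2.0 * (theta @ x_i - y_i) * x_i  # per-example gradient
    return grad / len(D)  # average over the examples seen
```

The same function also covers the mini-batch case below, since a mini-batch is just a smaller iterable of pairs.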
- Stochastic Gradient Descent
Stochastic gradient descent, on the other hand, updates the parameters one example at a time.
\(\theta \gets \theta - \eta \nabla J(\theta; x_i, y_i)\), where \((x_i, y_i) \in \mathcal{D}\).
```python
for epoch in range(n_epochs):
    for x_i, y_i in D:
        # One update per training example
        gradient = compute_gradient(J, theta, x_i, y_i)
        theta = theta - eta * gradient
    eta = eta * 0.95  # decay the learning rate after each epoch
```
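One detail the loop above glosses over: in practice the examples are usually visited in a fresh random order each epoch, since sweeping \(\mathcal{D}\) in a fixed order can bias the updates. A minimal sketch of the same loop with shuffling added (the shuffle step is my addition; it assumes `D` is a mutable list of pairs):

```python
import random

for epoch in range(n_epochs):
    random.shuffle(D)  # assumption: D is a list of (x_i, y_i) pairs
    for x_i, y_i in D:
        gradient = compute_gradient(J, theta, x_i, y_i)
        theta = theta - eta * gradient
    eta = eta * 0.95  # decay the learning rate after each epoch
```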
- Mini-batch Stochastic Gradient Descent
Updating \(\theta\) example by example can lead to high-variance updates. An alternative is to update \(\theta\) on mini-batches \(M\) with \(|M| \ll |\mathcal{D}|\), so each step becomes \(\theta \gets \theta - \eta \nabla J(\theta; M)\).
```python
batch_size = 32  # assumption: a typical small batch size
for epoch in range(n_epochs):
    # iterate_minibatches is sketched below
    for M in iterate_minibatches(D, batch_size):
        gradient = compute_gradient(J, theta, M)
        theta = theta - eta * gradient
    eta = eta * 0.95  # decay the learning rate after each epoch
```
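The loop above assumes a helper that cuts \(\mathcal{D}\) into batches. A minimal sketch of such a helper (the name `iterate_minibatches` and the shuffling step are my assumptions, not part of the notes):

```python
import random

def iterate_minibatches(D, batch_size):
    # Yield successive mini-batches of (x_i, y_i) pairs in a fresh
    # random order each call, so consecutive epochs see different batches.
    indices = list(range(len(D)))
    random.shuffle(indices)
    for start in range(0, len(D), batch_size):
        yield [D[i] for i in indices[start:start + batch_size]]
```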
Question: why does decaying the learning rate lead to convergence?
Why is \(\sum_{i=1}^{\infty} \eta_i = \infty\) together with \(\sum_{i=1}^{\infty} \eta_i^2 < \infty\) the condition for convergence? Under what assumptions on \(J(\theta)\)?
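These are the classical Robbins–Monro step-size conditions. A rough sketch of the standard intuition, under assumptions the notes do not state (for instance, \(J\) convex and smooth, with the second moment of each stochastic gradient \(g_i\) bounded by some \(\sigma^2\)): \(\sum_i \eta_i = \infty\) lets the iterates travel an unbounded total distance, so they can reach the optimum from any starting point, while \(\sum_i \eta_i^2 < \infty\) keeps the total injected noise finite,

\[ \sum_{i=1}^{\infty} \mathbb{E}\big[\|\eta_i g_i\|^2\big] \le \sigma^2 \sum_{i=1}^{\infty} \eta_i^2 < \infty, \]

so the step sizes eventually shrink fast enough for the iterates to settle rather than oscillate around the optimum forever.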