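This paper first appeared as a post on Sebastian Ruder's blog on 16 January 2016. This article mainly translates the core parts of the paper that are relevant to Hung-yi Lee's (李宏毅) course.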
An overview of gradient descent optimization algorithms
0. Abstract
Gradient descent optimization algorithms, while increasingly popular, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
This article aims to provide the reader with intuitions with regard to the behaviour of different algorithms that will allow her to put them to use.
In the course of this overview, we look at different variants of gradient descent, summarize challenges, introduce the most common optimization algorithms, review architectures in a parallel and distributed setting, and investigate additional strategies for optimizing gradient descent.
1. Introduction
Gradient Descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks.
At the same time, every state-of-the-art deep learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation).
These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
This article aims at providing the reader with intuitions with regard to the behaviour of different algorithms for optimizing gradient descent that will help her to put them to use.
In Section 2, we are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training in Section 3.
Subsequently, in Section 4, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules.
Afterwards, in Section 5, we will take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting.
Finally, we will consider additional strategies that are helpful for optimizing gradient descent in Section 6.
Gradient descent is a way to minimize an objective function \(J(\theta)\) parameterized by a model's parameters \(\theta \in \mathbb{R}^d\) by updating the parameters in the opposite direction of the gradient of the objective function \(\nabla_{\theta} J(\theta)\) w.r.t. the parameters.
The learning rate \(\eta\) determines the size of the steps we take to reach a (local) minimum.
In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
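To make the role of the learning rate concrete, here is a minimal sketch on a toy objective. The objective \(J(\theta) = \theta^2\), the helper names `gradient_descent` and `grad_J`, and the two learning rates are assumptions chosen for illustration; they do not come from the original post.

```python
def gradient_descent(grad, theta0, learning_rate, num_steps):
    """Run plain gradient descent and return the list of visited parameters."""
    theta = theta0
    path = [theta]
    for _ in range(num_steps):
        # Step against the gradient, scaled by the learning rate.
        theta = theta - learning_rate * grad(theta)
        path.append(theta)
    return path


def grad_J(theta):
    """Gradient of the assumed toy objective J(theta) = theta^2."""
    return 2.0 * theta


small_steps = gradient_descent(grad_J, theta0=1.0, learning_rate=0.1, num_steps=50)
large_steps = gradient_descent(grad_J, theta0=1.0, learning_rate=1.1, num_steps=50)

print(abs(small_steps[-1]))  # shrinks towards the minimum at 0
print(abs(large_steps[-1]))  # grows, because every step overshoots the valley
```

With the smaller rate each update multiplies \(\theta\) by 0.8, so the iterate walks down into the valley; with the larger rate each update multiplies it by -1.2, so the iterate jumps across the valley and moves further away every time.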
2. Gradient descent variants
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.
Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
2.1 Batch gradient descent
Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function w.r.t. the parameters \(\theta\) for the entire training dataset:
\(\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta)\) ---- (1)
As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory.
Batch gradient descent also does not allow us to update our model online, i.e. with new examples on-the-fly.
In code, batch gradient descent looks something like this:
    for i in range(nb_epochs):
        # One update per epoch, using the gradient over the entire dataset.
        params_grad = evaluate_gradient(loss_function, data, params)
        params = params - learning_rate * params_grad
For a pre-defined number of epochs, we first compute the gradient vector params_grad of the loss function for the whole dataset w.r.t. our parameter vector params.
Note that state-of-the-art deep learning libraries provide automatic differentiation that efficiently computes the gradient w.r.t. some parameters.
If you derive the gradients yourself, then gradient checking is a good idea.
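As an illustration of gradient checking, the sketch below compares a hand-derived gradient with a central-difference estimate. The toy loss, the function names, and the step size `eps` are assumptions made for this example only.

```python
def loss(params):
    """Assumed toy loss: J(theta) = theta_0^2 + 3 * theta_0 * theta_1."""
    return params[0] ** 2 + 3.0 * params[0] * params[1]


def analytic_gradient(params):
    """Hand-derived gradient of the toy loss above."""
    return [2.0 * params[0] + 3.0 * params[1], 3.0 * params[0]]


def numerical_gradient(f, params, eps=1e-6):
    """Central-difference estimate of the gradient of f at params."""
    grad = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad


params = [0.5, -1.5]
max_abs_diff = max(
    abs(a - n)
    for a, n in zip(analytic_gradient(params), numerical_gradient(loss, params))
)
print(max_abs_diff)  # should be tiny if the derivation is right
```

A large difference between the two gradients usually points to a mistake in the hand derivation or its implementation.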
We then update our parameters in the direction of the gradients with the learning rate determining how big of an update we perform.
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
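The loop above can be made runnable on a small convex problem. The following is only a sketch: the one-dimensional least-squares model, the noise-free dataset, and the hyperparameters are assumptions, and `evaluate_gradient` here folds the loss into the gradient instead of taking a separate `loss_function` argument.

```python
def evaluate_gradient(data, params):
    """Gradient of the mean squared error of y ~ w * x + b over all of data."""
    w, b = params
    n = len(data)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in data) / n
    return [grad_w, grad_b]


# Assumed noise-free dataset generated from y = 2x + 1.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]

params = [0.0, 0.0]
learning_rate = 0.1
nb_epochs = 500

for _ in range(nb_epochs):
    # Every update looks at the whole dataset once.
    params_grad = evaluate_gradient(data, params)
    params = [p - learning_rate * g for p, g in zip(params, params_grad)]

print(params)  # close to [2.0, 1.0]
```

Because the mean squared error of a linear model is convex, the iterates settle at the single global minimum, here the slope and intercept used to generate the data.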
2.2 Stochastic gradient descent
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example \(x^{(i)}\) and label \(y^{(i)}\):
\(\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta; x^{(i)}; y^{(i)})\) ---- (2)
Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update.
SGD does away with this redundancy by performing one update at a time.
It is therefore usually much faster and can also be used to learn online.
SGD performs frequent updates with a high variance that cause the objective function to fluctuate heavily as in Figure 1.
While batch gradient descent converges to the minimum of the basin the parameters are placed in, SGD's fluctuation, on the one hand, enables it to jump to new and potentially better local minima.
On the other hand, this ultimately complicates convergence to the exact minimum, as SGD will keep overshooting.
However, it has been shown that when we slowly decrease the learning rate, SGD shows the same convergence behaviour as batch gradient descent, almost certainly converging to a local or the global minimum for non-convex and convex optimization respectively.
Its code fragment simply adds a loop over the training examples and evaluates the gradient w.r.t. each example.
Note that we shuffle the training data at every epoch as explained in Section 6.1.
    import numpy as np

    for i in range(nb_epochs):
        np.random.shuffle(data)  # reshuffle the examples at every epoch
        for example in data:
            params_grad = evaluate_gradient(loss_function, example, params)
            params = params - learning_rate * params_grad
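The effect of slowly decreasing the learning rate can be tried out with a small runnable sketch. The noisy linear dataset, the per-example `evaluate_gradient`, and the particular decay schedule are all assumptions for illustration; the post itself does not prescribe a schedule.

```python
import random


def evaluate_gradient(example, params):
    """Gradient of the squared error of y ~ w * x + b for a single example."""
    w, b = params
    x, y = example
    error = w * x + b - y
    return [2.0 * error * x, 2.0 * error]


random.seed(0)
# Assumed noisy dataset scattered around y = 2x + 1.
data = []
for k in range(-50, 51):
    x = k / 50.0
    data.append((x, 2.0 * x + 1.0 + random.gauss(0.0, 0.1)))

params = [0.0, 0.0]
initial_learning_rate = 0.1
nb_epochs = 50

step = 0
for epoch in range(nb_epochs):
    random.shuffle(data)  # new example order at every epoch
    for example in data:
        # Assumed schedule: the learning rate shrinks slowly with the step count.
        learning_rate = initial_learning_rate / (1.0 + 0.01 * step)
        params_grad = evaluate_gradient(example, params)
        params = [p - learning_rate * g for p, g in zip(params, params_grad)]
        step += 1

print(params)  # roughly [2.0, 1.0], up to the noise in the data
```

Early updates are large and noisy, which lets the parameters move quickly; as the learning rate decays, the fluctuations shrink and the parameters settle near the least-squares solution.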
2.3 Mini-batch gradient descent
Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of \(n\) training examples:
\(\theta = \theta - \eta \cdot \nabla_{\theta} J(\theta; x^{(i:i+n)}; y^{(i:i+n)})\) ---- (3)
This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence;
and b) can make use of the highly optimized matrix operations common to state-of-the-art deep learning libraries, which make computing the gradient w.r.t. a mini-batch very efficient.
Common mini-batch sizes range between 50 and 256, but can vary for different applications.
Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually also used when mini-batches are employed.
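The variance reduction mentioned in a) can be checked empirically. The sketch below estimates the variance of the gradient w.r.t. one parameter at fixed parameter values for several batch sizes; the dataset, the linear model, and the number of trials are assumptions made for this illustration.

```python
import random
import statistics

random.seed(0)
# Assumed dataset scattered around y = 2x + 1 with Gaussian noise.
data = []
for _ in range(1000):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 1.0 + random.gauss(0.0, 0.5)))


def gradient_w(batch, w, b):
    """Squared-error gradient w.r.t. w, averaged over the batch."""
    return sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)


def gradient_variance(batch_size, num_trials=2000):
    """Empirical variance of the mini-batch gradient at fixed parameters."""
    estimates = [
        gradient_w(random.sample(data, batch_size), w=0.0, b=0.0)
        for _ in range(num_trials)
    ]
    return statistics.variance(estimates)


variance_by_size = {n: gradient_variance(n) for n in (1, 50, 256)}
print(variance_by_size)  # the variance drops as the batch size grows
```

Averaging over more examples gives a steadier estimate of the full-dataset gradient, which is why the parameter updates fluctuate less than with single-example SGD.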
Note: In modifications of SGD in the rest of this post, we leave out the parameters \(x^{(i:i+n)}; y^{(i:i+n)}\) for simplicity.
In code, instead of iterating over examples, we now iterate over mini-batches of size 50:
    import numpy as np

    for i in range(nb_epochs):
        np.random.shuffle(data)  # reshuffle before splitting into mini-batches
        for batch in get_batches(data, batch_size=50):
            params_grad = evaluate_gradient(loss_function, batch, params)
            params = params - learning_rate * params_grad
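The fragment above relies on a `get_batches` helper that the post does not define. One possible way to write it, together with a complete mini-batch loop on an assumed toy regression problem, is sketched here; the slicing strategy, the dataset, and the hyperparameters are illustrative assumptions.

```python
import random


def get_batches(data, batch_size):
    """Yield consecutive slices of data with at most batch_size examples each."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def evaluate_gradient(batch, params):
    """Gradient of the mean squared error of y ~ w * x + b over one mini-batch."""
    w, b = params
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return [grad_w, grad_b]


random.seed(0)
# Assumed noise-free dataset of 201 examples generated from y = 2x + 1.
data = [(k / 100.0, 2.0 * (k / 100.0) + 1.0) for k in range(-100, 101)]

params = [0.0, 0.0]
learning_rate = 0.1
nb_epochs = 200

for _ in range(nb_epochs):
    random.shuffle(data)  # new batch composition at every epoch
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(batch, params)
        params = [p - learning_rate * g for p, g in zip(params, params_grad)]

batch_sizes = [len(batch) for batch in get_batches(data, batch_size=50)]
print(batch_sizes)  # [50, 50, 50, 50, 1]
print(params)       # close to [2.0, 1.0]
```

Note that with this slicing the last mini-batch of an epoch can be smaller than the others when the dataset size is not a multiple of the batch size; dropping or padding it is a design choice left open here.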