Optimizers¶
Critical to any deep learning model is the optimization of weights to perform the given task. This set of classes provides options for selecting and customizing the appropriate optimization algorithm. Neon supports the following optimizers:
Function | Description |
---|---|
neon.optimizers.GradientDescentMomentum | Stochastic gradient descent with momentum |
neon.optimizers.RMSProp | Root Mean Square propagation (see Hinton's slides) |
neon.optimizers.Adagrad | Adagrad method for adapting the learning rate |
neon.optimizers.Adadelta | Adadelta method for adapting the learning rate |
neon.optimizers.Adam | Adam optimization algorithm |
neon.optimizers.MultiOptimizer | Class for assigning optimizers to different layers |
Each optimization algorithm inherits from the neon.optimizers.Optimizer class and implements the optimize method:

def optimize(self, layer_list, epoch):
    """
    Given the model's layers and the current training epoch,
    iterate over each layer and update the weights.
    """
The neon.optimizers.Optimizer base class also implements two methods:

- clip_gradient_value, which clips each gradient between \(-k\) and \(k\), where \(k\) is the argument gradient_clip_value.
- clip_gradient_norm, which rescales the gradients so that their overall \(L_2\)-norm does not exceed \(k\), where \(k\) is the argument gradient_clip_norm.
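As a rough illustration of the difference between the two clipping modes, here is a minimal NumPy sketch (our own, not neon's implementation):

import numpy as np

grad = np.array([-8.0, 2.0, 6.0])
k = 5.0

# value clipping: limit each element to the range [-k, k]
clipped_by_value = np.clip(grad, -k, k)          # [-5.  2.  5.]

# norm clipping: rescale the whole gradient when its L2 norm exceeds k,
# preserving its direction
norm = np.linalg.norm(grad)                      # ~10.2
clipped_by_norm = grad * (k / max(norm, k))      # norm is now exactly 5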
Stochastic Gradient Descent¶
Stochastic Gradient Descent (SGD) has existed for a while, but its usefulness for training deep neural networks has only recently been realized. SGD is similar to traditional gradient descent, except that the gradient updates are computed over a small subset of the total training data (i.e., a minibatch).

Given the parameters \(\theta\), the learning rate \(\alpha\), and the gradients \(\nabla J(\theta; x)\) computed on the minibatch data \(x\), SGD updates the parameters via

\[\theta' = \theta - \alpha\nabla J(\theta; x)\]

Here we implement SGD with momentum. Momentum tracks the history of gradient updates to help the system move faster through saddle points. Given the additional parameters momentum \(\gamma\), weight decay \(\lambda\), and current velocity \(v\), we use the following update equations:

\[v' = \gamma v - \alpha(\nabla J(\theta; x) + \lambda\theta)\]
\[\theta' = \theta + v'\]
Example usage:
from neon.optimizers import GradientDescentMomentum
# use SGD with learning rate 0.01 and momentum 0.9, while
# clipping the gradients between -5 and 5.
opt = GradientDescentMomentum(0.01, 0.9, gradient_clip_value=5)
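To make the momentum update concrete, here is one hand-computed step of the equations above in NumPy (a sketch with illustrative numbers; the weight decay \(\lambda\) is set to zero):

import numpy as np

alpha, gamma = 0.01, 0.9            # learning rate and momentum
theta = np.array([0.5, -0.3])       # parameters
v = np.zeros_like(theta)            # velocity, initialized to zero
grad = np.array([0.2, -0.1])        # minibatch gradient

v = gamma * v - alpha * grad        # v' = gamma * v - alpha * grad
theta = theta + v                   # theta' = theta + v'
# theta is now [0.498, -0.299]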
RMS propagation¶
Root Mean Square (RMS) propagation protects against vanishing and exploding gradients. In RMSprop, the gradient is divided by a running average of recent gradients. Given the parameters \(\theta\) and gradient \(\nabla J\), we keep a running average \(\mu\) of the last \(1/\lambda\) squared gradients. The update equations are then given by

\[\mu' = \lambda\mu + (1-\lambda)(\nabla J)^2\]
\[\theta' = \theta - \frac{\alpha}{\sqrt{\mu'} + \epsilon}\nabla J\]

where we use \(\epsilon\) as a (small) smoothing factor to prevent division by zero.

When the error surface reaches a plateau, the gradient is very small, but the normalization factor here increases the update step for faster learning (a small raw update \(\alpha\nabla J = 0.0001\) divided by the square root of the weighted average \(\sqrt{\mu} = 0.0005\) yields an update of 0.2). If the gradients are exploding, RMSprop also provides protection (a large raw update \(\alpha\nabla J = 100\) divided by the weighted average \(\sqrt{\mu} = 20\) yields a much smaller update of 5). Because of these advantages, RMSprop is often used in recurrent neural networks to protect against vanishing or exploding gradients.
Example usage:
from neon.optimizers import RMSProp
# RMSprop
optimizer = RMSProp(decay_rate=0.95, learning_rate=2e-3)
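The plateau behavior described above can be checked directly against the update equations (a NumPy sketch with illustrative numbers, not neon's internal code):

import numpy as np

alpha, decay, eps = 0.01, 0.95, 1e-6
grad = 0.01          # tiny gradient on a plateau; raw step alpha * grad = 1e-4
mu = 4e-6            # running average of squared gradients

mu = decay * mu + (1 - decay) * grad ** 2
step = alpha * grad / (np.sqrt(mu) + eps)
# step is ~0.034, a few hundred times larger than the raw step of 1e-4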
Adagrad¶
Adagrad is an algorithm that adapts the learning rate individually for each parameter by dividing by the \(L_2\)-norm of all previous gradients. Given the parameters \(\theta\), gradient \(\nabla J\), accumulating norm \(G\), and smoothing factor \(\epsilon\), we use the update equations

\[G' = G + (\nabla J)^2\]
\[\theta' = \theta - \frac{\alpha}{\sqrt{G'} + \epsilon}\nabla J\]

where the smoothing factor \(\epsilon\) prevents division by zero. By adjusting the learning rate individually for each parameter, Adagrad adapts to the geometry of the error surface: differently scaled weights receive appropriately scaled update steps.
Example usage:
from neon.optimizers import Adagrad
# use Adagrad with a learning rate of 0.01
optimizer = Adagrad(learning_rate=0.01, epsilon=1e-6)
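A short NumPy sketch (illustrative numbers, not neon's code) shows the per-parameter adaptation:

import numpy as np

alpha, eps = 0.01, 1e-6
G = np.zeros(2)                     # accumulated squared gradients
grad = np.array([10.0, 0.1])        # gradients on two very differently scaled weights

G = G + grad ** 2
step = alpha * grad / (np.sqrt(G) + eps)
# step is ~[0.01, 0.01]: both weights move by about the same amount,
# even though their raw gradients differ by two orders of magnitude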
Adadelta¶
Adadelta was designed to address two drawbacks of the above Adagrad algorithm:
- Continual decay of learning rates over training caused by the accumulation of the \(L_2\)-norm.
- Need for a manually tuned learning rate \(\alpha\).
Similar to RMSprop, Adadelta tracks a running average of the squared gradients, \(\mu_J\), over a window of size \(1/\lambda\), where \(\lambda\) is the parameter decay. Adadelta also tracks a running average of the recent squared update steps, which we denote as \(\mu_\theta\), and sets the learning rate as the ratio of the two averages:

\[\mu'_J = \lambda\mu_J + (1-\lambda)(\nabla J)^2\]
\[\Delta\theta = \sqrt{\frac{\mu_\theta + \epsilon}{\mu'_J + \epsilon}}\,\nabla J\]
\[\mu'_\theta = \lambda\mu_\theta + (1-\lambda)(\Delta\theta)^2\]
\[\theta' = \theta - \Delta\theta\]

Note that the learning rate is a ratio of the average updates from the previous step, \(\mu_\theta\), divided by the average gradients including the current step, \(\mu'_J\).
Example usage:
from neon.optimizers import Adadelta
# use Adadelta with a decay rate of 0.95
optimizer = Adadelta(decay=0.95, epsilon=1e-6)
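One step of the Adadelta equations in NumPy (a sketch with illustrative numbers) highlights that no global learning rate appears anywhere:

import numpy as np

decay, eps = 0.95, 1e-6
mu_J, mu_theta = 0.0, 0.0           # running averages of squared gradients and squared updates
grad = 0.5

mu_J = decay * mu_J + (1 - decay) * grad ** 2
delta = np.sqrt((mu_theta + eps) / (mu_J + eps)) * grad   # effective step, no alpha
mu_theta = decay * mu_theta + (1 - decay) * delta ** 2
# the step size is set entirely by the ratio of the two running averages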
Adam¶
The Adam optimizer combines features from RMSprop and Adagrad. We accumulate both the first and second moments of the gradient with decay rates \(\beta_1\) and \(\beta_2\) corresponding to window sizes of \(1/\beta_1\) and \(1/\beta_2\), respectively:

\[m' = \beta_1 m + (1-\beta_1)\nabla J\]
\[v' = \beta_2 v + (1-\beta_2)(\nabla J)^2\]

We update the parameters by the ratio of the two moments:

\[\theta' = \theta - \alpha\,\frac{\hat{m}'}{\sqrt{\hat{v}'} + \epsilon}\]

where we compute the bias-corrected moments \(\hat{m}'\) and \(\hat{v}'\) via

\[\hat{m}' = \frac{m'}{1 - \beta_1^t}, \qquad \hat{v}' = \frac{v'}{1 - \beta_2^t}\]

with \(t\) the current time step.
Example usage:
from neon.optimizers import Adam
# use Adam
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
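The bias correction matters most in the first few steps, when the moment estimates are still close to their zero initialization. Here is a NumPy sketch of the very first update (illustrative numbers, not neon's code):

import numpy as np

alpha, beta_1, beta_2, eps = 0.001, 0.9, 0.999, 1e-8
m, v, t = 0.0, 0.0, 1               # first moment, second moment, time step
grad = 0.5

m = beta_1 * m + (1 - beta_1) * grad
v = beta_2 * v + (1 - beta_2) * grad ** 2
m_hat = m / (1 - beta_1 ** t)       # undoes the bias toward zero
v_hat = v / (1 - beta_2 ** t)
step = alpha * m_hat / (np.sqrt(v_hat) + eps)
# at t = 1 the corrected moments recover the raw gradient statistics, so step ~= alpha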
Using multiple optimizers¶
Often, we want to assign differently configured optimizers to different layers. For example, when training AlexNet, the learning rates and schedules for the bias layers differ from those used for the convolutional and pooling layers. We first define the different optimizers:
from neon.optimizers import GradientDescentMomentum, RMSProp
optimizer_A = GradientDescentMomentum(learning_rate=0.01, momentum_coef=0.9)
optimizer_B = GradientDescentMomentum(learning_rate=0.05, momentum_coef=0.9)
optimizer_C = RMSProp(learning_rate=2e-3, decay_rate=0.95)
Then, we instantiate a neon.optimizers.MultiOptimizer and pass it a dictionary mapping layers to optimizers. The keys can either be default, a layer class name (e.g. Bias), or a layer's name attribute. The latter takes precedence for finer layer-to-layer control.
For example, if we have the following layers,
from neon.initializers import Gaussian
from neon.layers import Affine, Linear
from neon.transforms import Softmax

layers = []
layers.append(Linear(nout=100, init=Gaussian(), name="layer_one"))
layers.append(Linear(nout=50, init=Gaussian(), name="layer_two"))
layers.append(Affine(nout=5, init=Gaussian(), activation=Softmax()))
we can define multiple optimizers with
from neon.optimizers import MultiOptimizer
# dictionary of mappings
mapping = {'default': optimizer_A,   # default optimizer
           'Linear': optimizer_B,    # all layers from the Linear class
           'layer_two': optimizer_C} # this overrides the previous entry for a specific layer
# use multiple optimizers
opt = MultiOptimizer(mapping)
After definition, we have the following mapping:

Layer | Optimizer |
---|---|
layer_one | optimizer_B |
layer_two | optimizer_C |
Affine.Linear | optimizer_B |
Affine.Bias | optimizer_A |
Affine.Softmax | None (no parameters) |
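The MultiOptimizer is then passed to training exactly like a single optimizer. For example (a sketch only; train_set, cost, and callbacks are assumed to have been created elsewhere, as in a typical neon training script):

from neon.models import Model

# wrap the layers in a model and train with the per-layer optimizer mapping
model = Model(layers=layers)
model.fit(train_set, optimizer=opt, cost=cost, num_epochs=10, callbacks=callbacks)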
Creating new optimizers¶
To create new optimizers, subclass from neon.optimizers.Optimizer and implement the constructor and the optimize method:

def __init__(self, myparam_1, stochastic_round=False,
             gradient_clip_value=None, gradient_clip_norm=None):
    """
    Constructor including arguments for optimizer-specific parameters,
    stochastic rounding (optional), gradient clipping (optional), and
    gradient scaling (optional).
    """

def optimize(self, layer_list, epoch):
    """
    Given the model's layers and the current training epoch,
    iterate over each layer and update the weights.
    """
Neon provides helper methods to iterate over the layers. Here is the skeleton for a custom optimize method:
def optimize(self, layer_list, epoch):
    # get a flattened list of layer weights
    param_list = get_param_list(layer_list)

    # iterate over the weights (param), gradients (grad), and
    # any accumulated variables (states)
    for (param, grad), states in param_list:
        # if states not initialized, allocate with zeros
        if len(states) == 0:
            states.append(self.be.zeros_like(grad))

        # scale gradient by size of minibatch (be.bsz)
        grad = grad / self.be.bsz

        delta_param = ...  # enter your update equations here
        param[:] = param + delta_param
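Putting these pieces together, a complete (if minimal) custom optimizer might look like the sketch below. It implements plain SGD and assumes, as the built-in optimizers do, that get_param_list can be imported from neon.optimizers.optimizer; treat it as an illustration rather than a drop-in component:

from neon.optimizers.optimizer import Optimizer, get_param_list

class SimpleSGD(Optimizer):
    """Plain SGD: param <- param - learning_rate * grad / batch_size."""

    def __init__(self, learning_rate, stochastic_round=False,
                 gradient_clip_value=None, gradient_clip_norm=None, name=None):
        super(SimpleSGD, self).__init__(name=name)
        self.learning_rate = learning_rate
        self.stochastic_round = stochastic_round
        self.gradient_clip_value = gradient_clip_value
        self.gradient_clip_norm = gradient_clip_norm

    def optimize(self, layer_list, epoch):
        param_list = get_param_list(layer_list)
        for (param, grad), states in param_list:
            # scale by minibatch size and apply optional value clipping
            grad = self.clip_gradient_value(grad / self.be.bsz,
                                            self.gradient_clip_value)
            param[:] = param - self.learning_rate * grad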
For more guidance, consult the source code for the existing optimization algorithms in neon/optimizers/optimizer.py.