Optimizers

Critical to any deep learning model is the optimization of weights to perform the given task. This set of classes provides options for selecting and customizing the appropriate optimization algorithm. Neon supports the following optimizers:

Function                                   Description
neon.optimizers.GradientDescentMomentum    Stochastic gradient descent with momentum
neon.optimizers.RMSProp                    Root Mean Square propagation (see Hinton’s slides)
neon.optimizers.Adagrad                    Adagrad method for adapting the learning rate
neon.optimizers.Adadelta                   Adadelta method for adapting the learning rate
neon.optimizers.Adam                       Adam optimization algorithm
neon.optimizers.MultiOptimizer             Class for assigning optimizers to different layers

Each optimization algorithm inherits from the neon.optimizers.Optimizer class and implements the optimize method:

"""
Given the model's layers and the current training epoch,
iterate over each layer and update the weights.
"""
def optimize(self, layer_list, epoch):

The neon.optimizers.Optimizer base class also implements two methods:

  1. clip_gradient_value, which clips each gradient element to lie between \(-k\) and \(k\), where \(k\) is the argument gradient_clip_value.
  2. clip_gradient_norm, which rescales the gradients so that their combined norm does not exceed \(k\), where \(k\) is the argument gradient_clip_norm (both operations are sketched below).
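
For intuition, the two operations can be sketched with NumPy. This is only an illustration of the math; neon applies these operations to its backend tensors, and the helper names here are hypothetical.

import numpy as np

def clip_value_sketch(grad, k):
    # elementwise clipping: every entry is forced into [-k, k]
    return np.clip(grad, -k, k)

def clip_norm_sketch(grads, k):
    # rescale all gradients together so their combined L2 norm is at most k
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = k / max(total_norm, k)
    return [g * scale for g in grads]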

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) has existed for a while, but its usefulness in training deep neural networks has only recently been realized. SGD is similar to traditional gradient descent, except that the gradient updates are computed over a small subset of the total training data (i.e., a minibatch).

Given the parameters \(\theta\), the learning rate \(\alpha\), and the gradients \(\nabla J(\theta; x)\) computed on the minibatch data \(x\), SGD updates the parameters via

\[\theta' = \theta - \alpha\nabla J(\theta; x)\]

Here we implement SGD with momentum. Momentum tracks the history of gradient updates to help the system move faster through saddle points. Given the additional hyperparameters momentum \(\gamma\) and weight decay \(\lambda\), and the current velocity \(v\), we use the following update equations:

\[\begin{split}v' &= \gamma v - \alpha(\nabla J(\theta; x) + \lambda\theta) \\ \theta' &= \theta + v'\end{split}\]
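
To make the two equations concrete, here is one update step for a single scalar weight in plain Python (an illustration only, not the neon implementation):

# one SGD-with-momentum step for a single scalar weight (illustrative only)
alpha, gamma, lam = 0.01, 0.9, 1e-4   # learning rate, momentum, weight decay
theta, v = 0.5, 0.0                   # parameter and its velocity
grad = 0.2                            # gradient of J at theta on the minibatch

v = gamma * v - alpha * (grad + lam * theta)   # velocity update
theta = theta + v                              # parameter update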

Example usage:

from neon.optimizers import GradientDescentMomentum
# use SGD with learning rate 0.01 and momentum 0.9, while
# clipping the gradients between -5 and 5.
opt = GradientDescentMomentum(0.01, 0.9, gradient_clip_value=5)
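
The optimizer is then handed to the model when training starts. Assuming a Model, cost, training data iterator, and callbacks have already been constructed (as in the neon examples), this typically looks like:

# `model`, `cost`, `train_set`, and `callbacks` are assumed to be defined elsewhere
model.fit(train_set, optimizer=opt, cost=cost, num_epochs=10, callbacks=callbacks)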

RMS propagation

Root Mean Square (RMS) propagation protects against vanishing and exploding gradients. In RMSprop, the gradient is divided by a running average of recent gradients. Given the parameters \(\theta\) and gradient \(\nabla J\), we keep a running average \(\mu\) of roughly the last \(1/(1-\lambda)\) squared gradients. The update equations are then given by

\[\begin{split}\mu' &= \lambda\mu + (1-\lambda)(\nabla J)^2 \\ \theta' &= \theta - \frac{\alpha}{\sqrt{\mu' + \epsilon} + \epsilon}\nabla J\end{split}\]

where we use \(\epsilon\) as a (small) smoothing factor to prevent division by zero.

When reaching a plateau in the error surface, the gradient is very small, but the normalization factor here increases the update step for faster learning (small update: \(\alpha\nabla J = 0.0001\), but square root of the weighted average: \(\sqrt{\mu} = 0.00002\), yielding an update of \(0.0001/0.00002 = 5\)). If the gradients are exploding, RMSprop also provides protection (large update: \(\alpha\nabla J = 100\), but weighted average \(\sqrt{\mu} = 20\), yielding a much smaller update of 5). Because of these advantages, RMSprop is often used in recurrent neural networks to protect against vanishing or exploding gradients.
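
One update step can be traced in plain Python (an illustration of the equations above, not the neon implementation):

# one RMSProp step for a single scalar weight (illustrative only)
alpha, lam, eps = 2e-3, 0.95, 1e-6    # learning rate, decay rate, smoothing factor
theta, mu = 0.5, 0.0                  # parameter and running average of squared gradients
grad = 0.2

mu = lam * mu + (1 - lam) * grad ** 2                       # update the running average
theta = theta - alpha / ((mu + eps) ** 0.5 + eps) * grad    # normalized update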

Example usage:

from neon.optimizers import RMSProp
# use RMSProp with a decay rate of 0.95 and a learning rate of 2e-3
optimizer = RMSProp(decay_rate=0.95, learning_rate=2e-3)

Adagrad

Adagrad is an algorithm that adapts the learning rate individually for each parameter by dividing by the \(L_2\)-norm of all previous gradients. Given the parameters \(\theta\), gradient \(\nabla J\), accumulating norm \(G\), and smoothing factor \(\epsilon\), we use the update equations:

\[\begin{split}G' &= G + (\nabla J)^2 \\ \theta' &= \theta - \frac{\alpha}{\sqrt{G' + \epsilon}} \nabla J\end{split}\]

where the smoothing factor \(\epsilon\) prevents division by zero. By adjusting the learning rate individually for each parameter, Adagrad adapts to the geometry of the error surface: differently scaled weights receive appropriately scaled update steps.
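
This adaptation can be seen in a small plain-Python illustration (not the neon implementation), where two gradients that differ by a factor of 100 end up producing update steps of similar magnitude:

# Adagrad equalizes the step size across differently scaled parameters (illustrative only)
alpha, eps = 0.01, 1e-6
G = [0.0, 0.0]                    # accumulated squared gradients, one per parameter
grads = [0.001, 0.1]              # consistently small vs. consistently large gradients

for _ in range(10):
    steps = []
    for i, g in enumerate(grads):
        G[i] += g ** 2
        steps.append(alpha / (G[i] + eps) ** 0.5 * g)

print(steps)   # both entries are roughly alpha / sqrt(10) after ten steps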

Example usage:

from neon.optimizers import Adagrad
# use Adagrad with a learning rate of 0.01
optimizer = Adagrad(learning_rate=0.01, epsilon=1e-6)

Adadelta

Adadelta was designed to address two drawbacks of the above Adagrad algorithm:

  1. The continual decay of learning rates over training, caused by the accumulation of the \(L_2\)-norm.
  2. The need for a manually tuned learning rate \(\alpha\).

Similar to RMSprop, Adadelta tracks a running average of the gradients, \(\mu_J\), over a window of roughly \(1/(1-\lambda)\) steps, where \(\lambda\) is the decay parameter. Adadelta also tracks an average of the recent update steps, which we denote as \(\mu_\theta\), and sets the learning rate as the ratio of the two averages:

\[\begin{split}\mu_J' &= \lambda\mu_J + (1-\lambda) (\nabla J)^2 \\ \Delta \theta &= \sqrt{\frac{\mu_\theta + \epsilon}{\mu_J' + \epsilon}} \nabla J \\ \mu_\theta' &= \lambda \mu_\theta + (1-\lambda) (\Delta \theta)^2 \\ \theta' &= \theta - \Delta \theta\end{split}\]

Note that the effective learning rate is the ratio of the average updates from the previous step, \(\mu_\theta\), to the average gradients including the current step, \(\mu_J'\).
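
A single Adadelta step can be traced in plain Python (an illustration of the equations above, not the neon implementation):

# one Adadelta step for a single scalar weight (illustrative only)
lam, eps = 0.95, 1e-6             # decay and smoothing factor
theta = 0.5
mu_J, mu_theta = 0.0, 0.0         # running averages of squared gradients and squared updates
grad = 0.2

mu_J = lam * mu_J + (1 - lam) * grad ** 2
delta = ((mu_theta + eps) / (mu_J + eps)) ** 0.5 * grad   # ratio of the two averages
mu_theta = lam * mu_theta + (1 - lam) * delta ** 2
theta = theta - delta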

Example usage:

from neon.optimizers import Adadelta
# use Adadelta with a decay rate of 0.95
optimizer = Adadelta(decay=0.95, epsilon=1e-6)

Adam

The Adam optimizer combines features from RMSprop and Adagrad. We accumulate both the first and second moments of the gradient with decay rates \(\beta_1\) and \(\beta_2\), corresponding to window sizes of roughly \(1/(1-\beta_1)\) and \(1/(1-\beta_2)\), respectively.

\[\begin{split}m' &= \beta_1 m + (1-\beta_1) \nabla J \\ v' &= \beta_2 v + (1-\beta_2) (\nabla J)^2\end{split}\]

We update the parameters by the ratio of the two moments:

\[\theta' = \theta - \alpha \frac{\hat{m}'}{\sqrt{\hat{v}'}+\epsilon}\]

where we compute the bias-corrected moments \(\hat{m}'\) and \(\hat{v}'\) via

\[\begin{split}\hat{m}' &= m'/(1-\beta_1^t) \\ \hat{v}' &= v'/(1-\beta_2^t)\end{split}\]
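
The bias correction matters mostly during the first few steps, when \(m'\) and \(v'\) are still close to their zero initialization. A quick numeric check in plain Python (illustrative only):

# effect of bias correction on the very first Adam step (illustrative only)
beta_1, beta_2 = 0.9, 0.999
m, v, t = 0.0, 0.0, 1
grad = 0.2

m = beta_1 * m + (1 - beta_1) * grad         # 0.02, far smaller than the gradient itself
v = beta_2 * v + (1 - beta_2) * grad ** 2    # 4e-5
m_hat = m / (1 - beta_1 ** t)                # 0.2, the bias toward zero is removed
v_hat = v / (1 - beta_2 ** t)                # 0.04, i.e. grad ** 2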

Example usage:

from neon.optimizers import Adam
# use Adam
optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

Using multiple optimizers

Often, we may want to assign differently configured optimizers to different layers. For example, when training AlexNet, the learning rates and schedules for the bias layers differ from those of the convolutional and pooling layers. We first define the different optimizers:

from neon.optimizers import GradientDescentMomentum, RMSProp
optimizer_A = GradientDescentMomentum(learning_rate=0.01, momentum_coef=0.9)
optimizer_B = GradientDescentMomentum(learning_rate=0.05, momentum_coef=0.9)
optimizer_C = RMSProp(learning_rate=2e-3, decay_rate=0.95)

Then, we instantiate a neon.optimizers.MultiOptimizer and pass a dictionary mapping layers to optimizers. The keys can be default, a layer class name (e.g. Bias), or a layer’s name attribute. The name attribute takes precedence, allowing finer layer-by-layer control.

For example, if we have the following layers,

from neon.initializers import Gaussian
from neon.layers import Linear, Affine
from neon.transforms import Softmax

layers = []
layers.append(Linear(nout=100, init=Gaussian(), name="layer_one"))
layers.append(Linear(nout=50, init=Gaussian(), name="layer_two"))
layers.append(Affine(nout=5, init=Gaussian(), activation=Softmax()))

we can define multiple optimizers with

from neon.optimizers import MultiOptimizer
# dictionary of mappings
mapping = {'default': optimizer_A, # default optimizer
           'Linear': optimizer_B, # all layers from the Linear class
           'layer_two': optimizer_C} # this overrides the previous entry for a specific layer
# use multiple optimizers
opt = MultiOptimizer(mapping)

After definition, we have the following mapping:

Layer            Optimizer
layer_one        optimizer_B
layer_two        optimizer_C
Affine.Linear    optimizer_B
Affine.Bias      optimizer_A
Affine.Softmax   None (no parameters)

Creating new optimizers

To create new optimizers, subclass from neon.optimizers.Optimizer and implement the constructor and the optimize method:

"""
Constructor to include arguments for optimizer-specific parameters,
stochastic rounding (optional), gradient clipping (optional), and gradient scaling (optional)
"""
def __init__(self, myparam_1, stochastic_round=False, \
             gradient_clip_value=None, gradient_clip_norm=None):
"""
Given the model's layers and the current training epoch,
iterate over each layer and update the weights.
"""
def optimize(self, layer_list, epoch):

Neon provides helper methods to iterate over the layers. Here is a skeleton for a custom optimize method:

def optimize(self, layer_list, epoch):
    # get a flattened list of the layers' weights
    # (get_param_list is defined in neon/optimizers/optimizer.py)
    param_list = get_param_list(layer_list)
    # iterate over the weights (param), gradients (grad), and
    # any accumulated variables (states)
    for (param, grad), states in param_list:
        # if the states are not yet initialized, allocate with zeros
        if len(states) == 0:
            states.append(self.be.zeros_like(grad))
        # scale the gradient by the size of the minibatch (be.bsz)
        grad = grad / self.be.bsz
        # compute the update (replace the placeholder with your update equations)
        delta_param = ...
        param[:] = param + delta_param
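
Putting the pieces together, a complete but minimal custom optimizer might look like the sketch below. It implements plain SGD without momentum and leans on the base-class clipping helpers described earlier; the import path for get_param_list and the base-class constructor signature are assumptions based on the neon source tree, so treat this as a starting point rather than a verified implementation.

from neon.optimizers import Optimizer
from neon.optimizers.optimizer import get_param_list

class SimpleSGD(Optimizer):
    """Minimal custom optimizer: plain SGD without momentum (illustrative sketch)."""

    def __init__(self, learning_rate, stochastic_round=False,
                 gradient_clip_value=None, gradient_clip_norm=None, name=None):
        super(SimpleSGD, self).__init__(name=name)  # assumes the base class accepts a name
        self.learning_rate = learning_rate
        self.stochastic_round = stochastic_round
        self.gradient_clip_value = gradient_clip_value
        self.gradient_clip_norm = gradient_clip_norm

    def optimize(self, layer_list, epoch):
        param_list = get_param_list(layer_list)
        # base-class helper: scale factor that keeps the combined gradient norm bounded
        scale_factor = self.clip_gradient_norm(param_list, self.gradient_clip_norm)
        for (param, grad), states in param_list:
            grad = grad / self.be.bsz
            # base-class helper: elementwise clipping to [-k, k]
            grad = self.clip_gradient_value(grad, self.gradient_clip_value)
            param[:] = param - self.learning_rate * scale_factor * grad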

For more guidance, consult the source code for the existing optimization algorithms in neon/optimizers/optimizer.py.