neon.optimizers.optimizer.GradientDescentMomentum

class neon.optimizers.optimizer.GradientDescentMomentum(learning_rate, momentum_coef, stochastic_round=False, wdecay=0.0, gradient_clip_norm=None, gradient_clip_value=None, param_clip_value=None, name=None, schedule=<neon.optimizers.optimizer.Schedule object>, nesterov=False)[source]

Bases: neon.optimizers.optimizer.Optimizer

Stochastic gradient descent with momentum.

Given the parameters \(\theta\), the learning rate \(\alpha\), and the gradients \(\nabla J(\theta; x)\) computed on the minibatch data \(x\), SGD updates the parameters via

\[\theta' = \theta - \alpha\nabla J(\theta; x)\]

Here we implement SGD with momentum. Momentum tracks the history of gradient updates to help the system move faster through saddle points. Given the additional parameters: momentum \(\gamma\), weight decay \(\lambda\), and current velocity \(v\), we use the following update equations

\[v' = \gamma v - \alpha(\nabla J(\theta; x) + \lambda\theta)\]

\[\theta' = \theta + v'\]

The optional nesterov parameter implements Nesterov Accelerated Gradient. If this is set, we use the following update equations instead:

\[v' = \gamma^2 v - \alpha(\gamma + 1)(\nabla J(\theta; x) + \lambda\theta)\]

\[\theta' = \theta + v'\]
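For illustration, here is a minimal NumPy sketch of both update rules. It is an approximation for reference only; the actual optimizer operates on backend tensors, and the helper name below is hypothetical.

import numpy as np

def sgd_momentum_step(param, velocity, grad, alpha, gamma, wdecay=0.0, nesterov=False):
    # v' = gamma * v - alpha * (grad + wdecay * theta)
    velocity = gamma * velocity - alpha * (grad + wdecay * param)
    if nesterov:
        # expanding this step recovers
        # theta' = theta + gamma^2 v - alpha (gamma + 1)(grad + wdecay * theta)
        param = param + gamma * velocity - alpha * (grad + wdecay * param)
    else:
        # theta' = theta + v'
        param = param + velocity
    return param, velocity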

Example usage:

from neon.optimizers import GradientDescentMomentum

# use SGD with learning rate 0.01 and momentum 0.9, while
# clipping gradient values element-wise to between -5 and 5.
opt = GradientDescentMomentum(0.01, 0.9, gradient_clip_value=5)
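Weight decay and the Nesterov variant described above are enabled through the corresponding constructor arguments (the parameter values here are illustrative only):

from neon.optimizers import GradientDescentMomentum

# SGD with momentum 0.9, a small weight decay, and Nesterov accelerated gradient
opt_nag = GradientDescentMomentum(0.01, 0.9, wdecay=0.0005, nesterov=True)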
__init__(learning_rate, momentum_coef, stochastic_round=False, wdecay=0.0, gradient_clip_norm=None, gradient_clip_value=None, param_clip_value=None, name=None, schedule=<neon.optimizers.optimizer.Schedule object>, nesterov=False)[source]

Class constructor.

Parameters:
  • learning_rate (float) – Multiplicative coefficient of updates
  • momentum_coef (float) – Coefficient of momentum
  • stochastic_round (bool, optional) – Set this to True to enable stochastic rounding; if False (the default), rounding is to nearest. When True, default-width stochastic rounding is used. Note that this only affects the GPU backend.
  • wdecay (float, optional) – Amount of weight decay. Defaults to 0.
  • gradient_clip_norm (float, optional) – Target gradient norm. Defaults to None.
  • gradient_clip_value (float, optional) – Value to element-wise clip gradients. Defaults to None.
  • param_clip_value (float, optional) – Value to element-wise clip parameters. Defaults to None.
  • name (str, optional) – The optimizer’s pretty-print name. Defaults to “gdm”.
  • schedule (neon.optimizers.optimizer.Schedule, optional) – Learning rate schedule. Defaults to a constant learning rate; see the example after this list.
  • nesterov (bool, optional) – Use nesterov accelerated gradient. Defaults to False.
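A sketch of passing a non-default learning rate schedule, assuming Schedule is imported from neon.optimizers.optimizer; the step epochs and decay factor below are illustrative only:

from neon.optimizers import GradientDescentMomentum
from neon.optimizers.optimizer import Schedule

# drop the learning rate by 10x at epochs 10 and 20 (values illustrative)
sched = Schedule(step_config=[10, 20], change=0.1)
opt = GradientDescentMomentum(0.01, 0.9, schedule=sched)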

Methods

__init__(learning_rate, momentum_coef[, …]) Class constructor.
clip_gradient_norm(param_list, clip_norm) Returns a scaling factor to apply to the gradients.
clip_value(v[, abs_bound]) Element-wise clip a gradient or parameter tensor to between -abs_bound and +abs_bound.
gen_class(pdict)
get_description([skip]) Returns a dict that contains all necessary information needed to serialize this object.
optimize(layer_list, epoch) Apply the learning rule to all the layers and update the states.
recursive_gen(pdict, key) Helper method to check whether the definition dictionary defines a NervanaObject child and, if so, instantiate it.
be = None
classnm

Returns the class name.

clip_gradient_norm(param_list, clip_norm)

Returns a scaling factor to apply to the gradients.

The scaling factor is computed such that the root mean squared average of the scaled gradients across all layers will be less than or equal to the provided clip_norm value. This factor is never greater than 1, so it never scales up the gradients.

Parameters:
  • param_list (list) – List of layer parameters
  • clip_norm (float, optional) – Target norm for the gradients. If not provided the returned scale_factor will equal 1.
Returns:

Computed scale factor.

Return type:

scale_factor (float)
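Conceptually the factor can be computed as below. This is a NumPy sketch of the idea, not the backend implementation, and the function name is hypothetical:

import numpy as np

def grad_scale_factor(grads, clip_norm=None):
    if clip_norm is None:
        return 1.0
    # global norm over all gradient tensors in the list
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # shrink only when the norm exceeds the target; never scale up
    return clip_norm / max(total_norm, clip_norm)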

clip_value(v, abs_bound=None)

Element-wise clip a gradient or parameter tensor to between -abs_bound and +abs_bound.

Parameters:
  • v (tensor) – Tensor of gradients or parameters for a single layer
  • abs_bound (float, optional) – Value to element-wise clip gradients or parameters. Defaults to None.
Returns:

Tensor of clipped gradients or parameters.

Return type:

v (tensor)
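In NumPy terms this amounts to a symmetric element-wise clip (a sketch, not the backend code):

import numpy as np

def clip_value(v, abs_bound=None):
    # no-op when no bound is given
    if abs_bound is None:
        return v
    return np.clip(v, -abs_bound, abs_bound)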

gen_class(pdict)
get_description(skip=[], **kwargs)

Returns a dict that contains all necessary information needed to serialize this object.

Parameters: skip (list) – Objects to omit from the dictionary.
Returns: Dictionary format for object information.
Return type: (dict)
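A minimal usage sketch; the exact keys of the returned dictionary depend on neon's serialization format, and a backend may need to be set up beforehand (e.g. via gen_backend):

from neon.optimizers import GradientDescentMomentum

opt = GradientDescentMomentum(0.01, 0.9)
desc = opt.get_description()  # dict describing the optimizer's configuration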
modulenm

Returns the full module path.

optimize(layer_list, epoch)[source]

Apply the learning rule to all the layers and update the states.

Parameters:
  • layer_list (list) – a list of Layer objects to optimize.
  • epoch (int) – the current epoch, needed for the Schedule object.
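Schematically, for each layer parameter the method derives the scheduled learning rate, applies any configured clipping, and performs the momentum update described above. The following is an illustrative NumPy-level sketch; the names and structure are hypothetical and simplified, not neon's internals:

import numpy as np

def sgd_update_all(params, grads, velocities, lrate, gamma,
                   wdecay=0.0, clip_norm=None, clip_value=None):
    # global-norm scaling factor (see clip_gradient_norm above)
    scale = 1.0
    if clip_norm is not None:
        total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
        scale = clip_norm / max(total_norm, clip_norm)
    for p, g, v in zip(params, grads, velocities):
        g = scale * g
        if clip_value is not None:
            g = np.clip(g, -clip_value, clip_value)   # element-wise clip
        v[:] = gamma * v - lrate * (g + wdecay * p)   # velocity update
        p[:] = p + v                                  # parameter update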
recursive_gen(pdict, key)

Helper method to check whether the definition dictionary defines a NervanaObject child; if so, it instantiates that object and replaces the dictionary element with an instance of that object.