Design Decisions¶
Computation backend¶
- All objects inherit from `NervanaObject`, which has a static `be` variable holding the computation backend in use (gpu, mkl, or cpu). `be` also stores other important attributes, such as the batch size and data type.
- A backend must first be generated using `gen_backend` before running a model.
- If swapping backends, buffers must be reinitialized by reinstantiating the model layers and calling fprop with the newly generated backend.
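For example, a backend can be generated before any layers are constructed (a minimal sketch; the `batch_size` and `datatype` values are illustrative):

```python
import numpy as np
from neon.backends import gen_backend

# Generate the computation backend. Every NervanaObject created
# afterwards shares this backend through its static `be` attribute.
be = gen_backend(backend='cpu',       # or 'gpu' / 'mkl' if available
                 batch_size=128,      # stored on the backend
                 datatype=np.float32)
```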
Data Layout¶
Neon’s layers internally store data as two-dimensional tensors. For convolution and pooling layers, the data is formatted in \((C, H, W, N)\) layout (\(C\) = channels, \(H\) = height, \(W\) = width, \(N\) = batch size) and represented as a tensor of shape \((F, N)\), where \(F = C * H * W\).
For recurrent layers, the time dimension \(T\) is folded into the \(N\) dimension, so the data format is \((F, T*N)\). The second dimension is ordered by incrementing the batch index first: \(t_1 n_1, t_1 n_2, \ldots, t_1 n_N, t_2 n_1, t_2 n_2, \ldots\)
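As a concrete illustration (a plain numpy sketch, not neon code; the sizes are arbitrary), a batch in \((C, H, W, N)\) layout flattens to the \((F, N)\) shape the layers store:

```python
import numpy as np

C, H, W, N = 3, 32, 32, 128        # channels, height, width, batch size
x = np.random.rand(C, H, W, N)     # data in (C, H, W, N) layout

F = C * H * W                      # flattened feature dimension
x2d = x.reshape(F, N)              # the (F, N) tensor the layers operate on

T = 10                             # sequence length for recurrent layers
xr = np.random.rand(F, T * N)      # (F, T*N); columns ordered
                                   # t1n1, t1n2, ..., t1nN, t2n1, ...
```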
Layers¶
- Most layers are in layer.py; recurrent layers are in recurrent.py; and merge layers for concatenating or summing inputs are in merge.py.
Composite layers¶
- For convenience, some layers are composite layers built as lists of other layers:
  - `Conv` is a list of `Convolution`, `Bias`, and `Activation` layers
  - `Affine` is a list of `Linear`, `Bias`, and `Activation` layers
- This allows the flexibility of adding optional bias and activation layers without having to specify them separately, as in the sketch below.
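A minimal sketch of the two composite layers (the sizes, initializers, and activation are illustrative):

```python
from neon.initializers import Constant, Gaussian
from neon.layers import Affine, Conv
from neon.transforms import Rectlin

init = Gaussian(scale=0.01)

# Conv expands into Convolution + Bias + Activation sub-layers;
# the bias and activation are optional and added only when given.
conv = Conv(fshape=(5, 5, 16), init=init,
            bias=Constant(0.0), activation=Rectlin())

# Affine expands into Linear + Bias + Activation sub-layers
fc = Affine(nout=100, init=init,
            bias=Constant(0.0), activation=Rectlin())
```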
Layer buffer allocations¶
- Data buffers
  - A layer infers its input shape from the previous layer and initializes buffers accordingly.
  - Pre-allocating activation buffers allows buffer reuse and reduces memory usage.
  - Buffers will be reinitialized during the next fprop if the layer is reinstantiated.
- Parameter layers (`Linear`, `Bias`, `Convolution`, and `BatchNorm`) maintain their own parameters `W`, gradients `dW`, and optimizer states `states`.
- In general, layer buffer allocation is kicked off by the containing model, which triggers it prior to the first `fit` or `eval` call. The sketch below walks through the allocation steps on a single layer.
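A rough sketch of the allocation flow on a standalone layer (the method names `configure`, `allocate`, and `init_params` follow the neon source, but these internal details may differ across versions):

```python
from neon.backends import gen_backend
from neon.initializers import Gaussian
from neon.layers import Linear

be = gen_backend(backend='cpu', batch_size=16)

layer = Linear(nout=10, init=Gaussian(scale=0.01))
layer.configure(20)    # infer shapes: 20 input features -> 10 outputs
layer.allocate()       # allocate output buffer and parameters

print(layer.W.shape)   # parameter buffer, filled by the initializer
print(layer.dW.shape)  # gradient buffer, same shape as W
print(layer.states)    # optimizer state, populated during updates
```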
Initialization¶
- Weight initialization routines are in initializers.py, and all have a `fill` method that describes how they fill a given parameter buffer.
- The weight initialization object is passed to the layer constructor, and the layer fills its parameters during `init_params`.
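For example, `fill` can also be called directly on a backend tensor (a minimal sketch; the buffer shapes are arbitrary):

```python
from neon.backends import gen_backend
from neon.initializers import Constant, Gaussian

be = gen_backend(backend='cpu', batch_size=128)

W = be.empty((100, 784))                 # a parameter buffer
Gaussian(loc=0.0, scale=0.01).fill(W)    # fill with random normal values

b = be.empty((100, 1))                   # a bias buffer
Constant(val=0.0).fill(b)                # fill with a constant
```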
Models¶
Model container¶
- The model is a container for all the network layers and provides the function calls to run and train the network. It is also responsible for initializing and allocating layer parameter buffers.
- We can create a list of layers and give that to the model, as in the sketch below.
- When the forward or backward propagation functions are called, the model iterates through the layers, forward-passing the inputs and backward-passing the errors.
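A minimal sketch of building a model from a layer list (the layer sizes and activations are illustrative):

```python
from neon.backends import gen_backend
from neon.initializers import Gaussian
from neon.layers import Affine
from neon.models import Model
from neon.transforms import Rectlin, Softmax

be = gen_backend(backend='cpu', batch_size=128)

init = Gaussian(scale=0.01)
layers = [Affine(nout=100, init=init, activation=Rectlin()),
          Affine(nout=10, init=init, activation=Softmax())]

# The model owns the layers and their buffers; fprop/bprop iterate
# through this list in order.
mlp = Model(layers=layers)
```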
Learning¶
- When training the model, the following components must be provided (see the training sketch after this list):
  - a training set object that can iterate over the training data
  - an optimizer applied to all layer updates, or a multi-optimizer that maps different optimizers to different layers by layer name
  - a cost function to compute the error
  - a callback object that configures whether to use a validation set and how frequently to validate during training, whether to display a progress bar, etc. For more information, see neon fundamentals – callbacks.
- During the update, the model sends the optimizer a list of all layers with learnable parameters.
  - The optimizer then grabs a tuple of (W, dW, states) from each layer and applies the updates.
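A training sketch, assuming `mlp` is the model built above and that `train_set` and `valid_set` are data iterators (e.g., ArrayIterator objects); the hyperparameters are illustrative:

```python
from neon.callbacks.callbacks import Callbacks
from neon.layers import GeneralizedCost
from neon.optimizers import GradientDescentMomentum, MultiOptimizer
from neon.transforms import CrossEntropyMulti

opt = GradientDescentMomentum(learning_rate=0.1, momentum_coef=0.9)

# Alternatively, map optimizers to layers by name; 'default' is the fallback:
# opt = MultiOptimizer({'default': opt,
#                       'Bias': GradientDescentMomentum(0.2, 0.9)})

cost = GeneralizedCost(costfunc=CrossEntropyMulti())
callbacks = Callbacks(mlp, eval_set=valid_set, eval_freq=1)

mlp.fit(train_set, optimizer=opt, cost=cost,
        num_epochs=10, callbacks=callbacks)
```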
Choice of sizes¶
- We get better hardware utilization if we pick friendly sizes for the batch size, sequence length, and feature size.
- Our GPU kernels are optimized for sizes that are multiples of 4.
- Many of our examples use parameters taken from reference implementations; however, multiples of 4 are recommended. In many cases, zero-padding is needed to implement the same model with the friendlier sizes, as in the sketch below.
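A hypothetical numpy helper (not part of neon) that zero-pads a dimension up to the next multiple of 4:

```python
import numpy as np

def pad_to_multiple(x, multiple=4, axis=0):
    """Zero-pad `axis` of x up to the next multiple of `multiple`."""
    pad = (-x.shape[axis]) % multiple
    widths = [(0, 0)] * x.ndim
    widths[axis] = (0, pad)
    return np.pad(x, widths, mode='constant')

x = np.random.rand(50, 128)        # feature size 50 is not a multiple of 4
print(pad_to_multiple(x).shape)    # (52, 128)
```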