Datasets

The Dataset class is a base class for commonly-used datasets. We recommend creating an object class for your dataset that handles loading and preprocessing of the data. Datasets should implement gen_iterators(), which returns a dictionary of data iterators used for training and evaluation (see Loading data).
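
For instance, a minimal custom dataset might look like the following sketch. The class and data here are hypothetical, and it assumes a backend has already been generated (e.g. via gen_backend) so that ArrayIterator can allocate buffers:

import numpy as np
from neon.data import ArrayIterator

class RandomImages(object):
    """Toy dataset: random 28x28 greyscale images in 10 classes (illustrative only)."""
    def __init__(self, nclass=10):
        self.nclass = nclass
        self.X_train = np.random.rand(1000, 784)
        self.y_train = np.random.randint(nclass, size=1000)
        self.X_valid = np.random.rand(100, 784)
        self.y_valid = np.random.randint(nclass, size=100)

    def gen_iterators(self):
        # return a dictionary mapping split names to data iterators
        return {'train': ArrayIterator(self.X_train, self.y_train,
                                       nclass=self.nclass, lshape=(1, 28, 28)),
                'valid': ArrayIterator(self.X_valid, self.y_valid,
                                       nclass=self.nclass, lshape=(1, 28, 28))}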

Neon provides dataset objects for handling many stock datasets.

MNIST

MNIST is a dataset of handwritten digits consisting of 60,000 training samples and 10,000 test samples. Each sample is a 28x28 greyscale image.

MNIST can be fetched in the following manner:

from neon.data import MNIST

# downloads the data (if not already present) and builds the iterators
mnist = MNIST(path='path/to/save/downloadeddata/')
train_set = mnist.train_iter
valid_set = mnist.valid_iter

The path argument designates the directory in which to store the downloaded dataset. If the dataset already exists in that directory, the download is skipped. If path is not provided, the default data path is used.
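
For example, omitting path stores the data under the default location:

# no path given, so the default data directory is used
mnist = MNIST()
train_set = mnist.train_iter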

CIFAR10

CIFAR10 is a dataset consisting of 50,000 training samples and 10,000 test samples. There are 10 categories, and each sample is a 32x32 RGB color image.

CIFAR10 can be fetched in the following manner:

from neon.data import CIFAR10
cifar10 = CIFAR10()
train = cifar10.train_iter
test = cifar10.valid_iter

ImageCaption

This dataset uses precomputed CNN image features and caption sentences. It works with the flickr8k, flickr30k, and COCO datasets and uses the VGG image features and sentences from http://cs.stanford.edu/people/karpathy/deepimagesent/, which have been converted to python .pkl format. These datasets have 5 reference sentences per image. For each sentence, the dataset converts each word to its one-hot representation so that each input batch of sentences has dimension (vocab_size, max_sentence_length * batch_size).
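
To make the batch dimensions concrete, here is a small arithmetic sketch (the numbers are made up for illustration):

vocab_size = 10000            # assumed vocabulary size
max_sentence_length = 20      # assumed longest caption, in words
batch_size = 128

# each word becomes a one-hot column; sentences are laid out along columns
batch_shape = (vocab_size, max_sentence_length * batch_size)
print(batch_shape)            # (10000, 2560)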

The image caption data can be fetched in the following manner:

# download dataset
from neon.data import Flickr8k
flickr8k = Flickr8k()  # Other set names are Flickr30k and Coco
train_set = flickr8k.train_iter

Text

For existing datasets (e.g. Penn Treebank, Hutter Prize, and Shakespeare), we have object classes for loading, and sometimes pre-processing, the data. The online sources are stored in the __init__ method. Some datasets (such as Penn Treebank) also accept a tokenizer (string) to parse the file. The tokenizer is a string that matches the name of one of the tokenizer functions included in the class definition. For example, the newline_tokenizer method in the PTB class replaces all newline characters (i.e. \n) with the string <eos> and splits the string. These datasets use gen_iterators() to return Text iterators.

from neon.data import PTB

# download Penn Treebank and parse at the word level
time_steps = 20  # sequence length per training example (illustrative value)
ptb = PTB(time_steps, tokenizer="newline_tokenizer")
train_set = ptb.train_iter
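
The full dictionary of Text iterators can also be fetched with gen_iterators(); the split names used below ('train' and 'valid') are assumed here:

# fetch the dict of Text iterators
iters = ptb.gen_iterators()
train_set = iters['train']   # key names assumed
valid_set = iters['valid']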

ImageNet

The raw images need to be downloaded from ILSVRC as a tar file. Because the data is too large to fit in memory, it must be loaded in batches (called “macrobatches”; see Loading data). We first write the macrobatches with the batch_writer.py script. data_dir is where the processed batches will be stored, and image_dir is where the original tar files are saved.

python neon/data/batch_writer.py  --data_dir /usr/local/data/tmp \
                                  --image_dir /usr/local/data/I1K/imagenet_orig \
                                  --set_type i1k

We then create the ImageNet dataset object and get the training data iterator, which is an instance of the ImageLoader class. ImageLoader allows for fast loading and feeding of macrobatches to the model.

from neon.data import I1K

# create the I1K object
# (args.data_dir is the same directory passed as --data_dir above)
i1k = I1K(data_dir=args.data_dir, inner_size=224, subset_pct=100)

# fetch a dict of iterators
# iters['train'] is an iterator (neon.data.ImageLoader) for the training data
# iters['val'] is an iterator for the validation data
iters = i1k.gen_iterators()

QA and bAbI

A bAbI dataset object can be created by specifying which task and which subset (20 tasks and 4 subsets in bAbI) to retrieve. The object will use built-in metadata to get bAbI data from online sources, save and unzip the files for that task locally, and then vectorize the story-question-answer data. The training and test files are both needed to build a vocabulary set.

A general question answering container can take the story-question-answer data from a bAbI data object and create data iterators for training.

from neon.data import BABI, QA

# get the bAbI data
babi = BABI(path='.', task='qa15_basic-deduction', subset='en')

# create a QA iterator
train_set = QA(*babi.train)
valid_set = QA(*babi.test)
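
The vocabulary built from both files can then be used to size the model (for example, an embedding layer); the attribute name below is assumed from neon's bAbI example:

# vocabulary size collected from the train and test files
# (attribute name assumed)
vocab_size = babi.vocab_size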

Low-level dataset operations

Some applications require access to the underlying data to generate more complex data iterators. This can be done with the load_data method of the Dataset class and its subclasses. The method returns the data arrays that are used to generate the data iterators. For example, the code below shows how to generate a data iterator to train an autoencoder on the MNIST dataset:

from neon.data import MNIST, ArrayIterator

mnist = MNIST()
# get the raw data arrays, both train set and validation set
(X_train, y_train), (X_test, y_test), nclass = mnist.load_data()

# generate an ArrayIterator with no target data
# this will return the image itself as the target
train = ArrayIterator(X_train, lshape=(1, 28, 28))
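
The same arrays can also be turned into ordinary classification iterators by passing the labels and the number of classes:

# for a classifier, include the labels and number of classes
train = ArrayIterator(X_train, y_train, nclass=nclass, lshape=(1, 28, 28))
test = ArrayIterator(X_test, y_test, nclass=nclass, lshape=(1, 28, 28))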