The Dataset class is a base class for commonly used
datasets. We recommend creating an object class for your dataset that
handles the loading and preprocessing of the data. Datasets should implement
gen_iterators(), which returns a dictionary of data
iterators used for training and evaluation (see Loading data).
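As a rough sketch of this pattern, the plain-Python class below mimics the shape of such a dataset object without importing neon. ToyDataset, its constructor arguments, and the list-based "iterators" are hypothetical stand-ins chosen for illustration; a real dataset would subclass neon's Dataset and return neon data iterators.

```python
class ToyDataset(object):
    """Hypothetical stand-in that mirrors the gen_iterators() pattern."""

    def __init__(self, n_train=6, n_valid=2):
        self.n_train = n_train
        self.n_valid = n_valid
        self._data_dict = None  # built lazily on first access

    def load_data(self):
        # Loading and preprocessing would live here. This toy version
        # generates synthetic (input, label) pairs instead of reading files.
        total = self.n_train + self.n_valid
        samples = [(float(i), i % 2) for i in range(total)]
        return samples[:self.n_train], samples[self.n_train:]

    def gen_iterators(self):
        # Return a dictionary of iterables keyed by their role.
        train, valid = self.load_data()
        self._data_dict = {'train': train, 'valid': valid}
        return self._data_dict

    @property
    def train_iter(self):
        if self._data_dict is None:
            self.gen_iterators()
        return self._data_dict['train']

    @property
    def valid_iter(self):
        if self._data_dict is None:
            self.gen_iterators()
        return self._data_dict['valid']


dataset = ToyDataset()
for sample, label in dataset.train_iter:
    print(sample, label)
```

Keeping loading in one method and iterator construction in another means the same preprocessing can feed both the training and the evaluation iterator.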
Neon provides dataset objects for handling many stock datasets.
MNIST is a dataset of handwritten digits consisting of 60,000 training samples and 10,000 test samples. Each image is 28x28 greyscale pixels.
MNIST can be fetched in the following manner:
from neon.data import MNIST

mnist = MNIST(path='path/to/save/downloadeddata/')
train_set = mnist.train_iter
valid_set = mnist.valid_iter
The path argument designates the directory in which to store
the downloaded dataset. If the dataset already exists in that directory,
the download is skipped. The default data path is used if
path is not provided.
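The skip-if-present behaviour described above can be sketched in a few lines of standard-library Python. This is an illustration only, not neon's implementation: fetch_if_missing, fake_download, and the file name dataset.bin are hypothetical names, and no network access takes place.

```python
import os
import tempfile


def fetch_if_missing(filename, path, download):
    """Return (local file path, downloaded?), downloading only when needed.

    `download` is a hypothetical callable that takes the destination path.
    """
    if not os.path.isdir(path):
        os.makedirs(path)
    filepath = os.path.join(path, filename)
    if os.path.exists(filepath):
        return filepath, False  # already cached, so skip the download
    download(filepath)
    return filepath, True


def fake_download(destination):
    # Stand-in for a network fetch: writes a few placeholder bytes.
    with open(destination, 'wb') as handle:
        handle.write(b'placeholder')


cache_dir = tempfile.mkdtemp()
first = fetch_if_missing('dataset.bin', cache_dir, fake_download)
second = fetch_if_missing('dataset.bin', cache_dir, fake_download)
print(first[1], second[1])
```

The second call finds the file written by the first call and returns without invoking the download callable.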
CIFAR10 is a dataset consisting of 50,000 training samples and 10,000 test samples. There are 10 categories, and each sample is a 32x32 RGB color image.
CIFAR10 can be fetched in the following manner:
from neon.data import CIFAR10

cifar10 = CIFAR10()
train = cifar10.train_iter
test = cifar10.valid_iter
For existing datasets (e.g. Penn Treebank, Hutter Prize, and
Shakespeare), we have object classes for loading, and sometimes
pre-processing, the data. The online sources are stored in the
__init__ method. Some datasets (such as Penn Treebank) also accept a
tokenizer (string) to parse the file. The tokenizer is a string that
matches the name of one of the tokenizer functions included in
the class definition. For example, the newline_tokenizer method of the
PTB class replaces all newline characters (i.e. \n) with
<eos> and splits the string. These datasets use gen_iterators()
to return an iterator.
from neon.data import PTB

# sequence length per sample (example value)
time_steps = 20

# download Penn Treebank and parse at the word level
ptb = PTB(time_steps, tokenizer="newline_tokenizer")
train_set = ptb.train_iter
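To make the tokenizer lookup concrete, here is a minimal sketch of a class that selects a tokenizer method by its string name, in the spirit of the newline_tokenizer option used above. ToyText and its tokenize method are hypothetical, and the character-level fallback when no tokenizer is given is an assumption of this sketch rather than documented behaviour.

```python
class ToyText(object):
    """Hypothetical dataset class that selects a tokenizer by name."""

    def __init__(self, tokenizer=None):
        # Look up the method whose name matches the tokenizer string.
        self.tokenizer_func = getattr(self, tokenizer) if tokenizer else None

    @staticmethod
    def newline_tokenizer(text):
        # Replace each newline with an <eos> token, then split on whitespace.
        return text.replace('\n', ' <eos> ').split()

    def tokenize(self, text):
        if self.tokenizer_func is None:
            # Assumption of this sketch: fall back to character-level tokens.
            return list(text)
        return self.tokenizer_func(text)


words = ToyText(tokenizer='newline_tokenizer').tokenize('the cat\nsat down\n')
chars = ToyText().tokenize('ab\n')
print(words)
print(chars)
```

Passing the tokenizer as a string keeps the constructor call simple, and an unknown name fails early with an AttributeError from getattr.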
The raw ImageNet images need to be downloaded from ILSVRC as a tar file. Because
the data is too large to fit in memory, it must be loaded from disk to host,
and then from host to device (if using a non-CPU backend), while being augmented
appropriately. For this type of data, we use the aeon dataloader, which is
described in Loading data. Examples of how to use aeon
with ImageNet in particular are shown in
examples/imagenet, with the data
preparation procedure (extracting from tar, resizing the images, generating manifest
files listing images and labels) encapsulated in the script
QA and bAbI
A bAbI dataset object can be created by specifying which task and which
subset to retrieve (bAbI has 20 tasks and 4 subsets). The object uses
built-in metadata to get the bAbI data from online sources, save and unzip
the files for that task locally, and then vectorize the
story-question-answer data. Both the training and test files are needed
to build the vocabulary.
A general question-answering container, QA, takes the story-question-answer data from a bAbI data object and creates a data iterator for training.
from neon.data import BABI
from neon.data import QA

# get the bAbI data
babi = BABI(path='.', task='qa15_basic-deduction', subset='en')

# create a QA iterator
train_set = QA(*babi.train)
valid_set = QA(*babi.test)
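The vectorization step, and the reason both the training and test files feed the vocabulary, can be illustrated with a small stand-alone sketch. The helper names build_vocab and vectorize, the left-padding scheme, and the toy sentences are assumptions made for this example and are not taken from the BABI class.

```python
def build_vocab(*example_sets):
    """Map every word seen in any split to a positive integer index."""
    words = set()
    for examples in example_sets:
        for story, question, answer in examples:
            words.update(story + question + [answer])
    # Index 0 is reserved for padding.
    return {word: i + 1 for i, word in enumerate(sorted(words))}


def vectorize(examples, vocab, story_len, query_len):
    """Turn token lists into fixed-length, left-padded index lists."""
    def pad(tokens, length):
        ids = [vocab[t] for t in tokens][-length:]
        return [0] * (length - len(ids)) + ids

    return [(pad(s, story_len), pad(q, query_len), vocab[a])
            for s, q, a in examples]


train = [("mary went to the kitchen".split(),
          "where is mary".split(), "kitchen")]
test = [("john went to the garden".split(),
         "where is john".split(), "garden")]

# Build the vocabulary from both splits so test-only words get an index.
vocab = build_vocab(train, test)
train_vec = vectorize(train, vocab, story_len=6, query_len=3)
test_vec = vectorize(test, vocab, story_len=6, query_len=3)
print(train_vec[0])
```

A vocabulary built from the training split alone would have no entry for "john" or "garden", so vectorizing the test split would raise a KeyError.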
Low level dataset operations
Some applications require access to the underlying data to generate more
complex data iterators. This can be done by using the load_data()
method of the Dataset class and its subclasses. The method returns
the data arrays that are used to generate the data iterators. For
example, the code below shows how to generate a data iterator to
train an autoencoder on the MNIST dataset:
from neon.data import MNIST
from neon.data import ArrayIterator

mnist = MNIST()

# get the raw data arrays, both train set and validation set
(X_train, y_train), (X_test, y_test), nclass = mnist.load_data()

# generate an ArrayIterator with no target data
# this will return the image itself as the target
train = ArrayIterator(X_train, lshape=(1, 28, 28))
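The idea of an iterator that returns the input as its own target can be sketched without neon. ToyArrayIterator below is a hypothetical plain-Python stand-in, not the ArrayIterator implementation. It uses tiny 1 x 2 x 2 "images" in place of MNIST, where lshape=(1, 28, 28) corresponds to 1 * 28 * 28 = 784 values per flattened image.

```python
class ToyArrayIterator(object):
    """Hypothetical minibatch iterator; targets default to the inputs."""

    def __init__(self, X, y=None, batch_size=2, lshape=None):
        self.X = X
        self.y = y
        self.batch_size = batch_size
        self.lshape = lshape  # (channels, height, width) of one sample

    def __iter__(self):
        for start in range(0, len(self.X), self.batch_size):
            stop = start + self.batch_size
            inputs = self.X[start:stop]
            # With no target data, each input doubles as its own target.
            targets = inputs if self.y is None else self.y[start:stop]
            yield inputs, targets


# Four flattened "images" of 1 x 2 x 2 = 4 values each.
X_train = [[0.0, 0.1, 0.2, 0.3], [0.4, 0.5, 0.6, 0.7],
           [0.8, 0.9, 1.0, 0.0], [0.1, 0.2, 0.3, 0.4]]
train = ToyArrayIterator(X_train, lshape=(1, 2, 2))
batches = list(train)
print(len(batches))
```

Because an autoencoder is trained to reconstruct its input, pairing each minibatch with itself is all the iterator needs to provide.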