neon.data.text.Text

class neon.data.text.Text(time_steps, path, vocab=None, tokenizer=None, onehot_input=True, reverse_target=False, get_prev_target=False)[source]

Bases: neon.data.dataiterator.NervanaDataIterator

This class defines methods for loading and iterating over text datasets.

__init__(time_steps, path, vocab=None, tokenizer=None, onehot_input=True, reverse_target=False, get_prev_target=False)[source]

Construct a text dataset object.

Parameters:
  • time_steps (int) – Length of a sequence.
  • path (str) – Path to text file.
  • vocab (python.set) – A set of unique tokens.
  • tokenizer (function) – Tokenizer function.
  • onehot_input (boolean) – Use a one-hot representation of the input.
  • reverse_target (boolean) – For sequence-to-sequence models, set to True to reverse the target sequence; this also disables shifting the target by one time step.
  • get_prev_target (boolean) – For sequence-to-sequence models, set to True for training data to provide the correct target from the previous time step as the decoder input. When set, shape will be a tuple of shapes corresponding to the encoder and decoder inputs.
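As an illustration of what `onehot_input=True` implies, the following pure-Python sketch (not neon's internal implementation, which builds tensors on its backend) expands a sequence of token indices into a one-hot matrix of shape (vocab_size, time_steps):

```python
def onehot_sequence(indices, vocab_size):
    """Expand token indices into a (vocab_size x time_steps) one-hot matrix.

    Illustrative sketch only; neon builds the equivalent tensors on its backend.
    """
    steps = len(indices)
    mat = [[0] * steps for _ in range(vocab_size)]
    for t, idx in enumerate(indices):
        mat[idx][t] = 1  # token `idx` is active at time step `t`
    return mat

# A 4-step sequence over a 5-token vocabulary:
# 5 rows (one per vocabulary entry), 4 columns (one per time step).
x = onehot_sequence([0, 3, 1, 3], vocab_size=5)
```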

Methods

__init__(time_steps, path[, vocab, …]) Construct a text dataset object.
create_valid_file(path[, valid_split]) Create separate files for training and validation.
gen_class(pdict)
get_description([skip]) Returns a dict containing all information needed to serialize this object.
get_tokens(string[, tokenizer]) Map string to a list of tokens.
get_vocab(tokens[, vocab]) Construct vocabulary from the given tokens.
nbatches() Return the number of minibatches in this dataset.
pad_data(path[, vocab_size, …]) Deprecated, use neon.data.text_preprocessing.pad_data.
pad_sentences(sentences[, sentence_length, …]) Deprecated, use neon.data.text_preprocessing.pad_sentences.
recursive_gen(pdict, key) Helper method to check whether the definition dictionary defines a NervanaObject child.
reset() Reset the starting index of this dataset back to zero.
be = None
classnm

Returns the class name.

static create_valid_file(path, valid_split=0.1)[source]

Create separate files for training and validation.

Parameters:
  • path (str) – Path to data file.
  • valid_split (float, optional) – Fraction of data to set aside for validation.
Returns:

Paths to the training file and the validation file.

Return type:

str, str
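A minimal sketch of the split described above, assuming a line-oriented text file and hypothetical `.train`/`.valid` output names (the real method's file naming and splitting details may differ):

```python
def split_valid_file(path, valid_split=0.1):
    """Write the first (1 - valid_split) fraction of lines to a training file
    and the remainder to a validation file; return both paths.

    Hypothetical helper illustrating the documented behavior of
    Text.create_valid_file, not the library implementation itself.
    """
    with open(path) as f:
        lines = f.readlines()
    split_at = int(len(lines) * (1.0 - valid_split))
    train_path, valid_path = path + '.train', path + '.valid'
    with open(train_path, 'w') as f:
        f.writelines(lines[:split_at])
    with open(valid_path, 'w') as f:
        f.writelines(lines[split_at:])
    return train_path, valid_path
```

For example, with `valid_split=0.1` a 10-line corpus yields a 9-line training file and a 1-line validation file.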

gen_class(pdict)
get_description(skip=[], **kwargs)

Returns a dict containing all information needed to serialize this object.

Parameters: skip (list) – Objects to omit from the dictionary.
Returns: Dictionary format for object information.
Return type: (dict)
static get_tokens(string, tokenizer=None)[source]

Map string to a list of tokens.

Parameters:
  • string (str) – String to be tokenized.
  • tokenizer (function) – Tokenizer function.
Returns:

A list of tokens

Return type:

list
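The behavior can be sketched in a few lines; here a whitespace split stands in for the default case (an assumption for illustration only — neon's actual default tokenization may differ, e.g. it may operate at the character level):

```python
def get_tokens(string, tokenizer=None):
    """Map a string to a list of tokens.

    If a tokenizer callable is given, defer to it; otherwise split on
    whitespace (an illustrative default, not necessarily neon's).
    """
    if tokenizer is not None:
        return tokenizer(string)
    return string.split()

print(get_tokens("the quick brown fox"))
# ['the', 'quick', 'brown', 'fox']
print(get_tokens("a-b c", tokenizer=lambda s: s.replace('-', ' ').split()))
# ['a', 'b', 'c']
```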

static get_vocab(tokens, vocab=None)[source]

Construct vocabulary from the given tokens.

Parameters:
  • tokens (list) – List of tokens.
  • vocab (set, optional) – Existing vocabulary to use, if any (default None).
Returns:

A set of unique tokens

Return type:

python.set
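A sketch of the logic, assuming that an existing `vocab`, when supplied, is reused rather than rebuilt (a common pattern for sharing one vocabulary across training and validation splits; the library's exact handling may differ):

```python
def get_vocab(tokens, vocab=None):
    """Return a set of unique tokens.

    If `vocab` is given, return it unchanged so that training and
    validation data can share one vocabulary (illustrative assumption).
    """
    if vocab is not None:
        return vocab
    return set(tokens)

print(get_vocab(['to', 'be', 'or', 'not', 'to', 'be']))
# a set of the four unique tokens (set order is arbitrary)
```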

modulenm

Returns the full module path.

nbatches()

Return the number of minibatches in this dataset.
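The count is simple integer arithmetic — the number of complete minibatches that fit into the dataset, with any trailing partial batch dropped (a sketch consistent with the note under reset() about the last uneven minibatch):

```python
def nbatches(ndata, batch_size):
    """Number of complete minibatches; a trailing partial batch is not counted."""
    return ndata // batch_size

print(nbatches(1000, 32))  # 31 full batches; the last 8 examples are left over
```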

static pad_data(path, vocab_size=20000, sentence_length=100, oov=2, start=1, index_from=3, seed=113, test_split=0.2)[source]

Deprecated, use neon.data.text_preprocessing.pad_data.

static pad_sentences(sentences, sentence_length=None, dtype=<class 'numpy.int32'>, pad_val=0.0)[source]

Deprecated, use neon.data.text_preprocessing.pad_sentences.

recursive_gen(pdict, key)

Helper method that checks whether the definition dictionary defines a NervanaObject child; if so, it instantiates that object and replaces the dictionary element with the resulting instance.

reset()[source]

Reset the starting index of this dataset back to zero. This is relevant when running repeated evaluations on the dataset without wrapping around for the last uneven minibatch. It is not necessary when ndata is divisible by the batch size.
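The role of reset() can be seen in a toy index-based iterator (a sketch, not neon's NervanaDataIterator):

```python
class ToyIterator:
    """Minimal index-based minibatch iterator illustrating reset() semantics."""

    def __init__(self, ndata, batch_size):
        self.ndata = ndata
        self.batch_size = batch_size
        self.start = 0  # persists across iteration passes

    def __iter__(self):
        # Yield (begin, end) index ranges for each full minibatch;
        # the last uneven minibatch is skipped.
        while self.start + self.batch_size <= self.ndata:
            yield (self.start, self.start + self.batch_size)
            self.start += self.batch_size

    def reset(self):
        """Rewind to index zero before a fresh evaluation pass."""
        self.start = 0

it = ToyIterator(ndata=10, batch_size=4)
first_pass = list(it)   # [(0, 4), (4, 8)]; the last 2 examples don't fill a batch
second_pass = list(it)  # [] -- without reset(), nothing is left to yield
it.reset()
third_pass = list(it)   # [(0, 4), (4, 8)] again
```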