pytext.data package

Submodules

pytext.data.batch_sampler module

class pytext.data.batch_sampler.AlternatingRandomizedBatchSampler(unnormalized_iterator_probs: Dict[str, float], second_unnormalized_iterator_probs: Dict[str, float])[source]

Bases: pytext.data.batch_sampler.RandomizedBatchSampler

This sampler takes in a dictionary of iterators and returns batches, alternating between the keys and probabilities specified by unnormalized_iterator_probs and second_unnormalized_iterator_probs. This is used, for example, in XLM pre-training, where we alternate between MLM and TLM batches.

batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.data.batch_sampler.AlternatingRandomizedBatchSampler.Config)[source]
class pytext.data.batch_sampler.BaseBatchSampler[source]

Bases: pytext.config.component.Component

batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.config.component.Component.Config)[source]
class pytext.data.batch_sampler.EvalBatchSampler[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes in a dictionary of Iterators and returns batches associated with each key in the dictionary. It guarantees that we will see each batch associated with each key exactly once in the epoch.

Example

Iterator 1: [A, B, C, D], Iterator 2: [a, b]

Output: [A, B, C, D, a, b]

batchify(iterators: Dict[str, collections.abc.Iterator])[source]

Loop through each key in the input dict and generate batches from the iterator associated with that key.

Parameters:iterators – Dictionary of iterators
class pytext.data.batch_sampler.NaturalBatchSampler(dataset_counts: Dict[str, int])[source]

Bases: pytext.data.batch_sampler.RandomizedBatchSampler

This sampler iterates over all the datasets, sampling according to the weighted number of samples in each dataset.

batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.data.batch_sampler.NaturalBatchSampler.Config)[source]
class pytext.data.batch_sampler.RandomizedBatchSampler(unnormalized_iterator_probs: Dict[str, float])[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes in a dictionary of iterators and returns batches according to the probabilities specified by unnormalized_iterator_probs. We cycle through the iterators (restarting any that “run out”) indefinitely. Set batches_per_epoch in Trainer.Config.

Example

Iterator A: [A, B, C, D], Iterator B: [a, b]

batches_per_epoch = 3, unnormalized_iterator_probs = {"A": 0, "B": 1}
Epoch 1 = [a, b, a]
Epoch 2 = [b, a, b]

Parameters:unnormalized_iterator_probs (Dict[str, float]) – Iterator sampling probabilities. The keys should be the same as the keys of the underlying iterators, and the values will be normalized to sum to 1.
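
A minimal standalone sketch of this weighted selection (illustrative only, not the PyText implementation; the iterator factories below are made up):

import random
from typing import Callable, Dict, Iterator, List, Tuple

def sample_batches(
    iterator_factories: Dict[str, Callable[[], Iterator]],
    unnormalized_probs: Dict[str, float],
    num_batches: int,
) -> List[Tuple[str, object]]:
    # Normalize the probabilities so they sum to 1.
    total = sum(unnormalized_probs.values())
    names = list(unnormalized_probs)
    probs = [unnormalized_probs[name] / total for name in names]
    live = {name: factory() for name, factory in iterator_factories.items()}
    batches = []
    for _ in range(num_batches):
        key = random.choices(names, weights=probs, k=1)[0]
        try:
            batches.append((key, next(live[key])))
        except StopIteration:
            live[key] = iterator_factories[key]()  # restart an exhausted iterator
            batches.append((key, next(live[key])))
    return batches

# Mirrors the example above: with probs {"A": 0, "B": 1} we only ever draw from B.
print(sample_batches(
    {"A": lambda: iter("ABCD"), "B": lambda: iter("ab")},
    {"A": 0.0, "B": 1.0},
    num_batches=3,
))  # e.g. [('B', 'a'), ('B', 'b'), ('B', 'a')]
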
batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.data.batch_sampler.RandomizedBatchSampler.Config)[source]
class pytext.data.batch_sampler.RoundRobinBatchSampler(iter_to_set_epoch: Optional[str] = None)[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes a dictionary of Iterators and returns batches in a round-robin fashion until the end of one of the iterators is reached. The end is specified by iter_to_set_epoch.

If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.

If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.

Example

Iterator 1: [A, B, C, D], Iterator 2: [a, b]

iter_to_set_epoch = "Iterator 1"
Output: [A, a, B, b, C, a, D, b]

iter_to_set_epoch = None
Output: [A, a, B, b]

Parameters:iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If this is not set, epoch size defaults to the length of the shortest iterator.
batchify(iterators: Dict[str, collections.abc.Iterator])[source]

Loop through each key in the input dict and generate batches from the iterator associated with that key until the target iterator reaches its end.

Parameters:iterators – Dictionary of iterators
classmethod from_config(config: pytext.data.batch_sampler.RoundRobinBatchSampler.Config)[source]
pytext.data.batch_sampler.extract_iterator_properties(input_iterator_probs: Dict[str, float])[source]

Helper function for RandomizedBatchSampler and AlternatingRandomizedBatchSampler to generate iterator properties: iterator_names and iterator_probs.

pytext.data.batch_sampler.select_key_and_batch(iterator_names: Dict[str, str], iterator_probs: Dict[str, float], iter_dict: Dict[str, collections.abc.Iterator], iterators: Dict[str, collections.abc.Iterator])[source]

Helper function for RandomizedBatchSampler and AlternatingRandomizedBatchSampler to select a key from iterator_names using iterator_probs and return a batch for the selected key using iter_dict and iterators.

pytext.data.bert_tensorizer module

class pytext.data.bert_tensorizer.BERTTensorizer(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, **kwargs)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizerBase

Tensorizer for BERT tasks. Works for single sentences, sentence pairs, triples, etc.

classmethod from_config(config: pytext.data.bert_tensorizer.BERTTensorizer.Config, **kwargs)[source]

from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes which derive from this class).

class pytext.data.bert_tensorizer.BERTTensorizerBase(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, base_tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None)[source]

Bases: pytext.data.tensorizers.Tensorizer

Base Tensorizer class for all BERT style models including XLM, RoBERTa and XLM-R.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

initialize(vocab_builder=None, from_scratch=True)[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.
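
As an illustration, a toy tensorizer-style initialize coroutine that builds a frequency-ordered token list from a hypothetical "text" column could look like this (a sketch of the contract only, not PyText's actual vocab-building code):

import collections

class WordCountTensorizer:
    def __init__(self, column="text"):  # hypothetical column name
        self.column = column
        self.vocab = None

    def initialize(self, from_scratch=True):
        counter = collections.Counter()
        try:
            while True:
                row = yield                  # the data layer sends one row dict
                counter.update(row[self.column].split())
        except GeneratorExit:
            # finalize: keep tokens ordered by corpus frequency
            self.vocab = [token for token, _ in counter.most_common()]

# How the data layer drives the coroutine:
tensorizer = WordCountTensorizer()
init = tensorizer.initialize()
init.send(None)                              # prime the coroutine
for row in [{"text": "order coffee"}, {"text": "order tea"}]:
    init.send(row)
init.close()                                 # raises GeneratorExit inside initialize
print(tensorizer.vocab)                      # ['order', 'coffee', 'tea']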

numberize(row: Dict[KT, VT]) → Tuple[Any, ...][source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

sort_key(row)[source]
tensorize(batch) → Tuple[torch.Tensor, ...][source]

Convert instance level vectors into batch level tensors.

tensorizer_script_impl = None
class pytext.data.bert_tensorizer.BERTTensorizerBaseScriptImpl(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int)[source]

Bases: pytext.data.tensorizers.TensorizerScriptImpl

forward(inputs: pytext.torchscript.utils.ScriptBatchInput) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Wire up the tokenize(), numberize() and tensorize() functions for data processing. When exporting to TorchScript, the wrapper module should choose whether to use texts or pre_tokenized based on the TorchScript tokenizer implementation (e.g. whether an external tokenizer such as Yoda is used).

numberize(per_sentence_tokens: List[List[Tuple[str, int, int]]]) → Tuple[List[int], List[int], int, List[int]][source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

Parameters:per_sentence_tokens – list of tokens per sentence in one row; each token is represented by the token string and its start and end indices.
Returns:

  • tokens: List[int], the token ids of all sentences, concatenated.
  • segment_labels: List[int], denotes which sentence each token belongs to.
  • seq_len: int, the token sequence length.
  • positions: List[int], the token positions.

tensorize(tokens_2d: List[List[int]], segment_labels_2d: List[List[int]], seq_lens_1d: List[int], positions_2d: List[List[int]]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Convert instance level vectors into batch level tensors.

tokenize(row_text: Optional[List[str]], row_pre_tokenized: Optional[List[List[str]]]) → List[List[Tuple[str, int, int]]][source]

This function converts raw inputs into tokens; each token is represented by the token string and its start and end indices in the raw inputs. There are two possible inputs to this function, depending on whether the tokenizer is implemented in TorchScript or not.

Case 1: The tokenizer has a full TorchScript implementation; the input will be a list of sentences (in most cases a single sentence or a pair).

Case 2: The tokenizer has a partial or no TorchScript implementation; in most cases the tokenizer will be hosted in Yoda, and the input will be a list of pre-processed tokens.

Returns:tokens per sentence level, each token is represented by token(str), start and end indices.
Return type:per_sentence_tokens
torchscriptify()[source]
class pytext.data.bert_tensorizer.BERTTensorizerScriptImpl(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizerBaseScriptImpl

pytext.data.bert_tensorizer.build_fairseq_vocab(vocab_file: str, dictionary_class: fairseq.data.dictionary.Dictionary = <class 'fairseq.data.dictionary.Dictionary'>, special_token_replacements: Dict[str, pytext.common.constants.Token] = None, max_vocab: int = -1, min_count: int = -1, tokens_to_add: Optional[List[str]] = None) → pytext.data.utils.Vocabulary[source]

Builds a PyText vocabulary for models pre-trained using Fairseq modules. The dictionary_class can be any Fairseq Dictionary class and is used to load the vocab file.
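
A usage sketch, assuming a Fairseq-style dict.txt (one "token count" pair per line); the file path, the special-token mapping, and the location of the special-token constants in pytext.data.utils are assumptions to adapt to your setup:

from pytext.data.bert_tensorizer import build_fairseq_vocab
from pytext.data.utils import BOS, EOS, MASK, PAD, UNK  # assumed import location

vocab = build_fairseq_vocab(
    vocab_file="/path/to/dict.txt",          # placeholder path
    special_token_replacements={
        "<pad>": PAD,                        # map the dict file's strings to PyText tokens
        "<s>": BOS,
        "</s>": EOS,
        "<unk>": UNK,
        "<mask>": MASK,
    },
    max_vocab=50000,                         # keep at most 50k entries (-1 for all)
)
print(len(vocab))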

pytext.data.data module

class pytext.data.data.BatchData(raw_data, numberized)[source]

Bases: tuple

numberized

Alias for field number 1

raw_data

Alias for field number 0

class pytext.data.data.Batcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16)[source]

Bases: pytext.config.component.Component

Batcher designed to batch rows of data, before padding.

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

Group rows by batch_size. Assumes an iterable of dicts, yields dicts of lists. The last batch will be of length len(iterable) % batch_size.
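
The dict-of-lists grouping can be pictured with this short sketch (illustrative only, not the PyText implementation):

from typing import Dict, Iterable, List

def batchify_rows(rows: Iterable[Dict], batch_size: int):
    """Group an iterable of row dicts into one dict of lists per batch."""
    batch: List[Dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield {key: [r[key] for r in batch] for key in batch[0]}
            batch = []
    if batch:  # the last, possibly smaller, batch
        yield {key: [r[key] for r in batch] for key in batch[0]}

rows = [{"text": "hi", "label": 0}, {"text": "bye", "label": 1}, {"text": "ok", "label": 0}]
print(list(batchify_rows(rows, batch_size=2)))
# [{'text': ['hi', 'bye'], 'label': [0, 1]}, {'text': ['ok'], 'label': [0]}]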

classmethod from_config(config: pytext.data.data.Batcher.Config)[source]
class pytext.data.data.Data(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, sort_key: Optional[str] = None, in_memory: Optional[bool] = True, init_tensorizers: Optional[bool] = True, init_tensorizers_from_scratch: Optional[bool] = True)[source]

Bases: pytext.config.component.Component

Data is an abstraction that handles all of the following:

  • Initialize model metadata parameters
  • Create batches of tensors for model training or prediction

It can accomplish these in any way it needs to. The base implementation utilizes pytext.data.sources.DataSource, and sends batches to pytext.data.tensorizers.Tensorizer to create tensors.

The tensorizers dict passed to the initializer should be considered something like a signature for the model. Each batch should be a dictionary with the same keys as the tensorizers dict, and values should be tensors arranged in the way specified by that tensorizer. The tensorizers dict doubles as a simple baseline implementation of that same signature, but subclasses of Data can override the implementation using other methods. This value is how the model specifies what inputs it’s looking for.

add_row_indices(rows)[source]
batches(stage: pytext.common.constants.Stage, data_source=None, load_early=False)[source]

Create batches of tensors to pass to model train_batch. This function yields dictionaries that mirror the tensorizers dict passed to __init__, i.e. the keys will be the same, and the tensors will be the shape expected from the respective tensorizers.

stage is used to determine which data source is used to create batches. If data_source is provided, it is used instead of the configured data_source; this allows setting a different data_source for testing a model.

Passing in load_early = True disables loading all data in memory and using PoolingBatcher, so that we get the first batch as quickly as possible.

cache(numberized_rows, stage)[source]
classmethod from_config(config: pytext.data.data.Data.Config, schema: Dict[str, Type[CT_co]], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], rank=0, world_size=1, init_tensorizers=True, **kwargs)[source]
numberize_rows(rows)[source]
class pytext.data.data.PoolingBatcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=1000, num_shuffled_pools=1)[source]

Bases: pytext.data.data.Batcher

Batcher that shuffles and (if requested) sorts data.

Rationale

There is a trade-off between having batches of data that are truly randomly shuffled, and batches of data that are efficiently padded. If we wanted to maximise the efficiency of padding (i.e. minimise the amount of padding that is needed), we would have to enforce that all inputs of a similar length appear in the same batch. This however would lead to a dramatic decrease in the randomness of batches. On the other end of the spectrum, if we wanted to maximise randomness, we would often end up with inputs of wildly different lengths in the same batch, which would lead to a lot of padding.

Operation

This batcher uses a multi-staged approach.

  1. It first loads a number of “pools” of data, and shuffles them (this is controlled by num_shuffled_pools).
  2. It then splits up the shuffled data sequentially into individual pools, and the examples within each pool are sorted (if requested).
  3. Finally, each pool is split up sequentially into batches, and yielded. If sorting was requested in step #2, the order in which the batches are yielded is randomised.

The size of a pool is expressed as a multiple of the batch size, and is controlled by pool_num_batches.

Examples

Assuming sorting is enabled, with the default settings of pool_num_batches: 1000 and num_shuffled_pools: 1, a pool of 1k * batch_size examples is loaded, sorted by length, and split up into 1k batches. These batches are then yielded in random order. Once they run out, a new pool is loaded, and the process is repeated. An advantage of this approach is that padding will be somewhat reduced. A disadvantage is that, for every epoch, the first 1k batches will be always the same (albeit in a different order).

On the other hand, specifying pool_num_batches: 1000 and num_shuffled_pools: 1000 would achieve the following: 1k * 1k * batch_size examples are loaded, and shuffled. These are then split up into pools of size 1k * batch_size, which are then sorted internally, split into individual batches, and yielded in random order. Compared to the previous example, we no longer have the problem that the first 1k batches are always the same in each epoch, but we’ve had to load in memory 1M examples.
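
The pool-based procedure above can be condensed into the following sketch (illustrative only; the real batcher streams pools from an iterator rather than slicing a list):

import random

def pooling_batches(rows, batch_size, pool_num_batches, num_shuffled_pools, sort_key=None):
    pool_size = batch_size * pool_num_batches
    mega_pool_size = pool_size * num_shuffled_pools
    for start in range(0, len(rows), mega_pool_size):
        mega_pool = rows[start:start + mega_pool_size]
        random.shuffle(mega_pool)                         # 1. shuffle the loaded pools
        for p in range(0, len(mega_pool), pool_size):     # 2. split into individual pools
            pool = mega_pool[p:p + pool_size]
            if sort_key is not None:
                pool.sort(key=sort_key)                   # 3. sort within the pool
            batches = [pool[b:b + batch_size] for b in range(0, len(pool), batch_size)]
            if sort_key is not None:
                random.shuffle(batches)                   # 4. randomize the batch order
            yield from batches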

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

From an iterable of dicts, yield dicts of lists:

  1. Load num_shuffled_pools pools of data, and shuffle them.
  2. Load a pool (batch_size * pool_num_batches examples).
  3. Sort rows, if necessary.
  4. Shuffle the order in which the batches are returned, if necessary.
classmethod from_config(config: pytext.data.data.PoolingBatcher.Config)[source]
get_batch_size(stage: pytext.common.constants.Stage) → int[source]
class pytext.data.data.RowData(raw_data, numberized)[source]

Bases: tuple

numberized

Alias for field number 1

raw_data

Alias for field number 0

pytext.data.data.generator_iterator(fn)[source]

Turn a generator into a GeneratorIterator-wrapped function. Effectively this allows iterating over a generator multiple times by recording the call arguments and calling the generator anew with them each time __iter__ is called on the returned object.
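
The mechanism is essentially the following sketch (hypothetical names, not the PyText code): the wrapper stores the call arguments and re-invokes the generator function on every __iter__.

import functools

class ReusableGenerator:
    """Allows iterating a generator function's output multiple times."""

    def __init__(self, gen_fn, *args, **kwargs):
        self.gen_fn, self.args, self.kwargs = gen_fn, args, kwargs

    def __iter__(self):
        return self.gen_fn(*self.args, **self.kwargs)  # a fresh generator each time

def reusable(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return ReusableGenerator(fn, *args, **kwargs)
    return wrapper

@reusable
def squares(n):
    for i in range(n):
        yield i * i

s = squares(3)
print(list(s), list(s))  # [0, 1, 4] [0, 1, 4] -- the result is iterable twice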

pytext.data.data.pad_and_tensorize_batches(tensorizers, batches)[source]
pytext.data.data.zip_dicts(dicts)[source]

pytext.data.data_handler module

class pytext.data.data_handler.BatchIterator(batches, processor, include_input=True, include_target=True, include_context=True, is_train=True, num_batches=0)[source]

Bases: object

BatchIterator is a wrapper of TorchText.Iterator that provides the flexibility to map batched data to a tuple of (input, target, context) and to perform additional steps such as dealing with distributed training.

Parameters:
  • batches (Iterator[TorchText.Batch]) – iterator of TorchText.Batch, which shuffles/batches the data in __iter__ and returns a batch of data in __next__
  • processor – function to run after getting batched data from TorchText.Iterator; the function should define a way to map the data into (input, target, context)
  • include_input (bool) – if input data should be returned, default is true
  • include_target (bool) – if target data should be returned, default is true
  • include_context (bool) – if context data should be returned, default is true
  • is_train (bool) – if the batch data is for training
  • num_batches (int) – total batches to generate; this param is for distributed training. Due to a limitation in PyTorch’s distributed training backend that requires all parallel workers to have the same number of batches, we work around it by adding dummy batches at the end
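
One way to picture that dummy-batch padding (a sketch under the assumption that earlier batches are recycled as fillers; not the actual BatchIterator code):

import itertools

def pad_to_num_batches(batches, num_batches: int):
    """Yield exactly num_batches batches so every distributed worker runs the same
    number of steps, recycling this worker's own batches as dummy fillers."""
    produced = list(itertools.islice(batches, num_batches))
    fillers = itertools.cycle(list(produced)) if produced else iter(())
    while len(produced) < num_batches:
        produced.append(next(fillers))
    return produced

print(pad_to_num_batches(iter(["b1", "b2", "b3"]), num_batches=5))
# ['b1', 'b2', 'b3', 'b1', 'b2']
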
class pytext.data.data_handler.CommonMetadata[source]

Bases: object

class pytext.data.data_handler.DataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, column_mapping: Dict[str, str] = None, **kwargs)[source]

Bases: pytext.config.component.Component

DataHandler is the central place to prepare data for model training/testing. The class is responsible for:

  • Define the pipeline to process data and generate batches of tensors to be consumed by the model. Each batch is a (input, target, extra_data) tuple, in which input can be fed directly into the model.
  • Initialize global context, such as building the vocab and loading pretrained embeddings. Store the context as metadata, and provide functions to serialize/deserialize the metadata.

The data processing pipeline contains the following steps:

  • Read data from file into a list of raw data examples
  • Convert each row of raw data to a TorchText Example. This logic happens in the process_row function and will:
    • Invoke the featurizer, which contains data processing steps to apply at both training and inference time, e.g. tokenization
    • Use the raw data and the results from the featurizer to do any preprocessing
  • Generate a TorchText.Dataset that contains the list of Examples; the Dataset also has a list of TorchText.Field, which defines how to do padding and numericalization while batching data.
  • Return a BatchIterator which will give a tuple of (input, target, context) tensors for each iteration. By default the tensors have a 1:1 mapping to the TorchText.Field fields, but this behavior can be overridden by the _input_from_batch, _target_from_batch, _context_from_batch functions.
raw_columns

columns to read from data source. The order should match the data stored in that file.

Type:List[str]
featurizer

perform data preprocessing that should be shared between training and inference

Type:Featurizer
features

a dict of name -> field used to process data as model input

Type:Dict[str, Field]
labels

a dict of name -> field used to process data as the training target

Type:Dict[str, Field]
extra_fields

fields that process any extra data used neither as model input nor target. This is None by default

Type:Dict[str, Field]
text_feature_name

name of the text field, used to define the default sort key of data

Type:str
shuffle

if the dataset should be shuffled, true by default

Type:bool
sort_within_batch

if data within same batch should be sorted, true by default

Type:bool
train_path

path of training data file

Type:str
eval_path

path of evaluation data file

Type:str
test_path

path of test data file

Type:str
train_batch_size

training batch size, 128 by default

Type:int
eval_batch_size

evaluation batch size, 128 by default

Type:int
test_batch_size

test batch size, 128 by default

Type:int
max_seq_len

maximum length of tokens to keep in sequence

Type:int
pass_index

if the original index of data in the batch should be passed along to downstream steps, default is true

Type:bool
gen_dataset(data: Iterable[Dict[str, Any]], include_label_fields: bool = True, shard_range: Tuple[int, int] = None) → torchtext.legacy.data.dataset.Dataset[source]

Generate a torchtext Dataset from raw in-memory data.

Returns:dataset (TorchText.Dataset)

gen_dataset_from_path(path: str, rank: int = 0, world_size: int = 1, include_label_fields: bool = True, use_cache: bool = True) → torchtext.legacy.data.dataset.Dataset[source]

Generate a dataset from a file.

Returns:dataset (TorchText.Dataset)

get_eval_iter()[source]
get_predict_iter(data: Iterable[Dict[str, Any]], batch_size: Optional[int] = None)[source]
get_test_iter()[source]
get_test_iter_from_path(test_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]
get_test_iter_from_raw_data(test_data: List[Dict[str, Any]], batch_size: int) → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1)[source]
get_train_iter_from_path(train_path: str, batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]

Generate data batch iterator for training data. See _get_train_iter() for details

Parameters:
  • train_path (str) – file path of training data
  • batch_size (int) – batch size
  • rank (int) – used for distributed training, the rank of the current GPU; don’t set it to anything but 0 for non-distributed training
  • world_size (int) – used for distributed training, the total number of GPUs
get_train_iter_from_raw_data(train_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]
init_feature_metadata(train_data: torchtext.legacy.data.dataset.Dataset, eval_data: torchtext.legacy.data.dataset.Dataset, test_data: torchtext.legacy.data.dataset.Dataset)[source]
init_metadata()[source]

Initialize metadata using data from configured path

init_metadata_from_path(train_path, eval_path, test_path)[source]

Initialize metadata using data from file

init_metadata_from_raw_data(*data)[source]

Initialize metadata using in memory data

init_target_metadata(train_data: torchtext.legacy.data.dataset.Dataset, eval_data: torchtext.legacy.data.dataset.Dataset, test_data: torchtext.legacy.data.dataset.Dataset)[source]
load_metadata(metadata: pytext.data.data_handler.CommonMetadata)[source]

Load previously saved metadata

load_vocab(vocab_file, vocab_size, lowercase_tokens: bool = False)[source]

Loads items into a set from a file containing one item per line. Items are added to the set from the top of the file to the bottom, so the items in the file should be ordered by preference (if any); e.g., it makes sense to order tokens in descending order of frequency in the corpus.

Parameters:
  • vocab_file (str) – vocab file to load
  • vocab_size (int) – maximum tokens to load, will only load the first n if the actual vocab size is larger than this parameter
  • lowercase_tokens (bool) – if the tokens should be lowercased
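
A minimal sketch of that loading logic (not the exact PyText code):

def load_vocab_file(vocab_file: str, vocab_size: int, lowercase_tokens: bool = False):
    """Read at most vocab_size items, one per line, in file order, into a set."""
    vocab = set()
    with open(vocab_file) as f:
        for index, line in enumerate(f):
            if vocab_size > 0 and index >= vocab_size:
                break
            token = line.strip()
            vocab.add(token.lower() if lowercase_tokens else token)
    return vocab
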
metadata_to_save()[source]

Save metadata; pretrained_embeds_weight should be excluded

preprocess(data: Iterable[Dict[str, Any]])[source]

Preprocess the raw data to create TorchText.Example; this is the second step in the whole processing pipeline.

Returns:data (Generator[Dict[str, Any]])

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocessing steps for a single input row; subclasses should override this.

read_from_file(file_name: str, columns_to_use: Union[Dict[str, int], List[str]]) → Generator[Dict[KT, VT], None, None][source]

Read data from a CSV file. The input file format is required to have tab-separated columns.

Parameters:
  • file_name (str) – csv file name
  • columns_to_use (Union[Dict[str, int], List[str]]) – either a list of column names or a dict of column name -> column index in the file
sort_key(example: torchtext.legacy.data.example.Example) → Any[source]

How to sort data within every batch; the default behavior is to sort by the length of the input text.

Parameters:example (Example) – one torchtext example

pytext.data.dense_retrieval_tensorizer module

class pytext.data.dense_retrieval_tensorizer.BERTContextTensorizerForDenseRetrieval(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, **kwargs)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizer

Methods numberize() and tensorize() implement https://fburl.com/an4fv7m1.

numberize(row: Dict[KT, VT]) → Tuple[Any, ...][source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model. It works off of one sample.

tensorize(batch)[source]

Works off of a batch that has been numberized.

class pytext.data.dense_retrieval_tensorizer.PositiveLabelTensorizerForDenseRetrieval(label_column: str = 'label', allow_unknown: bool = False, pad_in_vocab: bool = False, label_vocab: Optional[List[str]] = None, label_vocab_file: Optional[str] = None, is_input: bool = False, add_labels: Optional[List[str]] = None)[source]

Bases: pytext.data.tensorizers.LabelTensorizer

numberize(row: Dict[KT, VT])[source]

Numberize labels.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.dense_retrieval_tensorizer.RoBERTaContextTensorizerForDenseRetrieval(columns: List[str] = ['text'], vocab: Optional[pytext.data.utils.Vocabulary] = None, tokenizer: Optional[pytext.data.tokenizers.tokenizer.Tokenizer] = None, max_seq_len: int = 256)[source]

Bases: pytext.data.dense_retrieval_tensorizer.BERTContextTensorizerForDenseRetrieval, pytext.data.roberta_tensorizer.RoBERTaTensorizer

classmethod from_config(config: pytext.data.dense_retrieval_tensorizer.RoBERTaContextTensorizerForDenseRetrieval.Config)[source]

from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes which derive from this class).

pytext.data.disjoint_multitask_data module

class pytext.data.disjoint_multitask_data.DisjointMultitaskData(data_dict: Dict[str, pytext.data.data.Data], samplers: Dict[pytext.common.constants.Stage, pytext.data.batch_sampler.BaseBatchSampler], test_key: str = None, task_key: str = 'task_name')[source]

Bases: pytext.data.data.Data

Wrapper for doing multitask training using multiple data objects. Takes a dictionary of data objects, does round robin over their iterators using BatchSampler.

Parameters:
  • config (Config) – Configuration object of type DisjointMultitaskData.Config.
  • data_dict (Dict[str, Data]) – Data objects to do roundrobin over.
  • *args (type) – Extra arguments to be passed down to sub data handlers.
  • **kwargs (type) – Extra arguments to be passed down to sub data handlers.
data_dict

Data handlers to do roundrobin over.

Type:type
batches(stage: pytext.common.constants.Stage, data_source=None, load_early=False)[source]

Yield batches from each task, sampled according to a given sampler. This batcher additionally exposes a task name in the batch to allow the model to filter examples to the appropriate tasks.

classmethod from_config(config: pytext.data.disjoint_multitask_data.DisjointMultitaskData.Config, data_dict: Dict[str, pytext.data.data.Data], task_key: str = 'task_name', rank=0, world_size=1, init_tensorizers=True)[source]

pytext.data.disjoint_multitask_data_handler module

class pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler(config: pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler.Config, data_handlers: Dict[str, pytext.data.data_handler.DataHandler], target_task_name: Optional[str] = None, *args, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Wrapper for doing multitask training using multiple data handlers. Takes a dictionary of data handlers, does round robin over their iterators using RoundRobinBatchIterator.

Parameters:
  • config (Config) – Configuration object of type DisjointMultitaskDataHandler.Config.
  • data_handlers (Dict[str, DataHandler]) – Data handlers to do roundrobin over.
  • target_task_name (Optional[str]) – Used to select best epoch, and set batch_per_epoch.
  • *args (type) – Extra arguments to be passed down to sub data handlers.
  • **kwargs (type) – Extra arguments to be passed down to sub data handlers.
data_handlers

Data handlers to do roundrobin over.

Type:type
target_task_name

Used to select best epoch, and set batch_per_epoch.

Type:type
upsample

If True, keep cycling over each iterator in round-robin fashion; iterators with fewer batches will get more passes. If False, we do a single pass over each iterator, and the ones which run out sit idle. This is used for evaluation. Default True.

Type:bool
get_eval_iter() → pytext.data.data_handler.BatchIterator[source]
get_test_iter() → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1) → Tuple[pytext.data.data_handler.BatchIterator, ...][source]
init_metadata()[source]

Initialize metadata using data from configured path

load_metadata(metadata)[source]

Load previously saved metadata

metadata_to_save()[source]

Save metadata; pretrained_embeds_weight should be excluded

class pytext.data.disjoint_multitask_data_handler.RoundRobinBatchIterator(iterators: Dict[str, pytext.data.data_handler.BatchIterator], upsample: bool = True, iter_to_set_epoch: Optional[str] = None)[source]

Bases: pytext.data.data_handler.BatchIterator

We take a dictionary of BatchIterators and do round robin over them in a cycle. The following describes the behavior for one epoch, with the example:

Iterator 1: [A, B, C, D], Iterator 2: [a, b]

If upsample is True:

If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.

iter_to_set_epoch = "Iterator 1"
Output: [A, a, B, b, C, a, D, b]

If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.

Output: [A, a, B, b]

If upsample is False:

Iterate over batches from one epoch of each iterator, with the order among iterators uniformly shuffled.

Possible output: [a, A, B, C, b, D]

Parameters:
  • iterators (Dict[str, BatchIterator]) – Iterators to do roundrobin over.
  • upsample (bool) – If True, keep cycling over each iterator in round-robin fashion; iterators with fewer batches will get more passes. If False, we do a single pass over each iterator, in random order. Evaluation will use upsample=False. Default True.
  • iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If upsample is True and this is not set, epoch size defaults to the length of the shortest iterator. If upsample is False, this argument is not used.
iterators

Iterators to do roundrobin over.

Type:Dict[str, BatchIterator]
upsample

Whether to upsample iterators with fewer batches.

Type:bool
iter_to_set_epoch

Name of iterator to define epoch size.

Type:str
classmethod cycle(iterator)[source]

pytext.data.dynamic_pooling_batcher module

class pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig(**kwargs)[source]

Bases: pytext.config.module_config.Module.Config

end_batch_size = 256
epoch_period = 10
start_batch_size = 32
step_size = 1
class pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=1000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]

Bases: pytext.data.data.PoolingBatcher

Allows dynamic batch training; extends the pooling batcher with a scheduler config, which specifies how the batch size should increase.

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

From an iterable of dicts, yield dicts of lists:

  1. Load num_shuffled_pools pools of data, and shuffle them.
  2. Load a pool (batch_size * pool_num_batches examples).
  3. Sort rows, if necessary.
  4. Shuffle the order in which the batches are returned, if necessary.
compute_dynamic_batch_size(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig, curr_steps: int) → int[source]
finished_dynamic() → bool[source]
classmethod from_config(config: pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher.Config)[source]
get_batch_size(stage: pytext.common.constants.Stage) → int[source]
step_epoch()[source]
class pytext.data.dynamic_pooling_batcher.ExponentialBatcherSchedulerConfig(**kwargs)[source]

Bases: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig

gamma = 5
class pytext.data.dynamic_pooling_batcher.ExponentialDynamicPoolingBatcher(*args, **kwargs)[source]

Bases: pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher

Exponential Dynamic Batch Scheduler: scales up batch size by a factor of gamma

compute_dynamic_batch_size(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.ExponentialBatcherSchedulerConfig, curr_steps: int) → int[source]
finished_dynamic() → bool[source]
get_max_steps()[source]
class pytext.data.dynamic_pooling_batcher.LinearDynamicPoolingBatcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=1000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]

Bases: pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher

Linear Dynamic Batch Scheduler: scales up batch size linearly

compute_dynamic_batch_size(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig, curr_steps: int) → int[source]
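
The two scheduling flavors can be contrasted with a small sketch built from the config fields above (start_batch_size, end_batch_size, epoch_period, step_size, gamma); this illustrates the scaling idea only and is not the exact PyText formula:

def linear_batch_size(epoch, start=32, end=256, epoch_period=10, step_size=1):
    # Grow linearly from `start` to `end` over `epoch_period` epochs,
    # updating once every `step_size` epochs.
    steps_taken = min(epoch, epoch_period) // step_size
    total_steps = max(epoch_period // step_size, 1)
    return int(min(start + (end - start) * steps_taken / total_steps, end))

def exponential_batch_size(epoch, start=32, end=256, epoch_period=10, gamma=5):
    # Multiply by `gamma` once per `epoch_period` epochs, capped at `end`.
    return int(min(start * gamma ** (epoch // epoch_period), end))

for epoch in (0, 5, 10, 20):
    print(epoch, linear_batch_size(epoch), exponential_batch_size(epoch))
# 0 32 32 / 5 144 32 / 10 256 160 / 20 256 256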

pytext.data.masked_tensorizer module

class pytext.data.masked_tensorizer.MaskedTokenTensorizer(text_column, mask, tokenizer=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, max_seq_len=None, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]

Bases: pytext.data.tensorizers.TokenTensorizer

classmethod from_config(config: pytext.data.masked_tensorizer.MaskedTokenTensorizer.Config)[source]
mask_and_tensorize(batch)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

pytext.data.masked_util module

class pytext.data.masked_util.MaskEverything(use_bos, use_eos)[source]

Bases: pytext.data.masked_util.MaskingFunction

gen_masked_source_target(tokens, vocab: pytext.data.utils.Vocabulary)[source]
gen_masked_tree(node, mask_token, depth=1)[source]
class pytext.data.masked_util.MaskedVocabBuilder(delimiter=' ')[source]

Bases: pytext.data.utils.VocabBuilder

class pytext.data.masked_util.MaskingFunction(use_bos, use_eos)[source]

Bases: pytext.config.component.Component

classmethod from_config(config, use_bos, use_eos)[source]
gen_masked_source_target(tokens, *args, **kwargs)[source]
should_mask(*args, **kwargs) → bool[source]
class pytext.data.masked_util.NoOpMaskingFunction(seed: Optional[int], minimum_masks: int, use_bos: bool, use_eos: bool)[source]

Bases: pytext.data.masked_util.MaskingFunction

classmethod from_config(config: pytext.data.masked_util.NoOpMaskingFunction.Config, use_bos: bool, use_eos: bool)[source]
gen_masked_source_target(tokens: List[int], vocab: pytext.data.utils.Vocabulary)[source]
class pytext.data.masked_util.RandomizedMaskingFunction(seed: Optional[int], minimum_masks: int, use_bos: bool, use_eos: bool)[source]

Bases: pytext.data.masked_util.MaskingFunction

classmethod from_config(config: pytext.data.masked_util.RandomizedMaskingFunction.Config, use_bos: bool, use_eos: bool)[source]
gen_masked_source_target(tokens: List[int], vocab: pytext.data.utils.Vocabulary)[source]
class pytext.data.masked_util.TreeMask(accept_flat_intents_slots, factor, use_bos, use_eos)[source]

Bases: pytext.data.masked_util.MaskingFunction

clean_eos_bos(tokens)[source]
classmethod from_config(config, use_bos, use_eos)[source]
gen_masked_source_target(tokens: List[int], vocab: pytext.data.utils.Vocabulary)[source]
gen_masked_tree(node, mask_token, depth=1)[source]
should_mask(depth=1)[source]

pytext.data.packed_lm_data module

class pytext.data.packed_lm_data.PackedLMData(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, max_seq_len: int = 128, sort_key: Optional[str] = None, language: Optional[str] = None, in_memory: Optional[bool] = False, init_tensorizers: Optional[bool] = True)[source]

Bases: pytext.data.data.Data

Special purpose Data object which assumes a single text tensorizer. Packs tokens into a square batch with no padding. Used for LM training. The object also takes in an optional language argument which is used for cross-lingual LM training.

classmethod from_config(config: pytext.data.packed_lm_data.PackedLMData.Config, schema: Dict[str, Type[CT_co]], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], language: Optional[str] = None, rank: int = 0, world_size: int = 1, init_tensorizers: Optional[bool] = True)[source]
numberize_rows(rows)[source]

pytext.data.roberta_tensorizer module

class pytext.data.roberta_tensorizer.RoBERTaTensorizer(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, **kwargs)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizerBase

classmethod from_config(config: pytext.data.roberta_tensorizer.RoBERTaTensorizer.Config, **kwargs)[source]
class pytext.data.roberta_tensorizer.RoBERTaTokenLevelTensorizer(columns, tokenizer=None, vocab=None, max_seq_len=256, labels_columns=['label'], labels=[])[source]

Bases: pytext.data.roberta_tensorizer.RoBERTaTensorizer

Tensorizer for token-level classification tasks such as NER, POS tagging, etc. using RoBERTa. Here each token has an associated label and the tensorizer outputs a label tensor as well. The input for this tensorizer comes from the CoNLLUNERDataSource data source.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.roberta_tensorizer.RoBERTaTokenLevelTensorizer.Config)[source]
numberize(row: Dict[KT, VT]) → Tuple[Any, ...][source]

Numberize both the tokens and labels. Since we break up tokens, the label for anything other than the first sub-word is assigned the padding idx.

tensorize(batch) → Tuple[torch.Tensor, ...][source]

Convert instance level vectors into batch level tensors.

torchscriptify()[source]

pytext.data.squad_for_bert_tensorizer module

class pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer(answers_column: str = 'answers', answer_starts_column: str = 'answer_starts', **kwargs)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizer

Produces BERT inputs and answer spans for Squad.

SPAN_PAD_IDX = -100
classmethod from_config(config: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer.Config, **kwargs)[source]

from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes which derive from this class).

numberize(row)[source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

tensorize(batch)[source]

Convert instance level vectors into batch level tensors.

class pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizerForKD(start_logits_column='start_logits', end_logits_column='end_logits', has_answer_logits_column='has_answer_logits', pad_mask_column='pad_mask', segment_labels_column='segment_labels', **kwargs)[source]

Bases: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer

classmethod from_config(config: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizerForKD.Config, **kwargs)[source]

from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes which derive from this class).

numberize(row)[source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

tensorize(batch)[source]

Convert instance level vectors into batch level tensors.

class pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizer(answers_column: str = 'answers', answer_starts_column: str = 'answer_starts', **kwargs)[source]

Bases: pytext.data.roberta_tensorizer.RoBERTaTensorizer, pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer

Produces RoBERTa inputs and answer spans for Squad.

classmethod from_config(config: pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizer.Config, **kwargs)[source]

from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes which derive from this class).

torchscriptify()[source]
class pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizerForKD(start_logits_column='start_logits', end_logits_column='end_logits', has_answer_logits_column='has_answer_logits', pad_mask_column='pad_mask', segment_labels_column='segment_labels', **kwargs)[source]

Bases: pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizer

classmethod from_config(config: pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizerForKD.Config, **kwargs)[source]

from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes which derive from this class).

numberize(row)[source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

tensorize(batch)[source]

Convert instance level vectors into batch level tensors.

pytext.data.squad_tensorizer module

class pytext.data.squad_tensorizer.SquadTensorizer(doc_tensorizer: pytext.data.tensorizers.TokenTensorizer, ques_tensorizer: pytext.data.tensorizers.TokenTensorizer, doc_column: str = 'doc', ques_column: str = 'question', answers_column: str = 'answers', answer_starts_column: str = 'answer_starts', **kwargs)[source]

Bases: pytext.data.tensorizers.TokenTensorizer

Produces inputs and answer spans for Squad.

SPAN_PAD_IDX = -100
classmethod from_config(config: pytext.data.squad_tensorizer.SquadTensorizer.Config, **kwargs)[source]
initialize(vocab_builder=None, from_scratch=True)[source]

Build vocabulary based on training corpus.

numberize(row)[source]

Tokenize, look up in vocabulary.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.squad_tensorizer.SquadTensorizerForKD(start_logits_column='start_logits', end_logits_column='end_logits', has_answer_logits_column='has_answer_logits', pad_mask_column='pad_mask', segment_labels_column='segment_labels', **kwargs)[source]

Bases: pytext.data.squad_tensorizer.SquadTensorizer

classmethod from_config(config: pytext.data.squad_tensorizer.SquadTensorizerForKD.Config, **kwargs)[source]
numberize(row)[source]

Tokenize, look up in vocabulary.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

pytext.data.tensorizers module

class pytext.data.tensorizers.AnnotationNumberizer(column: str = 'seqlogical', vocab=None, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Not really a Tensorizer (since it does not create tensors), but it technically serves the same function. This class parses Annotations in the format below and extracts the actions (of type List[List[int]])

[IN:GET_ESTIMATED_DURATION How long will it take to [SL:METHOD_TRAVEL
drive ] from [SL:SOURCE Chicago ] to [SL:DESTINATION Mississippi ] ]

The extraction algorithm is handled by the Annotation class. We only care about the list of actions, which before vocab index lookups would look like:

[
    IN:GET_ESTIMATED_DURATION, SHIFT, SHIFT, SHIFT, SHIFT, SHIFT, SHIFT,
    SL:METHOD_TRAVEL, SHIFT, REDUCE,
    SHIFT,
    SL:SOURCE, SHIFT, REDUCE,
    SHIFT,
    SL:DESTINATION, SHIFT, REDUCE,
]
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.AnnotationNumberizer.Config)[source]
initialize(vocab_builder=None, from_scratch=True)[source]

Build vocabulary based on training corpus.

numberize(row)[source]

Tokenize, look up in vocabulary.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.ByteTensorizer(text_column, lower=True, max_seq_len=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, is_input=True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Turn characters into a sequence of int8 bytes. One character will have one or more bytes depending on its encoding.

NUM = 256
PAD_BYTE = 0
UNK_BYTE = 0
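
For instance, the byte expansion of a short string under UTF-8 looks like this (a plain-Python illustration, not the tensorizer itself):

text = "café"
byte_values = list(text.encode("utf-8"))
print(byte_values)                   # [99, 97, 102, 195, 169] -- 'é' needs two bytes
print(len(text), len(byte_values))   # 4 characters, 5 bytes
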
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.ByteTensorizer.Config)[source]
numberize(row)[source]

Convert text to characters.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.ByteTokenTensorizer(text_column, tokenizer=None, max_seq_len=None, max_byte_len=15, offset_for_non_padding=0, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, is_input=True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Turn words into 2-dimensional tensors of int8 bytes. Words are padded to max_byte_len. Also computes sequence lengths (1-D tensor) and token lengths (2-D tensor). 0 is the pad byte.

NUM_BYTES = 256
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.ByteTokenTensorizer.Config)[source]
numberize(row)[source]

Convert text to bytes, pad batch.

sort_key(row)[source]
tensorize(batch, pad_token=0)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.CharacterTokenTensorizer(max_char_length: int = 20, **kwargs)[source]

Bases: pytext.data.tensorizers.TokenTensorizer

Turn words into 2-dimensional tensors of ints based on their ascii values. Words are padded to the maximum word length (also capped at max_char_length). Sequence lengths are the length of each token, 0 for pad token.

initialize(from_scratch=True)

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.

numberize(row)[source]

Convert text to characters, pad batch.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.CharacterVocabTokenTensorizer(text_column, tokenizer=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, max_seq_len=None, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Turn words into 2-dimensional tensors of ints based on the char vocab. Words are padded to the maximum word length (also capped at max_char_length). Sequence lengths are the length of each token.

The difference from pytext.data.tensorizers.CharacterTokenTensorizer is that CharacterTokenTensorizer uses the ascii value and does not require building a vocab. Here we tensorize based on the vocab.

character_tokenize(tokens: List[pytext.data.tokenizers.tokenizer.Token])[source]
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.CharacterVocabTokenTensorizer.Config)[source]
initialize(vocab_builder=None, from_scratch=True)[source]

Build vocabulary based on training corpus.

numberize(row)[source]

Tokenize, look up in vocabulary.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

tensorizer_script_impl = None
class pytext.data.tensorizers.CharacterVocabTokenTensorizerScriptImpl(add_bos_token: bool, add_eos_token: bool, use_eos_token_for_bos: bool, max_seq_len: int, vocab: pytext.data.utils.Vocabulary, tokenizer: Optional[pytext.data.tokenizers.tokenizer.Tokenizer])[source]

Bases: pytext.data.tensorizers.TensorizerScriptImpl

forward(inputs: pytext.torchscript.utils.ScriptBatchInput) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_texts_by_index(texts: Optional[List[List[str]]], index: int) → Optional[str][source]
get_tokens_by_index(tokens: Optional[List[List[List[str]]]], index: int) → Optional[List[str]][source]
numberize(char_tokens: List[List[str]], char_tokens_lengths: List[int]) → Tuple[List[List[int]], List[int]][source]

This function will receive the outputs from the tokenize() function, or will be called directly from the PyText Tensorizer’s numberize() function.

Override this function to be TorchScriptable, e.g. you need to declare concrete input arguments with type hints.

tensorize(tokens: List[List[List[int]]], tokens_lengths: List[List[int]]) → Tuple[torch.Tensor, torch.Tensor][source]

This function will receive a list (e.g. a batch) of outputs from the numberize() function, pad them, and convert them to output tensors.

Override this function to be TorchScriptable, e.g. you need to declare concrete input arguments with type hints.

tokenize(row_text: Optional[str] = None, row_pre_tokenized: Optional[List[str]] = None) → Tuple[List[List[str]], List[int]][source]

This function will receive the inputs from clients. Usually there are two possible inputs: 1) a row of texts: List[str]; 2) a row of pre-processed tokens: List[List[str]].

Override this function to be TorchScriptable, e.g. you need to declare concrete input arguments with type hints.

class pytext.data.tensorizers.Float1DListTensorizer(config: pytext.data.tensorizers.Float1DListTensorizer.Config, **kwargs)[source]

Bases: pytext.data.tensorizers.Tensorizer

Tensorizes a 1d list of floats – List[float]. TODO: Even though very similar, FloatListTensorizer currently does not support this vanilla case for tensorization of List[float]. In the future, if FloatListTensorizer accommodates this case, we will not need this separate tensorizer.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.Float1DListTensorizer.Config, **kwargs)[source]
initialize(from_scratch=True)[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.

numberize(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

tensorizer_script_impl = None
class pytext.data.tensorizers.FloatListSeqTensorizer(column: str, error_check: bool, dim: Optional[int], pad_token: float = -1.0, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize numeric labels.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.FloatListSeqTensorizer.Config)[source]
numberize(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

tensorizer_script_impl = None
class pytext.data.tensorizers.FloatListTensorizer(column: str, error_check: bool, dim: Optional[int], normalize: bool, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize numeric labels.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.FloatListTensorizer.Config)[source]
initialize()[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.

numberize(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.FloatTensorizer(column: str, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

A tensorizer for reading in scalars from the data.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.FloatTensorizer.Config)[source]
numberize(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.GazetteerTensorizer(text_column: str = 'text', dict_column: str = 'dict', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Create 3 tensors for dict features.

  • idx: index of feature in token order.
  • weights: weight of feature in token order.
  • lens: number of features per token.

For each input token, there will be the same number of idx and weights entries (equal to the max number of features any token has in this row). The values in lens tell how many of these features are actually used per token.

The input format for the dict column is JSON: a list of dictionaries containing the "features" and their weights for each relevant "tokenIdx". Example:

text: "Order coffee from Starbucks please"
dict: [
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}}
]

if we assume this vocab

vocab = {
    UNK: 0, PAD: 1,
    "drink/beverage": 2, "music/song": 3, "store/coffee_shop": 4
}

this example will result in those tensors:

idx =     [1,   1,   2,   3,   1,   1,   4,   1,   1,   1]
weights = [0.0, 0.0, 0.8, 0.2, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
lens =    [1,        2,        1,        1,        1]
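
For illustration, here is a small standalone sketch (not PyText's internal code) of how the dict column above could be expanded into the idx / weights / lens lists following this description; the variable names are assumptions made for the example only.

import json

vocab = {"UNK": 0, "PAD": 1,
         "drink/beverage": 2, "music/song": 3, "store/coffee_shop": 4}
tokens = "Order coffee from Starbucks please".split()
dict_feats = json.loads(
    '[{"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},'
    ' {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}}]'
)

per_token = {d["tokenIdx"]: d["features"] for d in dict_feats}
max_feats = max(len(f) for f in per_token.values())   # widest token sets the padding

idx, weights, lens = [], [], []
for i in range(len(tokens)):
    feats = list(per_token.get(i, {}).items())
    lens.append(len(feats) or 1)                       # every token has at least one entry
    feats += [("PAD", 0.0)] * (max_feats - len(feats))
    idx.extend(vocab[name] for name, _ in feats)
    weights.extend(w for _, w in feats)

# idx == [1, 1, 2, 3, 1, 1, 4, 1, 1, 1]; weights and lens match the example above
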
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.GazetteerTensorizer.Config)[source]
initialize(from_scratch=True)[source]

Look through the dataset for all dict features to create vocab.

numberize(row)[source]

Numberize dict features. Tokens with no features are filled in with PAD and weight 0.0; all tokens need to have at least one entry. Tokens with more than one feature will have multiple idx and weight entries added in sequence.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.Integer1DListTensorizer(config: pytext.data.tensorizers.Integer1DListTensorizer.Config, **kwargs)[source]

Bases: pytext.data.tensorizers.Tensorizer

Tensorizes a 1-d list of integers (List[int]).

SPAN_PAD_IDX = 0
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.Integer1DListTensorizer.Config, **kwargs)[source]
initialize(from_scratch=True)[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.

numberize(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

tensorizer_script_impl = None
class pytext.data.tensorizers.LabelListRankTensorizer(*args, pad_missing: bool = False, **kwargs)[source]

Bases: pytext.data.tensorizers.LabelTensorizer

LabelListRankTensorizer takes a list of [label, rank] pairs, e.g. [[labelA, rankA], [labelB, rankB], …], as input and generates a tuple of tensors (label_idx, list_length). Example: input ['["weather","1"]', '["business","1"]'] with vocab {"timer", "weather", "business"} produces an output of size len(vocab): [0, 1, 1], suggesting both labels are of equal rank.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.LabelListRankTensorizer.Config)[source]
initialize(from_scratch=True)[source]

Look through the dataset for all labels and create a vocab map for them.

numberize(row)[source]

Numberize labels.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.LabelListTensorizer(*args, pad_missing: bool = False, **kwargs)[source]

Bases: pytext.data.tensorizers.LabelTensorizer

LabelListTensorizer takes a list of labels as input and generates a tuple of tensors (label_idx, list_length).

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.LabelListTensorizer.Config)[source]
numberize(row)[source]

Numberize labels.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.LabelTensorizer(label_column: str = 'label', allow_unknown: bool = False, pad_in_vocab: bool = False, label_vocab: Optional[List[str]] = None, label_vocab_file: Optional[str] = None, is_input: bool = False, add_labels: Optional[List[str]] = None)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize labels. Label can be used as either input or target.

NB: if the labels are used as targets for binary classification with a loss such as cosine distance, the order of the label_vocab does matter, and it should be [negative_class, positive_class].

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.LabelTensorizer.Config)[source]
initialize(from_scratch=True)[source]

Look through the dataset for all labels and create a vocab map for them.

numberize(row)[source]

Numberize labels.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.MetricTensorizer(names: List[str], indexes: List[int], is_input: bool = False)[source]

Bases: pytext.data.tensorizers.Tensorizer

A tensorizer which uses other tensorizers' numberized data. Used mostly for metric reporting.

classmethod from_config(config: pytext.data.tensorizers.MetricTensorizer.Config)[source]
numberize(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.NtokensTensorizer(names: List[str], indexes: List[int], is_input: bool = False)[source]

Bases: pytext.data.tensorizers.MetricTensorizer

A tensorizer which references another tensorizer's numberized data to calculate the number of tokens. Used for calculating tokens per second.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.NumericLabelTensorizer(label_column: str = 'label', rescale_range: Optional[List[float]] = None, is_input: bool = False)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize numeric labels.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.NumericLabelTensorizer.Config)[source]
numberize(row)[source]

Numberize labels.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.SeqTokenTensorizer(column: str = 'text_seq', tokenizer=None, add_bos_token: bool = False, add_eos_token: bool = False, use_eos_token_for_bos: bool = False, add_bol_token: bool = False, add_eol_token: bool = False, use_eol_token_for_bol: bool = False, max_seq_len=None, vocab=None, is_input: bool = True, max_turn=50)[source]

Bases: pytext.data.tensorizers.Tensorizer

Tensorize a sequence of sentences. The input is a list of strings, like this one:

["where do you wanna meet?", "MPK"]

if we assume this vocab

vocab  {
  UNK: 0, PAD: 1,
  'where': 2, 'do': 3, 'you': 4, 'wanna': 5, 'meet?': 6, 'mpk': 7
}

this example will result in those tensors:

idx = [[2, 3, 4, 5, 6], [7, 1, 1, 1, 1]]
sentence_len = [5, 1]
seq_len = [2]

If you’re using BOS, EOS, BOL and EOL, the vocab will look like this

vocab  {
  UNK: 0, PAD: 1,  BOS: 2, EOS: 3, BOL: 4, EOL: 5
  'where': 6, 'do': 7, 'you': 8, 'wanna': 9, 'meet?': 10, 'mpk': 11
}

this example will result in those tensors:

idx = [
    [2,  4, 3, 1, 1,  1, 1],
    [2,  6, 7, 8, 9, 10, 3],
    [2, 11, 3, 1, 1,  1, 1],
    [2,  5, 3, 1, 1,  1, 1]
]
sentence_len = [3, 7, 3, 3]
seq_len = [4]
column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.SeqTokenTensorizer.Config)[source]
initialize(vocab_builder=None, from_scratch=True)[source]

Build vocabulary based on training corpus.

numberize(row)[source]

Tokenize, look up in vocabulary.

prepare_input(row)[source]

Tokenize, return tokenized_texts in raw text

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.SlotLabelTensorizer(slot_column: str = 'slots', text_column: str = 'text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, allow_unknown: bool = False, is_input: bool = False)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize word/slot labels.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.SlotLabelTensorizer.Config)[source]
initialize(from_scratch=True)[source]

Look through the dataset for all labels and create a vocab map for them.

numberize(row)[source]

Turn slot labels and text into a list of token labels with the same length as the number of tokens in the text.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.SlotLabelTensorizerExpansible(slot_column: str = 'slots', text_column: str = 'text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, allow_unknown: bool = False, is_input: bool = False)[source]

Bases: pytext.data.tensorizers.SlotLabelTensorizer

Create a base SlotLabelTensorizer to support selecting different types in ModelInput.

class pytext.data.tensorizers.SoftLabelTensorizer(label_column: str = 'label', allow_unknown: bool = False, pad_in_vocab: bool = False, label_vocab: Optional[List[str]] = None, probs_column: str = 'target_probs', logits_column: str = 'target_logits', labels_column: str = 'target_labels', label_vocab_file: Optional[str] = None, is_input: bool = False)[source]

Bases: pytext.data.tensorizers.LabelTensorizer

Handles numberizing labels for knowledge distillation. This still requires the same label column as LabelTensorizer for the “true” label, but also processes soft “probabilistic” labels generated from a teacher model, via three new columns.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.SoftLabelTensorizer.Config)[source]
numberize(row)[source]

Numberize hard and soft labels

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.String2DListTensorizer(column, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]

Bases: pytext.data.tensorizers.Tensorizer

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.String2DListTensorizer.Config)[source]
initialize(from_scratch=True)[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.

numberize(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

tensorizer_script_impl = None
class pytext.data.tensorizers.String2DListTensorizerScriptImpl(vocab: pytext.data.utils.Vocabulary)[source]

Bases: pytext.data.tensorizers.TensorizerScriptImpl

forward(inputs: List[List[List[str]]]) → Tuple[torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

numberize(tokens: List[List[str]]) → Tuple[List[List[int]], List[int], int][source]

This function will receive the outputs from tokenize(), or will be called directly from the PyText Tensorizer's numberize() function.

Override this function to make it TorchScriptable, i.e. you need to declare concrete input arguments with type hints.

tensorize(tokens_3d: List[List[List[int]]], seq_lens_2d: List[List[int]], seq_lens_1d: List[int]) → Tuple[torch.Tensor, torch.Tensor][source]

This function will receive a list (e.g. a batch) of outputs from numberize(), pad them, and convert them to output tensors.

Override this function to make it TorchScriptable, i.e. you need to declare concrete input arguments with type hints.

class pytext.data.tensorizers.Tensorizer(is_input: bool = True)[source]

Bases: pytext.config.component.Component

Tensorizers are a component that converts from batches of pytext.data.type.DataType instances to tensors. These tensors will eventually be inputs to the model, but the model is aware of the tensorizers and can arrange the tensors they create to conform to its needs.

Tensorizers have an initialize function. This function allows the tensorizer to read through the training dataset to build up any data that it needs for creating the model. Commonly this is valuable for things like inferring a vocabulary from the training set, or learning the entire set of training labels, or slot labels, etc.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.Tensorizer.Config)[source]
initialize(from_scratch=True)[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.
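
As an illustration only (not WordTokenizer.initialize itself), a self-contained coroutine in the shape sketched above might count tokens into a vocabulary list as follows; the class and column names here are hypothetical.

from collections import Counter

class ToyVocabTensorizer:
    def __init__(self, column="text"):
        self.column = column
        self.vocab = None

    def initialize(self, from_scratch=True):
        counter = Counter()
        try:
            while True:
                row = yield                      # the driver sends one row dict at a time
                counter.update(row[self.column].split())
        except GeneratorExit:
            # finalize: freeze the counts into a vocabulary list
            self.vocab = [token for token, _ in counter.most_common()]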

numberize(row)[source]
prepare_input(row)[source]

Return preprocessed input tensors/blob for caffe2 prediction net.

sort_key(row)[source]
stringify(token_indices)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

tensorizer_script_impl = None
torchscriptify()[source]
class pytext.data.tensorizers.TensorizerScriptImpl[source]

Bases: torch.nn.modules.module.Module

batch_size(inputs: pytext.torchscript.utils.ScriptBatchInput) → int[source]
get_texts_by_index(texts: Optional[List[List[str]]], index: int) → Optional[List[str]][source]
get_tokens_by_index(tokens: Optional[List[List[List[str]]]], index: int) → Optional[List[List[str]]][source]
numberize(*args, **kwargs)[source]

This function will receive the outputs from tokenize(), or will be called directly from the PyText Tensorizer's numberize() function.

Override this function to make it TorchScriptable, i.e. you need to declare concrete input arguments with type hints.

row_size(inputs: pytext.torchscript.utils.ScriptBatchInput) → int[source]
set_device(device: str)[source]
set_padding_control(dimension: str, padding_control: Optional[List[int]])[source]

This function will be called to set a padding style. None: no padding. List: the first element is 0; the sequence length is rounded up to the smallest list element larger than the input length.
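
A tiny sketch of the rounding rule described above (whether the comparison is strict is an assumption of this example, not specified by the docstring):

def padded_length(seq_len, padding_control):
    if padding_control is None:                                      # no padding control configured
        return seq_len
    candidates = [p for p in padding_control[1:] if p >= seq_len]    # skip the leading 0
    return min(candidates) if candidates else seq_len

# padded_length(13, [0, 8, 16, 32]) == 16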

tensorize(*args, **kwargs)[source]

This function will receive a list (e.g. a batch) of outputs from numberize(), pad them, and convert them to output tensors.

Override this function to make it TorchScriptable, i.e. you need to declare concrete input arguments with type hints.

tensorize_wrapper(*args, **kwargs)[source]

This function will receive a list (e.g. a batch) of outputs from numberize(), pad them, and convert them to output tensors.

It will be called in the PyText Tensorizer during training time; this function is not torchscriptable because it depends on cuda.device().

tokenize(*args, **kwargs)[source]

This function will receive the inputs from clients; usually there are two possible inputs: 1) a row of texts: List[str], 2) a row of pre-processed tokens: List[List[str]].

Override this function to make it TorchScriptable, i.e. you need to declare concrete input arguments with type hints.

torchscriptify()[source]
class pytext.data.tensorizers.TokenTensorizer(text_column, tokenizer=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, max_seq_len=None, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Convert text to a list of tokens. Do this based on a tokenizer configuration, and build a vocabulary for numberization. Finally, pad the batch to create a square tensor of the correct size.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.TokenTensorizer.Config)[source]
initialize(vocab_builder=None, from_scratch=True)[source]

Build vocabulary based on training corpus.

numberize(row)[source]

Tokenize, look up in vocabulary.

prepare_input(row)[source]

Tokenize, look up in vocabulary, return tokenized_texts in raw text

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.UidTensorizer(uid_column: str = 'uid', allow_unknown: bool = True, is_input: bool = True)[source]

Bases: pytext.data.tensorizers.Tensorizer

Numberize user IDs which can be either strings or tensors.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.UidTensorizer.Config)[source]
initialize(from_scratch=True)[source]

Look through the dataset for all uids and create a vocab map for them.

numberize(row)[source]

Numberize uids.

tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

class pytext.data.tensorizers.VocabConfig(**kwargs)[source]

Bases: pytext.config.component.Component.Config

build_from_data = True

Whether to add tokens from training data to vocab.

min_counts = 0

Filter out tokens in the training data whose count is smaller than min_counts.

size_from_data = 0

Add size_from_data most frequent tokens in training data to vocab (if this is 0, add all tokens from training data).

vocab_files = []
class pytext.data.tensorizers.VocabFileConfig(**kwargs)[source]

Bases: pytext.config.component.Component.Config

filepath = ''

File containing tokens to add to vocab (first whitespace-separated entry per line)

lowercase_tokens = False

Whether to lowercase each of the tokens in the file

size_limit = 0

The max number of tokens to add to vocab

skip_header_line = False

Whether to skip the first line of the file (e.g. if it is a header line)

pytext.data.tensorizers.initialize_tensorizers(tensorizers, data_source, from_scratch=True)[source]

A utility function to stream a data source to the initialize functions of a dict of tensorizers.
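
Roughly, such a driver primes each initialize() coroutine, sends every row to each of them, and then closes them. The sketch below is illustrative only and is not PyText's actual implementation.

def stream_rows_to_tensorizers(tensorizers, data_source):
    initializers = [t.initialize() for t in tensorizers.values()]
    for init in initializers:
        next(init)                   # prime each coroutine up to its first `yield`
    for row in data_source:          # single pass over the training data
        for init in initializers:
            init.send(row)
    for init in initializers:
        init.close()                 # raises GeneratorExit so each coroutine finalizes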

pytext.data.tensorizers.lookup_tokens(text: str = None, pre_tokenized: List[pytext.data.tokenizers.tokenizer.Token] = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, vocab: pytext.data.utils.Vocabulary = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: str = '__PAD__', use_eos_token_for_bos: bool = False, max_seq_len: int = 1073741824)[source]
pytext.data.tensorizers.to_device(tensorizer_script_impl, device)[source]
pytext.data.tensorizers.tokenize(text: str = None, pre_tokenized: List[pytext.data.tokenizers.tokenizer.Token] = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: str = '__PAD__', use_eos_token_for_bos: bool = False, max_seq_len: int = 1073741824)[source]

pytext.data.token_tensorizer module

class pytext.data.token_tensorizer.ScriptBasedTokenTensorizer(text_column, tokenizer=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, max_seq_len=None, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]

Bases: pytext.data.tensorizers.Tensorizer

An implementation of TokenTensorizer that uses a TorchScript module in the background and is hence torchscriptifiable.

Note that unlike the original TokenTensorizer, this version cannot deal with arbitrarily nested lists of tokens.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.token_tensorizer.ScriptBasedTokenTensorizer.Config)[source]
initialize(vocab_builder=None, from_scratch=True)[source]

Build vocabulary based on training corpus.

numberize(row)[source]

Tokenize and look up in vocabulary.

A few notable things:

1) We’re using the non-torchscriptified tokenizer here. This allows us to use non-torchscriptifiable tokenizers if we don’t intend to torchscriptify this module.

2) When using the ScriptImpl to do the lookup, it takes care of the BOS / EOS stuff there. Hence we don’t need to do that with the tokenizer.

3) The tokenize function from tensorizer.py returns a tuple of (tokens, start_indices, end_indices), while the ScriptImpl expects a list of (token, start_idx, end_idx) tuples, so we need to unzip these.

prepare_input(row)[source]

Tokenize, look up in vocabulary, return tokenized_texts in raw text

Similarly to the above function, tokenization is done with the original and not the torchscriptified tokenizer.

sort_key(row)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

tensorizer_script_impl = None
class pytext.data.token_tensorizer.TokenTensorizerScriptImpl(add_bos_token: bool, add_eos_token: bool, use_eos_token_for_bos: bool, max_seq_len: int, vocab: pytext.data.utils.Vocabulary, tokenizer: Optional[pytext.data.tokenizers.tokenizer.Tokenizer])[source]

Bases: pytext.data.tensorizers.TensorizerScriptImpl

forward(inputs: pytext.torchscript.utils.ScriptBatchInput) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_texts_by_index(texts: Optional[List[List[str]]], index: int) → Optional[str][source]
get_tokens_by_index(tokens: Optional[List[List[List[str]]]], index: int) → Optional[List[str]][source]
numberize(text_tokens: List[Tuple[str, int, int]]) → Tuple[List[int], int, List[Tuple[int, int]]][source]

This function will receive the outputs from tokenize(), or will be called directly from the PyText Tensorizer's numberize() function.

Override this function to make it TorchScriptable, i.e. you need to declare concrete input arguments with type hints.

tensorize(tokens_2d: List[List[int]], seq_lens_1d: List[int], positions_2d: List[List[Tuple[int, int]]]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]

This function will receive a list (e.g. a batch) of outputs from numberize(), pad them, and convert them to output tensors.

Override this function to make it TorchScriptable, i.e. you need to declare concrete input arguments with type hints.

tokenize(row_text: Optional[str], row_pre_tokenized: Optional[List[str]]) → List[Tuple[str, int, int]][source]

This function will receive the inputs from clients; usually there are two possible inputs: 1) a row of texts: List[str], 2) a row of pre-processed tokens: List[List[str]].

Override this function to make it TorchScriptable, i.e. you need to declare concrete input arguments with type hints.

pytext.data.utils module

class pytext.data.utils.VocabBuilder(delimiter=' ')[source]

Bases: object

Helper class for aggregating and building Vocabulary objects.

add(value) → None[source]

Count a single value in the vocabulary.

add_all(values) → None[source]

Count a value or nested container of values in the vocabulary.

add_from_file(file_pointer, skip_header_line, lowercase_tokens, size)[source]
has_added_tokens()[source]
make_vocab() → pytext.data.utils.Vocabulary[source]

Build a Vocabulary object from the values seen by the builder.

truncate_to_vocab_size(vocab_size=-1, min_counts=-1) → None[source]
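
A short usage sketch of the methods listed above (assuming a working PyText installation; exact special-token handling depends on the builder's configuration):

from pytext.data.utils import VocabBuilder

builder = VocabBuilder()
builder.add_all([["where", "do", "you"], ["wanna", "meet", "where"]])   # nested values are counted
builder.add("mpk")
vocab = builder.make_vocab()                     # Vocabulary built from the counted values
print(vocab.lookup_all(["where", "mpk"]))        # indices for the given tokens
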
class pytext.data.utils.Vocabulary(vocab_list: List[str], counts: List[T] = None, replacements: Optional[Dict[str, str]] = None, unk_token: str = '__UNKNOWN__', pad_token: str = '__PAD__', bos_token: str = '__BEGIN_OF_SENTENCE__', eos_token: str = '__END_OF_SENTENCE__', mask_token: str = '__MASK__')[source]

Bases: object

A mapping from indices to vocab elements.

get_bos_index(value=None)[source]
get_eos_index(value=None)[source]
get_mask_index(value=None)[source]
get_pad_index(value=None)[source]
get_unk_index(value=None)[source]
lookup_all(nested_values)[source]
lookup_all_internal(nested_values)[source]

Look up a value or nested container of values in the vocab index. The return value will have the same shape as the input, with all values replaced with their respective indices.

replace_tokens(replacements)[source]

Replace tokens in the vocab with the given replacements. Used for replacing special strings with special tokens, e.g. '[UNK]' for UNK.

pytext.data.utils.align_target_label(targets: List[float], labels: List[str], label_vocab: Dict[str, int]) → List[float][source]

Given targets that are ordered according to labels, align the targets to match the order of label_vocab.

pytext.data.utils.align_target_labels(targets_list: List[List[float]], labels_list: List[List[str]], label_vocab: Dict[str, int]) → List[List[float]][source]

Given targets_list that are ordered according to labels_list, align the targets to match the order of label_vocab.
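
As a standalone illustration of the alignment described above (not PyText's implementation):

def align(targets, labels, label_vocab):
    aligned = [0.0] * len(label_vocab)
    for target, label in zip(targets, labels):
        aligned[label_vocab[label]] = target     # place each target at its vocab position
    return aligned

# align([0.9, 0.1], ["weather", "music"], {"music": 0, "weather": 1}) -> [0.1, 0.9]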

pytext.data.utils.pad(nested_lists, pad_token, pad_shape=None)[source]

Pad the input lists with the pad token. If pad_shape is provided, pad to that shape, otherwise infer the input shape and pad out to a square tensor shape.
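
A simplified standalone sketch of the square-padding behavior (PyText's version additionally accepts an explicit pad_shape):

def pad_square(nested, pad_token):
    if not isinstance(nested, list):
        return nested
    children = [pad_square(item, pad_token) for item in nested]
    if children and isinstance(children[0], list):
        width = max(len(child) for child in children)
        children = [child + [pad_token] * (width - len(child)) for child in children]
    return children

# pad_square([[1, 2, 3], [4]], pad_token=0) -> [[1, 2, 3], [4, 0, 0]]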

pytext.data.utils.pad_and_tensorize(batch, pad_token=0, pad_shape=None, dtype=torch.int64)[source]
pytext.data.utils.shard(rows, rank, num_workers)[source]

Only return every num_workers example for distributed training.
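
A minimal sketch of the round-robin sharding described above (illustrative only):

def shard_rows(rows, rank, num_workers):
    for i, row in enumerate(rows):
        if i % num_workers == rank:      # each worker keeps every num_workers-th row
            yield row

# list(shard_rows(range(10), rank=1, num_workers=3)) -> [1, 4, 7]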

pytext.data.utils.should_iter(i)[source]

Whether or not an object looks like a python iterable (not including strings).

pytext.data.xlm_constants module

pytext.data.xlm_dictionary module

class pytext.data.xlm_dictionary.Dictionary(id2word, word2id, counts)[source]

Bases: object

check_valid()[source]

Check that the dictionary is valid.

index(word, no_unk=False)[source]

Returns the index of the specified word.

static index_data(path, bin_path, dico)[source]

Index sentences with a dictionary.

max_vocab(max_vocab)[source]

Limit the vocabulary size.

min_count(min_count)[source]

Threshold on the word frequency counts.

static read_vocab(vocab_path)[source]

Create a dictionary from a vocabulary file.

pytext.data.xlm_tensorizer module

class pytext.data.xlm_tensorizer.XLMTensorizer(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, language_column: str = 'language', lang2id: Dict[str, int] = {'ar': 0, 'bg': 1, 'de': 2, 'el': 3, 'en': 4, 'es': 5, 'fr': 6, 'hi': 7, 'ru': 8, 'sw': 9, 'th': 10, 'tr': 11, 'ur': 12, 'vi': 13, 'zh': 14}, use_language_embeddings: bool = True, has_language_in_data: bool = False)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizerBase

Tensorizer for Cross-lingual LM tasks. Works for single sentence as well as sentence pair.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.xlm_tensorizer.XLMTensorizer.Config)[source]
get_lang_id(row: Dict[KT, VT], col: str) → int[source]
numberize(row: Dict[KT, VT]) → Tuple[Any, ...][source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

tensorizer_script_impl = None
class pytext.data.xlm_tensorizer.XLMTensorizerScriptImpl(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int, language_vocab: List[str], default_language: str)[source]

Bases: pytext.data.bert_tensorizer.BERTTensorizerBaseScriptImpl

forward(inputs: pytext.torchscript.utils.ScriptBatchInput) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]

Wire up tokenize(), numberize() and tensorize() functions for data processing.

numberize(per_sentence_tokens: List[List[Tuple[str, int, int]]], per_sentence_languages: List[int]) → Tuple[List[int], List[int], int, List[int]][source]

This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.

Parameters:
  • per_sentence_tokens – list of tokens per sentence in one row; each token is represented by its token string, start index and end index.
Returns:

tokens: List[int], a list of token ids, concatenating all sentences’ token ids. segment_labels: List[int], denotes which sentence each token belongs to. seq_len: int, token length. positions: List[int], token positions.

Return type:

Tuple[List[int], List[int], int, List[int]]

Module contents

class pytext.data.AlternatingRandomizedBatchSampler(unnormalized_iterator_probs: Dict[str, float], second_unnormalized_iterator_probs: Dict[str, float])[source]

Bases: pytext.data.batch_sampler.RandomizedBatchSampler

This sampler takes in a dictionary of iterators and returns batches alternating between keys with probabilities specified by unnormalized_iterator_probs and second_unnormalized_iterator_probs. This is used, for example, in XLM pre-training, where we alternate between MLM and TLM batches.

batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.data.batch_sampler.AlternatingRandomizedBatchSampler.Config)[source]
class pytext.data.Batcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16)[source]

Bases: pytext.config.component.Component

Batcher designed to batch rows of data, before padding.

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

Group rows by batch_size. Assume an iterable of dicts, yield dicts of lists. The last batch will be of length len(iterable) % batch_size.
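
The grouping described above can be sketched in isolation as follows (illustrative, not the actual Batcher code):

def group_rows(rows, batch_size):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield {key: [r[key] for r in batch] for key in batch[0]}
            batch = []
    if batch:                            # final, possibly smaller, batch
        yield {key: [r[key] for r in batch] for key in batch[0]}

# next(group_rows([{"text": "hi"}, {"text": "yo"}], batch_size=2)) -> {"text": ["hi", "yo"]}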

classmethod from_config(config: pytext.data.data.Batcher.Config)[source]
class pytext.data.BaseBatchSampler[source]

Bases: pytext.config.component.Component

batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.config.component.Component.Config)[source]
class pytext.data.BatchIterator(batches, processor, include_input=True, include_target=True, include_context=True, is_train=True, num_batches=0)[source]

Bases: object

BatchIterator is a wrapper of TorchText.Iterator that provides flexibility to map batched data to a tuple of (input, target, context) and perform additional steps such as dealing with distributed training.

Parameters:
  • batches (Iterator[TorchText.Batch]) – iterator of TorchText.Batch, which shuffles/batches the data in __iter__ and returns a batch of data in __next__
  • processor – function to run after getting batched data from TorchText.Iterator; the function should define a way to map the data into (input, target, context)
  • include_input (bool) – if input data should be returned, default is true
  • include_target (bool) – if target data should be returned, default is true
  • include_context (bool) – if context data should be returned, default is true
  • is_train (bool) – if the batch data is for training
  • num_batches (int) – total batches to generate; this param is for distributed training. Due to a limitation in PyTorch’s distributed training backend that requires all parallel workers to have the same number of batches, we work around it by adding dummy batches at the end
class pytext.data.CommonMetadata[source]

Bases: object

class pytext.data.Data(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, sort_key: Optional[str] = None, in_memory: Optional[bool] = True, init_tensorizers: Optional[bool] = True, init_tensorizers_from_scratch: Optional[bool] = True)[source]

Bases: pytext.config.component.Component

Data is an abstraction that handles all of the following:

  • Initialize model metadata parameters
  • Create batches of tensors for model training or prediction

It can accomplish these in any way it needs to. The base implementation utilizes pytext.data.sources.DataSource, and sends batches to pytext.data.tensorizers.Tensorizer to create tensors.

The tensorizers dict passed to the initializer should be considered something like a signature for the model. Each batch should be a dictionary with the same keys as the tensorizers dict, and values should be tensors arranged in the way specified by that tensorizer. The tensorizers dict doubles as a simple baseline implementation of that same signature, but subclasses of Data can override the implementation using other methods. This value is how the model specifies what inputs it’s looking for.

add_row_indices(rows)[source]
batches(stage: pytext.common.constants.Stage, data_source=None, load_early=False)[source]

Create batches of tensors to pass to model train_batch. This function yields dictionaries that mirror the tensorizers dict passed to __init__, ie. the keys will be the same, and the tensors will be the shape expected from the respective tensorizers.

stage is used to determine which data source is used to create batches. If data_source is provided, it is used instead of the configured data_source; this allows setting a different data_source for testing a model.

Passing in load_early = True disables loading all data in memory and using PoolingBatcher, so that we get the first batch as quickly as possible.

cache(numberized_rows, stage)[source]
classmethod from_config(config: pytext.data.data.Data.Config, schema: Dict[str, Type[CT_co]], tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], rank=0, world_size=1, init_tensorizers=True, **kwargs)[source]
numberize_rows(rows)[source]
class pytext.data.DataHandler(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, column_mapping: Dict[str, str] = None, **kwargs)[source]

Bases: pytext.config.component.Component

DataHandler is the central place to prepare data for model training/testing. The class is responsible for:

  • Defining the pipeline to process data and generate batches of tensors to be consumed by the model. Each batch is an (input, target, extra_data) tuple, in which input can be fed directly into the model.
  • Initializing global context, such as building the vocab and loading pretrained embeddings; storing the context as metadata, and providing functions to serialize/deserialize the metadata.

The data processing pipeline contains the following steps:

  • Read data from file into a list of raw data examples
  • Convert each row of raw data to a TorchText Example. This logic happens in the process_row function and will:
    • Invoke featurizer, which contains data processing steps to apply for both training and inference time, e.g: tokenization
    • Use the raw data and results from featurizer to do any preprocessing
  • Generate a TorchText.Dataset that contains the list of Examples; the Dataset also has a list of TorchText.Fields, which define how to do padding and numericalization while batching data.
  • Return a BatchIterator which will give a tuple of (input, target, context) tensors for each iteration. By default the tensors have a 1:1 mapping to the TorchText.Field fields, but this behavior can be overwritten by _input_from_batch, _target_from_batch, _context_from_batch functions.
raw_columns

columns to read from data source. The order should match the data stored in that file.

Type:List[str]
featurizer

perform data preprocessing that should be shared between training and inference

Type:Featurizer
features

a dict of name -> field used to process data as model input

Type:Dict[str, Field]
labels

a dict of name -> field used to process data as the training target

Type:Dict[str, Field]
extra_fields

fields that process any extra data used neither as model input nor target. This is None by default

Type:Dict[str, Field]
text_feature_name

name of the text field, used to define the default sort key of data

Type:str
shuffle

if the dataset should be shuffled, true by default

Type:bool
sort_within_batch

if data within same batch should be sorted, true by default

Type:bool
train_path

path of training data file

Type:str
eval_path

path of evaluation data file

Type:str
test_path

path of test data file

Type:str
train_batch_size

training batch size, 128 by default

Type:int
eval_batch_size

evaluation batch size, 128 by default

Type:int
test_batch_size

test batch size, 128 by default

Type:int
max_seq_len

maximum length of tokens to keep in sequence

Type:int
pass_index

if the original index of data in the batch should be passed along to downstream steps, default is true

Type:bool
gen_dataset(data: Iterable[Dict[str, Any]], include_label_fields: bool = True, shard_range: Tuple[int, int] = None) → torchtext.legacy.data.dataset.Dataset[source]

Generate a torchtext Dataset from raw in-memory data. Returns: dataset (TorchText.Dataset)

gen_dataset_from_path(path: str, rank: int = 0, world_size: int = 1, include_label_fields: bool = True, use_cache: bool = True) → torchtext.legacy.data.dataset.Dataset[source]

Generate a dataset from a file. Returns: dataset (TorchText.Dataset)

get_eval_iter()[source]
get_predict_iter(data: Iterable[Dict[str, Any]], batch_size: Optional[int] = None)[source]
get_test_iter()[source]
get_test_iter_from_path(test_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]
get_test_iter_from_raw_data(test_data: List[Dict[str, Any]], batch_size: int) → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1)[source]
get_train_iter_from_path(train_path: str, batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]

Generate data batch iterator for training data. See _get_train_iter() for details

Parameters:
  • train_path (str) – file path of training data
  • batch_size (int) – batch size
  • rank (int) – used for distributed training; the rank of the current GPU. Don’t set it to anything but 0 for non-distributed training
  • world_size (int) – used for distributed training; the total number of GPUs
get_train_iter_from_raw_data(train_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]
init_feature_metadata(train_data: torchtext.legacy.data.dataset.Dataset, eval_data: torchtext.legacy.data.dataset.Dataset, test_data: torchtext.legacy.data.dataset.Dataset)[source]
init_metadata()[source]

Initialize metadata using data from configured path

init_metadata_from_path(train_path, eval_path, test_path)[source]

Initialize metadata using data from file

init_metadata_from_raw_data(*data)[source]

Initialize metadata using in memory data

init_target_metadata(train_data: torchtext.legacy.data.dataset.Dataset, eval_data: torchtext.legacy.data.dataset.Dataset, test_data: torchtext.legacy.data.dataset.Dataset)[source]
load_metadata(metadata: pytext.data.data_handler.CommonMetadata)[source]

Load previously saved metadata

load_vocab(vocab_file, vocab_size, lowercase_tokens: bool = False)[source]

Loads items into a set from a file containing one item per line. Items are added to the set from top of the file to bottom. So, the items in the file should be ordered by a preference (if any), e.g., it makes sense to order tokens in descending order of frequency in corpus.

Parameters:
  • vocab_file (str) – vocab file to load
  • vocab_size (int) – maximum tokens to load, will only load the first n if the actual vocab size is larger than this parameter
  • lowercase_tokens (bool) – if the tokens should be lowercased
metadata_to_save()[source]

Save metadata; pretrained_embeds_weight should be excluded.

preprocess(data: Iterable[Dict[str, Any]])[source]

Preprocess the raw data to create TorchText.Examples; this is the second step in the whole processing pipeline. Returns: data (Generator[Dict[str, Any]])

preprocess_row(row_data: Dict[str, Any]) → Dict[str, Any][source]

Preprocessing steps for a single input row; subclasses should override it.

read_from_file(file_name: str, columns_to_use: Union[Dict[str, int], List[str]]) → Generator[Dict[KT, VT], None, None][source]

Read data from a CSV file. The input file format is required to be tab-separated columns.

Parameters:
  • file_name (str) – csv file name
  • columns_to_use (Union[Dict[str, int], List[str]]) – either a list of column names or a dict of column name -> column index in the file
sort_key(example: torchtext.legacy.data.example.Example) → Any[source]

How to sort data in every batch; the default behavior is to sort by the length of the input text. Parameters: example (Example) – one torchtext example

class pytext.data.DisjointMultitaskData(data_dict: Dict[str, pytext.data.data.Data], samplers: Dict[pytext.common.constants.Stage, pytext.data.batch_sampler.BaseBatchSampler], test_key: str = None, task_key: str = 'task_name')[source]

Bases: pytext.data.data.Data

Wrapper for doing multitask training using multiple data objects. Takes a dictionary of data objects, does round robin over their iterators using BatchSampler.

Parameters:
  • config (Config) – Configuration object of type DisjointMultitaskData.Config.
  • data_dict (Dict[str, Data]) – Data objects to do roundrobin over.
  • *args (type) – Extra arguments to be passed down to sub data handlers.
  • **kwargs (type) – Extra arguments to be passed down to sub data handlers.
data_dict

Data handlers to do roundrobin over.

Type:type
batches(stage: pytext.common.constants.Stage, data_source=None, load_early=False)[source]

Yield batches from each task, sampled according to a given sampler. This batcher additionally exposes a task name in the batch to allow the model to filter examples to the appropriate tasks.

classmethod from_config(config: pytext.data.disjoint_multitask_data.DisjointMultitaskData.Config, data_dict: Dict[str, pytext.data.data.Data], task_key: str = 'task_name', rank=0, world_size=1, init_tensorizers=True)[source]
class pytext.data.DisjointMultitaskDataHandler(config: pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler.Config, data_handlers: Dict[str, pytext.data.data_handler.DataHandler], target_task_name: Optional[str] = None, *args, **kwargs)[source]

Bases: pytext.data.data_handler.DataHandler

Wrapper for doing multitask training using multiple data handlers. Takes a dictionary of data handlers, does round robin over their iterators using RoundRobinBatchIterator.

Parameters:
  • config (Config) – Configuration object of type DisjointMultitaskDataHandler.Config.
  • data_handlers (Dict[str, DataHandler]) – Data handlers to do roundrobin over.
  • target_task_name (Optional[str]) – Used to select best epoch, and set batch_per_epoch.
  • *args (type) – Extra arguments to be passed down to sub data handlers.
  • **kwargs (type) – Extra arguments to be passed down to sub data handlers.
data_handlers

Data handlers to do roundrobin over.

Type:type
target_task_name

Used to select best epoch, and set batch_per_epoch.

Type:type
upsample

If upsample is True, keep cycling over each iterator in round-robin; iterators with fewer batches will get more passes. If False, we do a single pass over each iterator, and the ones which run out will sit idle. This is used for evaluation. Default True.

Type:bool
get_eval_iter() → pytext.data.data_handler.BatchIterator[source]
get_test_iter() → pytext.data.data_handler.BatchIterator[source]
get_train_iter(rank: int = 0, world_size: int = 1) → Tuple[pytext.data.data_handler.BatchIterator, ...][source]
init_metadata()[source]

Initialize metadata using data from configured path

load_metadata(metadata)[source]

Load previously saved metadata

metadata_to_save()[source]

Save metadata; pretrained_embeds_weight should be excluded.

class pytext.data.DynamicPoolingBatcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=1000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]

Bases: pytext.data.data.PoolingBatcher

Allows dynamic batch training; extends PoolingBatcher with a scheduler config, which specifies how the batch size should increase.

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

From an iterable of dicts, yield dicts of lists:

  1. Load num_shuffled_pools pools of data, and shuffle them.
  2. Load a pool (batch_size * pool_num_batches examples).
  3. Sort rows, if necessary.
  4. Shuffle the order in which the batches are returned, if necessary.
compute_dynamic_batch_size(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig, curr_steps: int) → int[source]
finished_dynamic() → bool[source]
classmethod from_config(config: pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher.Config)[source]
get_batch_size(stage: pytext.common.constants.Stage) → int[source]
step_epoch()[source]
class pytext.data.EvalBatchSampler[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes in a dictionary of Iterators and returns batches associated with each key in the dictionary. It guarantees that we will see each batch associated with each key exactly once in the epoch.

Example

Iterator 1: [A, B, C, D], Iterator 2: [a, b]

Output: [A, B, C, D, a, b]

batchify(iterators: Dict[str, collections.abc.Iterator])[source]

Loop through each key in the input dict and generate batches from the iterator associated with that key.

Parameters:iterators – Dictionary of iterators
pytext.data.generator_iterator(fn)[source]

Turn a generator into a GeneratorIterator-wrapped function. Effectively this allows iterating over a generator multiple times by recording the call arguments, and calling the generator with them anew each time __iter__ is called on the returned object.
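
The idea can be sketched outside PyText like this (the wrapper names are hypothetical, not PyText's implementation):

import functools

class ReiterableGenerator:
    def __init__(self, gen_fn, *args, **kwargs):
        self.gen_fn, self.args, self.kwargs = gen_fn, args, kwargs

    def __iter__(self):
        return self.gen_fn(*self.args, **self.kwargs)   # fresh generator on every iteration

def reiterable(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return ReiterableGenerator(fn, *args, **kwargs)
    return wrapper

@reiterable
def numbers(n):
    yield from range(n)

assert list(numbers(3)) == list(numbers(3)) == [0, 1, 2]   # can be iterated repeatedly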

class pytext.data.PoolingBatcher(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=1000, num_shuffled_pools=1)[source]

Bases: pytext.data.data.Batcher

Batcher that shuffles and (if requested) sorts data.

Rationale

There is a trade-off between having batches of data that are truly randomly shuffled, and batches of data that are efficiently padded. If we wanted to maximise the efficiency of padding (i.e. minimise the amount of padding that is needed), we would have to enforce that all inputs of a similar length appear in the same batch. This however would lead to a dramatic decrease in the randomness of batches. On the other end of the spectrum, if we wanted to maximise randomness, we would often end up with inputs of wildly different lengths in the same batch, which would lead to a lot of padding.

Operation

This batcher uses a multi-staged approach.

  1. It first loads a number of “pools” of data, and shuffles them (this is controlled by num_shuffled_pools).
  2. It then splits up the shuffled data sequentially into individual pools, and the examples within each pool are sorted (if requested).
  3. Finally, each pool is split up sequentially into batches, and yielded. If sorting was requested in step #2, the order in which the batches are yielded is randomised.

The size of a pool is expressed as a multiple of the batch size, and is controlled by pool_num_batches.

Examples

Assuming sorting is enabled, with the default settings of pool_num_batches: 1000 and num_shuffled_pools: 1, a pool of 1k * batch_size examples is loaded, sorted by length, and split up into 1k batches. These batches are then yielded in random order. Once they run out, a new pool is loaded, and the process is repeated. An advantage of this approach is that padding will be somewhat reduced. A disadvantage is that, for every epoch, the first 1k batches will be always the same (albeit in a different order).

On the other hand, specifying pool_num_batches: 1000 and num_shuffled_pools: 1000 would achieve the following: 1k * 1k * batch_size examples are loaded, and shuffled. These are then split up into pools of size 1k * batch_size, which are then sorted internally, split into individual batches, and yielded in random order. Compared to the previous example, we no longer have the problem that the first 1k batches are always the same in each epoch, but we’ve had to load in memory 1M examples.
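
For intuition, the staged behavior can be sketched in isolation as below. This is a simplification under the assumptions just described; the real PoolingBatcher also handles stages, epochs and per-stage batch sizes.

import random

def pooled_batches(rows, batch_size, pool_num_batches, num_shuffled_pools, sort_key=None):
    pool_size = batch_size * pool_num_batches
    rows = list(rows)
    mega_pool = pool_size * num_shuffled_pools
    for start in range(0, len(rows), mega_pool):
        chunk = rows[start:start + mega_pool]
        random.shuffle(chunk)                            # 1. shuffle num_shuffled_pools pools together
        for p in range(0, len(chunk), pool_size):
            pool = chunk[p:p + pool_size]                # 2. take one pool
            if sort_key is not None:
                pool.sort(key=sort_key)                  # 3. sort within the pool, if requested
            batches = [pool[b:b + batch_size] for b in range(0, len(pool), batch_size)]
            if sort_key is not None:
                random.shuffle(batches)                  # 4. randomize the order of yielded batches
            yield from batches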

batchify(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]

From an iterable of dicts, yield dicts of lists:

  1. Load num_shuffled_pools pools of data, and shuffle them.
  2. Load a pool (batch_size * pool_num_batches examples).
  3. Sort rows, if necessary.
  4. Shuffle the order in which the batches are returned, if necessary.
classmethod from_config(config: pytext.data.data.PoolingBatcher.Config)[source]
get_batch_size(stage: pytext.common.constants.Stage) → int[source]
class pytext.data.RandomizedBatchSampler(unnormalized_iterator_probs: Dict[str, float])[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes in a dictionary of iterators and returns batches according to the specified probabilities by unnormalized_iterator_probs. We cycle through the iterators (restarting any that “run out”) indefinitely. Set batches_per_epoch in Trainer.Config.

Example

Iterator A: [A, B, C, D], Iterator B: [a, b]

batches_per_epoch = 3, unnormalized_iterator_probs = {“A”: 0, “B”: 1} Epoch 1 = [a, b, a] Epoch 2 = [b, a, b]

Parameters:unnormalized_iterator_probs (Dict[str, float]) – Iterator sampling probabilities. The keys should be the same as the keys of the underlying iterators, and the values will be normalized to sum to 1.
batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.data.batch_sampler.RandomizedBatchSampler.Config)[source]
class pytext.data.RoundRobinBatchSampler(iter_to_set_epoch: Optional[str] = None)[source]

Bases: pytext.data.batch_sampler.BaseBatchSampler

This sampler takes a dictionary of Iterators and returns batches in a round robin fashion till the end of one of the iterators is reached. The end is specified by iter_to_set_epoch.

If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.

If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.

Example

Iterator 1: [A, B, C, D], Iterator 2: [a, b]

iter_to_set_epoch = “Iterator 1” Output: [A, a, B, b, C, a, D, b]

iter_to_set_epoch = None Output: [A, a, B, b]

Parameters:iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If this is not set, epoch size defaults to the length of the shortest iterator.
batchify(iterators: Dict[str, collections.abc.Iterator])[source]

Loop through each key in the input dict and generate batches from the iterator associated with that key until the target iterator reaches its end.

Parameters:iterators – Dictionary of iterators
classmethod from_config(config: pytext.data.batch_sampler.RoundRobinBatchSampler.Config)[source]
class pytext.data.NaturalBatchSampler(dataset_counts: Dict[str, int])[source]

Bases: pytext.data.batch_sampler.RandomizedBatchSampler

This sampler iterates over all the datasets, sampling according to the weighted number of samples in each dataset.

batchify(iterators: Dict[str, collections.abc.Iterator])[source]
classmethod from_config(config: pytext.data.batch_sampler.NaturalBatchSampler.Config)[source]
class pytext.data.Tensorizer(is_input: bool = True)[source]

Bases: pytext.config.component.Component

Tensorizers are a component that converts from batches of pytext.data.type.DataType instances to tensors. These tensors will eventually be inputs to the model, but the model is aware of the tensorizers and can arrange the tensors they create to conform to its needs.

Tensorizers have an initialize function. This function allows the tensorizer to read through the training dataset to build up any data that it needs for creating the model. Commonly this is valuable for things like inferring a vocabulary from the training set, or learning the entire set of training labels, or slot labels, etc.

column_schema

Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.

classmethod from_config(config: pytext.data.tensorizers.Tensorizer.Config)[source]
initialize(from_scratch=True)[source]

The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:

# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...

See WordTokenizer.initialize for a more concrete example.

numberize(row)[source]
prepare_input(row)[source]

Return preprocessed input tensors/blob for caffe2 prediction net.

sort_key(row)[source]
stringify(token_indices)[source]
tensorize(batch)[source]

Tensorizer knows how to pad and tensorize a batch of its own output.

tensorizer_script_impl = None
torchscriptify()[source]