pytext.data package¶
Subpackages¶
- pytext.data.data_structures package
- pytext.data.featurizer package
- pytext.data.sources package
- pytext.data.test package
- Submodules
- pytext.data.test.batch_sampler_test module
- pytext.data.test.data_test module
- pytext.data.test.dynamic_pooling_batcher_test module
- pytext.data.test.mask_tensorizers_test module
- pytext.data.test.pandas_data_source_test module
- pytext.data.test.round_robin_batchiterator_test module
- pytext.data.test.simple_featurizer_test module
- pytext.data.test.tensorizers_test module
- pytext.data.test.tokenizers_test module
- pytext.data.test.tsv_data_source_test module
- pytext.data.test.utils_test module
- Module contents
- pytext.data.tokenizers package
Submodules¶
pytext.data.batch_sampler module¶
-
class
pytext.data.batch_sampler.
AlternatingRandomizedBatchSampler
(unnormalized_iterator_probs: Dict[str, float], second_unnormalized_iterator_probs: Dict[str, float])[source]¶ Bases:
pytext.data.batch_sampler.RandomizedBatchSampler
This sampler takes in a dictionary of iterators and returns batches, alternating between the keys and probabilities specified by unnormalized_iterator_probs and second_unnormalized_iterator_probs. This is used, for example, in XLM pre-training, where we alternate between MLM and TLM batches.
-
class
pytext.data.batch_sampler.
EvalBatchSampler
[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes in a dictionary of iterators and returns the batches associated with each key in the dictionary. It guarantees that each batch associated with each key is seen exactly once per epoch.
Example
Iterator 1: [A, B, C, D], Iterator 2: [a, b]
Output: [A, B, C, D, a, b]
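The behavior in this example can be sketched in plain Python. This is only an illustration of the documented output order, not the library's implementation; the function name eval_batches is hypothetical and lists stand in for batch iterators.

```python
from typing import Dict, Iterable, Iterator, Tuple


def eval_batches(iterators: Dict[str, Iterable[str]]) -> Iterator[Tuple[str, str]]:
    """Yield every batch of every iterator exactly once, one iterator after another."""
    for name, batches in iterators.items():
        for batch in batches:
            yield name, batch


result = [
    batch
    for _, batch in eval_batches(
        {"Iterator 1": ["A", "B", "C", "D"], "Iterator 2": ["a", "b"]}
    )
]
print(result)
```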
-
class
pytext.data.batch_sampler.
NaturalBatchSampler
(dataset_counts: Dict[str, int])[source]¶ Bases:
pytext.data.batch_sampler.RandomizedBatchSampler
This sampler iterates over all the datasets, sampling according to the weighted number of samples in each dataset.
-
class
pytext.data.batch_sampler.
RandomizedBatchSampler
(unnormalized_iterator_probs: Dict[str, float])[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes in a dictionary of iterators and returns batches according to the probabilities specified by unnormalized_iterator_probs. We cycle through the iterators indefinitely, restarting any that “run out”. Set batches_per_epoch in Trainer.Config to define the epoch length.
Example
Iterator A: [A, B, C, D], Iterator B: [a, b]
batches_per_epoch = 3, unnormalized_iterator_probs = {“A”: 0, “B”: 1}
Epoch 1 = [a, b, a]
Epoch 2 = [b, a, b]
Parameters: unnormalized_iterator_probs (Dict[str, float]) – Iterator sampling probabilities. The keys should be the same as the keys of the underlying iterators, and the values will be normalized to sum to 1.
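A minimal sketch of this sampling scheme, assuming non-empty lists in place of real batch iterators. The function name randomized_batches and the rng argument are hypothetical and are not part of PyText; an epoch is modeled by taking batches_per_epoch items from one endless stream.

```python
import random
from itertools import islice
from typing import Dict, Iterator, List, Optional, Tuple


def randomized_batches(
    iterables: Dict[str, List[str]],
    unnormalized_probs: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[str, str]]:
    """Endlessly pick an iterator by probability and yield its next batch."""
    rng = rng or random.Random()
    names = list(iterables)
    total = sum(unnormalized_probs[name] for name in names)
    probs = [unnormalized_probs[name] / total for name in names]  # sums to 1
    live = {name: iter(iterables[name]) for name in names}
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            batch = next(live[name])
        except StopIteration:
            # The chosen iterator ran out: restart it and keep going.
            live[name] = iter(iterables[name])
            batch = next(live[name])
        yield name, batch


stream = randomized_batches(
    {"A": ["A", "B", "C", "D"], "B": ["a", "b"]}, {"A": 0, "B": 1}
)
batches_per_epoch = 3
epoch_1 = [batch for _, batch in islice(stream, batches_per_epoch)]
epoch_2 = [batch for _, batch in islice(stream, batches_per_epoch)]
print(epoch_1, epoch_2)
```

Because iterator A has probability 0 here, only iterator B is ever drawn, and its position carries over from one epoch to the next.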
-
class
pytext.data.batch_sampler.
RoundRobinBatchSampler
(iter_to_set_epoch: Optional[str] = None)[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes a dictionary of iterators and returns batches in round-robin fashion until the end of one of the iterators is reached. Which iterator defines the end is specified by iter_to_set_epoch.
If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.
If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.
Example
Iterator 1: [A, B, C, D], Iterator 2: [a, b]
iter_to_set_epoch = “Iterator 1” Output: [A, a, B, b, C, a, D, b]
iter_to_set_epoch = None Output: [A, a, B, b]
Parameters: iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If this is not set, epoch size defaults to the length of the shortest iterator.
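The two modes can be sketched as follows. This is an illustrative reading of the description above rather than the library's code; round_robin_batches is a hypothetical name, and non-empty lists stand in for batch iterators.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Optional, Tuple


def round_robin_batches(
    iterables: Dict[str, List[str]], iter_to_set_epoch: Optional[str] = None
) -> Iterator[Tuple[str, str]]:
    names = list(iterables)
    if iter_to_set_epoch is not None:
        # One epoch = one full pass over the target; other iterators repeat.
        live = {name: cycle(iterables[name]) for name in names}
        for _ in range(len(iterables[iter_to_set_epoch])):
            for name in names:
                yield name, next(live[name])
        return
    # No target: stop as soon as any iterator (i.e. the shortest) is exhausted.
    live = {name: iter(iterables[name]) for name in names}
    while True:
        this_round = []
        for name in names:
            try:
                this_round.append((name, next(live[name])))
            except StopIteration:
                return
        yield from this_round


data = {"Iterator 1": ["A", "B", "C", "D"], "Iterator 2": ["a", "b"]}
with_target = [b for _, b in round_robin_batches(data, "Iterator 1")]
without_target = [b for _, b in round_robin_batches(data)]
print(with_target, without_target)
```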
-
pytext.data.batch_sampler.
extract_iterator_properties
(input_iterator_probs: Dict[str, float])[source]¶ Helper function for RandomizedBatchSampler and AlternatingRandomizedBatchSampler to generate iterator properties: iterator_names and iterator_probs.
-
pytext.data.batch_sampler.
select_key_and_batch
(iterator_names: Dict[str, str], iterator_probs: Dict[str, float], iter_dict: Dict[str, collections.abc.Iterator], iterators: Dict[str, collections.abc.Iterator])[source]¶ Helper function for RandomizedBatchSampler and AlternatingRandomizedBatchSampler to select a key from iterator_names using iterator_probs and return a batch for the selected key using iter_dict and iterators.
pytext.data.bert_tensorizer module¶
-
class
pytext.data.bert_tensorizer.
BERTTensorizer
(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, **kwargs)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizerBase
Tensorizer for BERT tasks. Works for single sentence, sentence pair, triples etc.
-
classmethod
from_config
(config: pytext.data.bert_tensorizer.BERTTensorizer.Config, **kwargs)[source]¶ from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g. for classes which derive from this class).
-
class
pytext.data.bert_tensorizer.
BERTTensorizerBase
(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, base_tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Base Tensorizer class for all BERT style models including XLM, RoBERTa and XLM-R.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(vocab_builder=None, from_scratch=True)[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
    # set up variables here
    ...
    try:
        # start reading through data source
        while True:
            # row has type Dict[str, types.DataType]
            row = yield
            # update any variables, vocabularies, etc.
            ...
    except GeneratorExit:
        # finalize your initialization, set instance variables, etc.
        ...
See WordTokenizer.initialize for a more concrete example.
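As a self-contained illustration of this coroutine pattern (not PyText's own code), the sketch below builds a toy vocabulary from rows that are sent in one at a time. The names build_vocab_initializer and the "__UNKNOWN__" token are hypothetical choices made for the example.

```python
from typing import Dict, Generator, Set


def build_vocab_initializer(
    column: str, vocab: Set[str]
) -> Generator[None, Dict[str, str], None]:
    """Coroutine that receives rows via send() and collects whitespace tokens."""
    try:
        while True:
            row = yield  # the caller sends one row at a time
            vocab.update(row[column].split())
    except GeneratorExit:
        # Raised by close(): finalize the vocabulary.
        vocab.add("__UNKNOWN__")


vocab: Set[str] = set()
initializer = build_vocab_initializer("text", vocab)
next(initializer)  # advance to the first `yield` so send() can be used
for row in [{"text": "hello world"}, {"text": "hello there"}]:
    initializer.send(row)
initializer.close()  # triggers the GeneratorExit branch
print(sorted(vocab))
```

The caller reads the data source once and never stores it; the coroutine only sees one row at a time.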
-
numberize
(row: Dict[KT, VT]) → Tuple[Any, ...][source]¶ This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.
-
tensorize
(batch) → Tuple[torch.Tensor, ...][source]¶ Convert instance level vectors into batch level tensors.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.bert_tensorizer.
BERTTensorizerBaseScriptImpl
(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int)[source]¶ Bases:
pytext.data.tensorizers.TensorizerScriptImpl
-
forward
(inputs: pytext.torchscript.utils.ScriptBatchInput) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Wire up the tokenize(), numberize() and tensorize() functions for data processing. When exporting to TorchScript, the wrapper module should choose between texts and pre_tokenized based on the TorchScript tokenizer implementation (e.g. whether or not an external tokenizer such as Yoda is used).
-
numberize
(per_sentence_tokens: List[List[Tuple[str, int, int]]]) → Tuple[List[int], List[int], int, List[int]][source]¶ This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.
Parameters: per_sentence_tokens – list of tokens for each sentence in one row; each token is represented by its token string and its start and end indices.
Returns: tokens: List[int], the token ids of all sentences, concatenated. segment_labels: List[int], denotes which sentence each token belongs to. seq_len: int, the number of tokens. positions: List[int], the token positions.
-
tensorize
(tokens_2d: List[List[int]], segment_labels_2d: List[List[int]], seq_lens_1d: List[int], positions_2d: List[List[int]]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Convert instance level vectors into batch level tensors.
-
tokenize
(row_text: Optional[List[str]], row_pre_tokenized: Optional[List[List[str]]]) → List[List[Tuple[str, int, int]]][source]¶ This function converts raw inputs into tokens. Each token is represented by its token string and its start and end indices in the raw inputs. There are two possible inputs to this function, depending on whether the tokenizer is implemented in TorchScript.
Case 1: The tokenizer has a full TorchScript implementation. The input will be a list of sentences (in most cases a single sentence or a pair).
Case 2: The tokenizer has a partial or no TorchScript implementation. In most cases the tokenizer will be hosted in Yoda, and the input will be a list of pre-processed tokens.
Returns: per_sentence_tokens, the tokens for each sentence; each token is represented by its token string and its start and end indices.
-
-
class
pytext.data.bert_tensorizer.
BERTTensorizerScriptImpl
(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizerBaseScriptImpl
-
pytext.data.bert_tensorizer.
build_fairseq_vocab
(vocab_file: str, dictionary_class: fairseq.data.dictionary.Dictionary = <class 'fairseq.data.dictionary.Dictionary'>, special_token_replacements: Dict[str, pytext.common.constants.Token] = None, max_vocab: int = -1, min_count: int = -1, tokens_to_add: Optional[List[str]] = None) → pytext.data.utils.Vocabulary[source]¶ Function builds a PyText vocabulary for models pre-trained using Fairseq modules. The dictionary class can take any Fairseq Dictionary class and is used to load the vocab file.
pytext.data.data module¶
-
class
pytext.data.data.
BatchData
(raw_data, numberized)[source]¶ Bases:
tuple
-
numberized
¶ Alias for field number 1
-
raw_data
¶ Alias for field number 0
-
-
class
pytext.data.data.
Batcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16)[source]¶ Bases:
pytext.config.component.Component
Batcher designed to batch rows of data, before padding.
-
class
pytext.data.data.
Data
(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, sort_key: Optional[str] = None, in_memory: Optional[bool] = True, init_tensorizers: Optional[bool] = True, init_tensorizers_from_scratch: Optional[bool] = True)[source]¶ Bases:
pytext.config.component.Component
Data is an abstraction that handles all of the following:
- Initialize model metadata parameters
- Create batches of tensors for model training or prediction
It can accomplish these in any way it needs to. The base implementation utilizes pytext.data.sources.DataSource, and sends batches to pytext.data.tensorizers.Tensorizer to create tensors.
The tensorizers dict passed to the initializer should be considered something like a signature for the model. Each batch should be a dictionary with the same keys as the tensorizers dict, and values should be tensors arranged in the way specified by that tensorizer. The tensorizers dict doubles as a simple baseline implementation of that same signature, but subclasses of Data can override the implementation using other methods. This value is how the model specifies what inputs it’s looking for.
-
batches
(stage: pytext.common.constants.Stage, data_source=None, load_early=False)[source]¶ Create batches of tensors to pass to model train_batch. This function yields dictionaries that mirror the tensorizers dict passed to __init__, ie. the keys will be the same, and the tensors will be the shape expected from the respective tensorizers.
stage is used to determine which data source is used to create batches. If data_source is provided, it is used instead of the configured data_source; this allows setting a different data_source for testing a model.
Passing in load_early = True disables loading all data in memory and using PoolingBatcher, so that we get the first batch as quickly as possible.
-
class
pytext.data.data.
PoolingBatcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=1000, num_shuffled_pools=1)[source]¶ Bases:
pytext.data.data.Batcher
Batcher that shuffles and (if requested) sorts data.
Rationale
There is a trade-off between having batches of data that are truly randomly shuffled, and batches of data that are efficiently padded. If we wanted to maximise the efficiency of padding (i.e. minimise the amount of padding that is needed), we would have to enforce that all inputs of a similar length appear in the same batch. This however would lead to a dramatic decrease in the randomness of batches. On the other end of the spectrum, if we wanted to maximise randomness, we would often end up with inputs of wildly different lengths in the same batch, which would lead to a lot of padding.
Operation
This batcher uses a multi-staged approach.
- It first loads a number of “pools” of data, and shuffles them (this is controlled by num_shuffled_pools).
- It then splits up the shuffled data sequentially into individual pools, and the examples within each pool are sorted (if requested).
- Finally, each pool is split up sequentially into batches, and yielded. If sorting was requested in step #2, the order in which the batches are yielded is randomised.
The size of a pool is expressed as a multiple of the batch size, and is controlled by pool_num_batches.
Examples
Assuming sorting is enabled, with the default settings of pool_num_batches: 1000 and num_shuffled_pools: 1, a pool of 1k * batch_size examples is loaded, sorted by length, and split up into 1k batches. These batches are then yielded in random order. Once they run out, a new pool is loaded, and the process is repeated. An advantage of this approach is that padding will be somewhat reduced. A disadvantage is that, for every epoch, the first 1k batches will be always the same (albeit in a different order).
On the other hand, specifying pool_num_batches: 1000 and num_shuffled_pools: 1000 would achieve the following: 1k * 1k * batch_size examples are loaded, and shuffled. These are then split up into pools of size 1k * batch_size, which are then sorted internally, split into individual batches, and yielded in random order. Compared to the previous example, we no longer have the problem that the first 1k batches are always the same in each epoch, but we’ve had to load in memory 1M examples.
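The multi-staged operation described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not PoolingBatcher itself: the function name pooled_batches and the rng argument are hypothetical, and examples are plain strings sorted by length.

```python
import random
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def pooled_batches(
    examples: Iterable[str],
    batch_size: int,
    pool_num_batches: int,
    num_shuffled_pools: int,
    sort_key: Optional[Callable[[str], int]] = None,
    rng: Optional[random.Random] = None,
) -> Iterator[List[str]]:
    rng = rng or random.Random()
    pool_size = batch_size * pool_num_batches
    stream = iter(examples)
    while True:
        # Step 1: load num_shuffled_pools pools of data and shuffle them.
        loaded = list(islice(stream, pool_size * num_shuffled_pools))
        if not loaded:
            return
        rng.shuffle(loaded)
        # Step 2: split sequentially into pools, sorting each pool if requested.
        for start in range(0, len(loaded), pool_size):
            pool = loaded[start : start + pool_size]
            if sort_key is not None:
                pool.sort(key=sort_key)
            # Step 3: split the pool into batches; randomize their order if sorted.
            batches = [
                pool[i : i + batch_size] for i in range(0, len(pool), batch_size)
            ]
            if sort_key is not None:
                rng.shuffle(batches)
            yield from batches


sentences = ["w" * n for n in range(1, 13)]
batches = list(
    pooled_batches(
        sentences, batch_size=2, pool_num_batches=3, num_shuffled_pools=2, sort_key=len
    )
)
print(batches)
```

Within each batch the examples have neighboring lengths inside their pool, which is what keeps padding low, while the shuffles before and after sorting preserve some randomness.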
-
batchify
(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]¶ From an iterable of dicts, yield dicts of lists:
- Load num_shuffled_pools pools of data, and shuffle them.
- Load a pool (batch_size * pool_num_batches examples).
- Sort rows, if necessary.
- Shuffle the order in which the batches are returned, if necessary.
-
class
pytext.data.data.
RowData
(raw_data, numberized)[source]¶ Bases:
tuple
-
numberized
¶ Alias for field number 1
-
raw_data
¶ Alias for field number 0
-
-
pytext.data.data.
generator_iterator
(fn)[source]¶ Turn a generator into a GeneratorIterator-wrapped function. Effectively this allows iterating over a generator multiple times by recording the call arguments, and calling the generator with them anew each time __iter__ is called on the returned object.
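The idea can be illustrated with a small stand-alone decorator. This is a sketch of the technique as described, not the library's implementation; the names ReplayableGenerator and replayable are hypothetical.

```python
import functools
from typing import Any, Callable, Iterator


class ReplayableGenerator:
    """Remembers a generator function and its arguments; re-calls it per iteration."""

    def __init__(self, fn: Callable[..., Iterator[Any]], *args: Any, **kwargs: Any):
        self.fn = fn
        self.args = args
        self.kwargs = kwargs

    def __iter__(self) -> Iterator[Any]:
        # A fresh generator is created every time iteration starts.
        return self.fn(*self.args, **self.kwargs)


def replayable(fn: Callable[..., Iterator[Any]]) -> Callable[..., ReplayableGenerator]:
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> ReplayableGenerator:
        return ReplayableGenerator(fn, *args, **kwargs)

    return wrapper


@replayable
def count_up_to(n: int) -> Iterator[int]:
    for i in range(n):
        yield i


numbers = count_up_to(3)
first_pass = list(numbers)
second_pass = list(numbers)  # a plain generator would be empty here
print(first_pass, second_pass)
```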
pytext.data.data_handler module¶
-
class
pytext.data.data_handler.
BatchIterator
(batches, processor, include_input=True, include_target=True, include_context=True, is_train=True, num_batches=0)[source]¶ Bases:
object
BatchIterator is a wrapper of TorchText.Iterator that provides the flexibility to map batched data to a tuple of (input, target, context) and to perform additional steps such as dealing with distributed training.
Parameters: - batches (Iterator[TorchText.Batch]) – iterator of TorchText.Batch, which shuffles/batches the data in __iter__ and return a batch of data in __next__
- processor – function to run after getting batched data from TorchText.Iterator; the function should define a way to map the data into (input, target, context)
- include_input (bool) – if input data should be returned, default is true
- include_target (bool) – if target data should be returned, default is true
- include_context (bool) – if context data should be returned, default is true
- is_train (bool) – if the batch data is for training
- num_batches (int) – total number of batches to generate. This parameter is for distributed training: PyTorch’s distributed training backend requires all parallel workers to have the same number of batches, and we work around that limitation by adding dummy batches at the end.
-
class
pytext.data.data_handler.
DataHandler
(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, column_mapping: Dict[str, str] = None, **kwargs)[source]¶ Bases:
pytext.config.component.Component
DataHandler is the central place to prepare data for model training/testing. The class is responsible for:
- Defining the pipeline that processes data and generates batches of tensors to be consumed by the model. Each batch is an (input, target, extra_data) tuple, in which input can be fed directly into the model.
- Initializing global context, such as building the vocab and loading pretrained embeddings. The context is stored as metadata, and functions are provided to serialize/deserialize the metadata.
The data processing pipeline contains the following steps:
- Read data from file into a list of raw data examples
- Convert each row of raw data to a TorchText Example. This logic happens in the process_row function and will:
- Invoke featurizer, which contains data processing steps to apply for both training and inference time, e.g: tokenization
- Use the raw data and results from featurizer to do any preprocessing
- Generate a TorchText.Dataset that contains the list of Example, the Dataset also has a list of TorchText.Field, which defines how to do padding and numericalization while batching data.
- Return a BatchIterator which will give a tuple of (input, target, context) tensors for each iteration. By default the tensors have a 1:1 mapping to the TorchText.Field fields, but this behavior can be overwritten by _input_from_batch, _target_from_batch, _context_from_batch functions.
-
raw_columns
¶ columns to read from data source. The order should match the data stored in that file.
Type: List[str]
-
featurizer
¶ perform data preprocessing that should be shared between training and inference
Type: Featurizer
-
features
¶ a dict of name -> field that is used to process data as model input
Type: Dict[str, Field]
-
labels
¶ a dict of name -> field that is used to process data as training target
Type: Dict[str, Field]
-
extra_fields
¶ fields that process any extra data used neither as model input nor target. This is None by default
Type: Dict[str, Field]
-
text_feature_name
¶ name of the text field, used to define the default sort key of data
Type: str
-
shuffle
¶ if the dataset should be shuffled, true by default
Type: bool
-
sort_within_batch
¶ if data within same batch should be sorted, true by default
Type: bool
-
train_path
¶ path of training data file
Type: str
-
eval_path
¶ path of evaluation data file
Type: str
-
test_path
¶ path of test data file
Type: str
-
train_batch_size
¶ training batch size, 128 by default
Type: int
-
eval_batch_size
¶ evaluation batch size, 128 by default
Type: int
-
test_batch_size
¶ test batch size, 128 by default
Type: int
-
max_seq_len
¶ maximum length of tokens to keep in sequence
Type: int
-
pass_index
¶ if the original index of data in the batch should be passed along to downstream steps, default is true
Type: bool
-
gen_dataset
(data: Iterable[Dict[str, Any]], include_label_fields: bool = True, shard_range: Tuple[int, int] = None) → torchtext.legacy.data.dataset.Dataset[source]¶ Generate a TorchText Dataset from raw in-memory data.
Returns: dataset (TorchText.Dataset)
-
gen_dataset_from_path
(path: str, rank: int = 0, world_size: int = 1, include_label_fields: bool = True, use_cache: bool = True) → torchtext.legacy.data.dataset.Dataset[source]¶ Generate a dataset from a file.
Returns: dataset (TorchText.Dataset)
-
get_test_iter_from_path
(test_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]¶
-
get_test_iter_from_raw_data
(test_data: List[Dict[str, Any]], batch_size: int) → pytext.data.data_handler.BatchIterator[source]¶
-
get_train_iter_from_path
(train_path: str, batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]¶ Generate data batch iterator for training data. See _get_train_iter() for details
Parameters: - train_path (str) – file path of training data
- batch_size (int) – batch size
- rank (int) – used for distributed training, the rank of the current GPU; don’t set it to anything but 0 for non-distributed training
- world_size (int) – used for distributed training, the total number of GPUs
-
get_train_iter_from_raw_data
(train_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]¶
-
init_feature_metadata
(train_data: torchtext.legacy.data.dataset.Dataset, eval_data: torchtext.legacy.data.dataset.Dataset, test_data: torchtext.legacy.data.dataset.Dataset)[source]¶
-
init_metadata_from_path
(train_path, eval_path, test_path)[source]¶ Initialize metadata using data from file
-
init_target_metadata
(train_data: torchtext.legacy.data.dataset.Dataset, eval_data: torchtext.legacy.data.dataset.Dataset, test_data: torchtext.legacy.data.dataset.Dataset)[source]¶
-
load_metadata
(metadata: pytext.data.data_handler.CommonMetadata)[source]¶ Load previously saved metadata
-
load_vocab
(vocab_file, vocab_size, lowercase_tokens: bool = False)[source]¶ Loads items into a set from a file containing one item per line. Items are added to the set from top of the file to bottom. So, the items in the file should be ordered by a preference (if any), e.g., it makes sense to order tokens in descending order of frequency in corpus.
Parameters: - vocab_file (str) – vocab file to load
- vocab_size (int) – maximum tokens to load, will only load the first n if the actual vocab size is larger than this parameter
- lowercase_tokens (bool) – if the tokens should be lowercased
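A minimal sketch of the loading rule described above, assuming a positive vocab_size and working on an iterable of lines so that it runs without a file on disk. The function name load_vocab_lines is hypothetical, and a list is returned instead of a set only to make the top-to-bottom order visible.

```python
from typing import Dict, Iterable, List


def load_vocab_lines(
    lines: Iterable[str], vocab_size: int, lowercase_tokens: bool = False
) -> List[str]:
    """Keep the first `vocab_size` distinct items, reading top to bottom."""
    vocab: Dict[str, None] = {}  # dict keeps insertion order and drops duplicates
    for line in lines:
        if len(vocab) >= vocab_size:
            break  # only the first n items are loaded
        token = line.strip()
        if not token:
            continue
        if lowercase_tokens:
            token = token.lower()
        vocab.setdefault(token, None)
    return list(vocab)


loaded = load_vocab_lines(
    ["The\n", "the\n", "cat\n", "sat\n"], vocab_size=2, lowercase_tokens=True
)
print(loaded)
```

This is why ordering the file by descending frequency matters: anything below the cutoff is never read.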
-
preprocess
(data: Iterable[Dict[str, Any]])[source]¶ Preprocess the raw data to create TorchText.Example objects. This is the second step in the whole processing pipeline.
Returns: data (Generator[Dict[str, Any]])
-
preprocess_row
(row_data: Dict[str, Any]) → Dict[str, Any][source]¶ Preprocessing steps for a single input row; subclasses should override it.
-
read_from_file
(file_name: str, columns_to_use: Union[Dict[str, int], List[str]]) → Generator[Dict[KT, VT], None, None][source]¶ Read data from a CSV file. The input file must use tab-separated columns.
Parameters: - file_name (str) – csv file name
- columns_to_use (Union[Dict[str, int], List[str]]) – either a list of column names or a dict of column name -> column index in the file
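The two accepted forms of columns_to_use can be illustrated with the standard csv module. This is a sketch under assumptions, not the method's actual body: read_tsv is a hypothetical name, it takes an open text stream rather than a file name, and it disables quote handling.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO, Union


def read_tsv(
    stream: TextIO, columns_to_use: Union[Dict[str, int], List[str]]
) -> Iterator[Dict[str, str]]:
    """Yield one dict per tab-separated row, keyed by column name."""
    if isinstance(columns_to_use, dict):
        name_to_index = dict(columns_to_use)  # explicit name -> column index
    else:
        # A list names the columns in file order.
        name_to_index = {name: i for i, name in enumerate(columns_to_use)}
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        yield {
            name: fields[index]
            for name, index in name_to_index.items()
            if index < len(fields)
        }


content = "greet\thello world\nbye\tsee you\n"
rows_by_list = list(read_tsv(io.StringIO(content), ["label", "text"]))
rows_by_dict = list(read_tsv(io.StringIO(content), {"text": 1}))
print(rows_by_list, rows_by_dict)
```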
pytext.data.dense_retrieval_tensorizer module¶
-
class
pytext.data.dense_retrieval_tensorizer.
BERTContextTensorizerForDenseRetrieval
(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, **kwargs)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizer
Methods numberize() and tensorize() implement https://fburl.com/an4fv7m1.
-
class
pytext.data.dense_retrieval_tensorizer.
PositiveLabelTensorizerForDenseRetrieval
(label_column: str = 'label', allow_unknown: bool = False, pad_in_vocab: bool = False, label_vocab: Optional[List[str]] = None, label_vocab_file: Optional[str] = None, is_input: bool = False, add_labels: Optional[List[str]] = None)[source]¶
-
class
pytext.data.dense_retrieval_tensorizer.
RoBERTaContextTensorizerForDenseRetrieval
(columns: List[str] = ['text'], vocab: Optional[pytext.data.utils.Vocabulary] = None, tokenizer: Optional[pytext.data.tokenizers.tokenizer.Tokenizer] = None, max_seq_len: int = 256)[source]¶ Bases:
pytext.data.dense_retrieval_tensorizer.BERTContextTensorizerForDenseRetrieval
,pytext.data.roberta_tensorizer.RoBERTaTensorizer
-
classmethod
from_config
(config: pytext.data.dense_retrieval_tensorizer.RoBERTaContextTensorizerForDenseRetrieval.Config)[source]¶ from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g. for classes which derive from this class).
pytext.data.disjoint_multitask_data module¶
-
class
pytext.data.disjoint_multitask_data.
DisjointMultitaskData
(data_dict: Dict[str, pytext.data.data.Data], samplers: Dict[pytext.common.constants.Stage, pytext.data.batch_sampler.BaseBatchSampler], test_key: str = None, task_key: str = 'task_name')[source]¶ Bases:
pytext.data.data.Data
Wrapper for doing multitask training using multiple data objects. Takes a dictionary of data objects, does round robin over their iterators using BatchSampler.
Parameters: - config (Config) – Configuration object of type DisjointMultitaskData.Config.
- data_dict (Dict[str, Data]) – Data objects to do roundrobin over.
- *args (type) – Extra arguments to be passed down to sub data handlers.
- **kwargs (type) – Extra arguments to be passed down to sub data handlers.
-
data_dict
¶ Data handlers to do roundrobin over.
Type: type
pytext.data.disjoint_multitask_data_handler module¶
-
class
pytext.data.disjoint_multitask_data_handler.
DisjointMultitaskDataHandler
(config: pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler.Config, data_handlers: Dict[str, pytext.data.data_handler.DataHandler], target_task_name: Optional[str] = None, *args, **kwargs)[source]¶ Bases:
pytext.data.data_handler.DataHandler
Wrapper for doing multitask training using multiple data handlers. Takes a dictionary of data handlers, does round robin over their iterators using RoundRobinBatchIterator.
Parameters: - config (Config) – Configuration object of type DisjointMultitaskDataHandler.Config.
- data_handlers (Dict[str, DataHandler]) – Data handlers to do roundrobin over.
- target_task_name (Optional[str]) – Used to select best epoch, and set batch_per_epoch.
- *args (type) – Extra arguments to be passed down to sub data handlers.
- **kwargs (type) – Extra arguments to be passed down to sub data handlers.
-
data_handlers
¶ Data handlers to do roundrobin over.
Type: type
-
target_task_name
¶ Used to select best epoch, and set batch_per_epoch.
Type: type
-
upsample
¶ If True, keep cycling over each iterator in round-robin; iterators with fewer batches will get more passes. If False, do a single pass over each iterator; the ones which run out will sit idle. This is used for evaluation. Default True.
Type: bool
-
class
pytext.data.disjoint_multitask_data_handler.
RoundRobinBatchIterator
(iterators: Dict[str, pytext.data.data_handler.BatchIterator], upsample: bool = True, iter_to_set_epoch: Optional[str] = None)[source]¶ Bases:
pytext.data.data_handler.BatchIterator
We take a dictionary of BatchIterators and do round robin over them in a cycle. The below describes the behavior for one epoch, with the example
Iterator 1: [A, B, C, D], Iterator 2: [a, b]
- If upsample is True:
If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.
iter_to_set_epoch = “Iterator 1” Output: [A, a, B, b, C, a, D, b]
If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.
Output: [A, a, B, b]
- If upsample is False:
Iterate over batches from one epoch of each iterator, with the order among iterators uniformly shuffled.
Possible output: [a, A, B, C, b, D]
Parameters: - iterators (Dict[str, BatchIterator]) – Iterators to do roundrobin over.
- upsample (bool) – If True, keep cycling over each iterator in round-robin; iterators with fewer batches will get more passes. If False, do a single pass over each iterator, in random order. Evaluation will use upsample=False. Default True.
- iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If upsample is True and this is not set, epoch size defaults to the length of the shortest iterator. If upsample is False, this argument is not used.
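The upsample=False case can be sketched as a single pass in which only the interleaving between iterators is random, while each iterator keeps its own batch order. This is one reading of the description above, not the class's code; single_pass_shuffled is a hypothetical name and lists stand in for BatchIterators.

```python
import random
from typing import Dict, Iterator, List, Optional, Tuple


def single_pass_shuffled(
    iterables: Dict[str, List[str]], rng: Optional[random.Random] = None
) -> Iterator[Tuple[str, str]]:
    rng = rng or random.Random()
    # One schedule slot per batch, labeled with the iterator it comes from.
    schedule = [name for name, batches in iterables.items() for _ in batches]
    rng.shuffle(schedule)
    live = {name: iter(batches) for name, batches in iterables.items()}
    for name in schedule:
        yield name, next(live[name])


output = [
    batch
    for _, batch in single_pass_shuffled(
        {"Iterator 1": ["A", "B", "C", "D"], "Iterator 2": ["a", "b"]}
    )
]
print(output)
```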
-
iterators
¶ Iterators to do roundrobin over.
Type: Dict[str, BatchIterator]
-
upsample
¶ Whether to upsample iterators with fewer batches.
Type: bool
-
iter_to_set_epoch
¶ Name of iterator to define epoch size.
Type: str
pytext.data.dynamic_pooling_batcher module¶
-
class
pytext.data.dynamic_pooling_batcher.
BatcherSchedulerConfig
(**kwargs)[source]¶ Bases:
pytext.config.module_config.Module.Config
-
end_batch_size
= 256¶
-
epoch_period
= 10¶
-
start_batch_size
= 32¶
-
step_size
= 1¶
-
-
class
pytext.data.dynamic_pooling_batcher.
DynamicPoolingBatcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=1000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]¶ Bases:
pytext.data.data.PoolingBatcher
Allows dynamic batch training. Extends the pooling batcher with a scheduler config, which specifies how the batch size should increase.
-
batchify
(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]¶ From an iterable of dicts, yield dicts of lists:
- Load num_shuffled_pools pools of data, and shuffle them.
- Load a pool (batch_size * pool_num_batches examples).
- Sort rows, if necessary.
- Shuffle the order in which the batches are returned, if necessary.
-
compute_dynamic_batch_size
(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig, curr_steps: int) → int[source]¶
-
-
class
pytext.data.dynamic_pooling_batcher.
ExponentialBatcherSchedulerConfig
(**kwargs)[source]¶ Bases:
pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig
-
gamma
= 5¶
-
-
class
pytext.data.dynamic_pooling_batcher.
ExponentialDynamicPoolingBatcher
(*args, **kwargs)[source]¶ Bases:
pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher
Exponential Dynamic Batch Scheduler: scales up batch size by a factor of gamma
-
class
pytext.data.dynamic_pooling_batcher.
LinearDynamicPoolingBatcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=1000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]¶ Bases:
pytext.data.dynamic_pooling_batcher.DynamicPoolingBatcher
Linear Dynamic Batch Scheduler: scales up batch size linearly
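To make the difference between the linear and exponential schedules concrete, here is a hedged sketch. The formulas are an assumption about how the config fields (start_batch_size, end_batch_size, epoch_period, step_size, gamma) might combine; they are not taken from the library's source, and both function names are hypothetical.

```python
import math


def linear_batch_size(
    curr_steps: int,
    start_batch_size: int = 32,
    end_batch_size: int = 256,
    epoch_period: int = 10,
    step_size: int = 1,
) -> int:
    """Assumed rule: grow from start to end in equal increments per step."""
    total_steps = epoch_period / step_size
    estimate = (
        start_batch_size
        + (end_batch_size - start_batch_size) * curr_steps / total_steps
    )
    return min(end_batch_size, math.ceil(estimate))


def exponential_batch_size(
    curr_steps: int,
    start_batch_size: int = 32,
    end_batch_size: int = 256,
    gamma: int = 5,
) -> int:
    """Assumed rule: multiply by gamma at every step, capped at end_batch_size."""
    return min(end_batch_size, math.ceil(start_batch_size * gamma**curr_steps))


print([linear_batch_size(s) for s in (0, 5, 10)])
print([exponential_batch_size(s) for s in (0, 1, 2)])
```

Under these assumptions the linear schedule reaches the cap only at the end of the period, while the exponential one hits it after two steps with the default gamma.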
pytext.data.masked_tensorizer module¶
-
class
pytext.data.masked_tensorizer.
MaskedTokenTensorizer
(text_column, mask, tokenizer=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, max_seq_len=None, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]¶
pytext.data.masked_util module¶
-
class
pytext.data.masked_util.
NoOpMaskingFunction
(seed: Optional[int], minimum_masks: int, use_bos: bool, use_eos: bool)[source]¶
-
class
pytext.data.masked_util.
RandomizedMaskingFunction
(seed: Optional[int], minimum_masks: int, use_bos: bool, use_eos: bool)[source]¶
pytext.data.packed_lm_data module¶
-
class
pytext.data.packed_lm_data.
PackedLMData
(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, max_seq_len: int = 128, sort_key: Optional[str] = None, language: Optional[str] = None, in_memory: Optional[bool] = False, init_tensorizers: Optional[bool] = True)[source]¶ Bases:
pytext.data.data.Data
Special purpose Data object which assumes a single text tensorizer. Packs tokens into a square batch with no padding. Used for LM training. The object also takes in an optional language argument which is used for cross-lingual LM training.
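To illustrate what packing tokens into a square batch with no padding can look like, here is a minimal sketch. It assumes rows are already numberized into lists of ints, and that leftover tokens that do not fill a whole batch are dropped; `pack_tokens` and `batch_size` are hypothetical names, not part of the PackedLMData API.

```python
from typing import Iterable, Iterator, List


def pack_tokens(
    token_rows: Iterable[List[int]], max_seq_len: int, batch_size: int
) -> Iterator[List[List[int]]]:
    """Concatenate token rows into one stream and cut it into full batches."""
    buffer: List[int] = []
    block = max_seq_len * batch_size
    for tokens in token_rows:
        buffer.extend(tokens)
        # Emit a batch whenever enough tokens have accumulated.
        while len(buffer) >= block:
            chunk, buffer = buffer[:block], buffer[block:]
            yield [chunk[i:i + max_seq_len] for i in range(0, block, max_seq_len)]


packed = list(pack_tokens([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]], max_seq_len=2, batch_size=2))
```

Every emitted batch has exactly `batch_size` rows of `max_seq_len` tokens, so no pad tokens are needed; sentence boundaries may fall anywhere inside a row.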
pytext.data.roberta_tensorizer module¶
-
class
pytext.data.roberta_tensorizer.
RoBERTaTensorizer
(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, **kwargs)[source]¶
-
class
pytext.data.roberta_tensorizer.
RoBERTaTokenLevelTensorizer
(columns, tokenizer=None, vocab=None, max_seq_len=256, labels_columns=['label'], labels=[])[source]¶ Bases:
pytext.data.roberta_tensorizer.RoBERTaTensorizer
Tensorizer for token-level classification tasks such as NER, POS, etc. using RoBERTa. Here each token has an associated label, and the tensorizer should output a label tensor as well. The input for this tensorizer comes from the CoNLLUNERDataSource data source.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
classmethod
from_config
(config: pytext.data.roberta_tensorizer.RoBERTaTokenLevelTensorizer.Config)[source]¶
-
numberize
(row: Dict[KT, VT]) → Tuple[Any, ...][source]¶ Numberize both the tokens and labels. Since we break up tokens, the label for anything other than the first sub-word is assigned the padding idx.
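The label assignment described above can be sketched as follows. This is an illustration only: `align_labels_to_subwords`, the toy `chunk4` splitter, and the value chosen for `pad_idx` are hypothetical stand-ins for the real tokenizer and vocabulary.

```python
from typing import Callable, Dict, List, Tuple


def align_labels_to_subwords(
    words: List[str],
    labels: List[str],
    split_word: Callable[[str], List[str]],
    label_vocab: Dict[str, int],
    pad_idx: int,
) -> Tuple[List[str], List[int]]:
    """Give the first sub-word of each word its label; pad the remaining sub-words."""
    pieces: List[str] = []
    label_ids: List[int] = []
    for word, label in zip(words, labels):
        sub_words = split_word(word)
        if not sub_words:
            continue
        pieces.extend(sub_words)
        label_ids.append(label_vocab[label])
        label_ids.extend([pad_idx] * (len(sub_words) - 1))
    return pieces, label_ids


def chunk4(word: str) -> List[str]:
    # Toy sub-word splitter (assumption): cut a word into 4-character pieces.
    return [word[i:i + 4] for i in range(0, len(word), 4)]


pieces, label_ids = align_labels_to_subwords(
    ["Johanson", "lives"], ["B-PER", "O"], chunk4, {"O": 0, "B-PER": 1}, pad_idx=-1
)
```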
-
pytext.data.squad_for_bert_tensorizer module¶
-
class
pytext.data.squad_for_bert_tensorizer.
SquadForBERTTensorizer
(answers_column: str = 'answers', answer_starts_column: str = 'answer_starts', **kwargs)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizer
Produces BERT inputs and answer spans for Squad.
-
SPAN_PAD_IDX
= -100¶
-
classmethod
from_config
(config: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer.Config, **kwargs)[source]¶ from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes that derive from this class).
-
-
class
pytext.data.squad_for_bert_tensorizer.
SquadForBERTTensorizerForKD
(start_logits_column='start_logits', end_logits_column='end_logits', has_answer_logits_column='has_answer_logits', pad_mask_column='pad_mask', segment_labels_column='segment_labels', **kwargs)[source]¶ Bases:
pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer
-
classmethod
from_config
(config: pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizerForKD.Config, **kwargs)[source]¶ from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes that derive from this class).
-
classmethod
-
class
pytext.data.squad_for_bert_tensorizer.
SquadForRoBERTaTensorizer
(answers_column: str = 'answers', answer_starts_column: str = 'answer_starts', **kwargs)[source]¶ Bases:
pytext.data.roberta_tensorizer.RoBERTaTensorizer
,pytext.data.squad_for_bert_tensorizer.SquadForBERTTensorizer
Produces RoBERTa inputs and answer spans for Squad.
-
classmethod
from_config
(config: pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizer.Config, **kwargs)[source]¶ from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes that derive from this class).
-
classmethod
-
class
pytext.data.squad_for_bert_tensorizer.
SquadForRoBERTaTensorizerForKD
(start_logits_column='start_logits', end_logits_column='end_logits', has_answer_logits_column='has_answer_logits', pad_mask_column='pad_mask', segment_labels_column='segment_labels', **kwargs)[source]¶ Bases:
pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizer
-
classmethod
from_config
(config: pytext.data.squad_for_bert_tensorizer.SquadForRoBERTaTensorizerForKD.Config, **kwargs)[source]¶ from_config parses the config associated with the tensorizer and creates both the tokenizer and the Vocabulary object. The extra arguments passed as kwargs allow us to reuse this function with a variable number of arguments (e.g., for classes that derive from this class).
-
classmethod
pytext.data.squad_tensorizer module¶
-
class
pytext.data.squad_tensorizer.
SquadTensorizer
(doc_tensorizer: pytext.data.tensorizers.TokenTensorizer, ques_tensorizer: pytext.data.tensorizers.TokenTensorizer, doc_column: str = 'doc', ques_column: str = 'question', answers_column: str = 'answers', answer_starts_column: str = 'answer_starts', **kwargs)[source]¶ Bases:
pytext.data.tensorizers.TokenTensorizer
Produces inputs and answer spans for Squad.
-
SPAN_PAD_IDX
= -100¶
-
classmethod
from_config
(config: pytext.data.squad_tensorizer.SquadTensorizer.Config, **kwargs)[source]¶
-
-
class
pytext.data.squad_tensorizer.
SquadTensorizerForKD
(start_logits_column='start_logits', end_logits_column='end_logits', has_answer_logits_column='has_answer_logits', pad_mask_column='pad_mask', segment_labels_column='segment_labels', **kwargs)[source]¶
pytext.data.tensorizers module¶
-
class
pytext.data.tensorizers.
AnnotationNumberizer
(column: str = 'seqlogical', vocab=None, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Not really a Tensorizer (since it does not create tensors) but technically serves the same function. This class parses Annotations in the format below and extracts the actions (type List[List[int]])
[IN:GET_ESTIMATED_DURATION How long will it take to [SL:METHOD_TRAVEL drive ] from [SL:SOURCE Chicago ] to [SL:DESTINATION Mississippi ] ]
Extraction algorithm is handled by Annotation class. We only care about the list of actions, which before vocab index lookups would look like:
[ IN:GET_ESTIMATED_DURATION, SHIFT, SHIFT, SHIFT, SHIFT, SHIFT, SHIFT, SL:METHOD_TRAVEL, SHIFT, REDUCE, SHIFT, SL:SOURCE, SHIFT, REDUCE, SHIFT, SL:DESTINATION, SHIFT, REDUCE, ]
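A minimal sketch of how such an action list can be derived from the bracketed annotation is shown below. It is not the Annotation class's algorithm: `annotation_to_actions` is a hypothetical helper that assumes brackets and words are separated by whitespace, exactly as in the example above. Note that this sketch also emits a final REDUCE for the outermost closing bracket, which the listing above does not show.

```python
from typing import List


def annotation_to_actions(annotation: str) -> List[str]:
    """Map each whitespace-separated piece to a non-terminal, SHIFT, or REDUCE."""
    actions: List[str] = []
    for piece in annotation.split():
        if piece.startswith("[IN:") or piece.startswith("[SL:"):
            # Opening an intent or slot: emit its label without the bracket.
            actions.append(piece[1:])
        elif piece == "]":
            # Closing the most recently opened intent or slot.
            actions.append("REDUCE")
        else:
            # An ordinary word is shifted.
            actions.append("SHIFT")
    return actions


actions = annotation_to_actions(
    "[IN:GET_ESTIMATED_DURATION How long will it take to [SL:METHOD_TRAVEL drive ] "
    "from [SL:SOURCE Chicago ] to [SL:DESTINATION Mississippi ] ]"
)
```

The tensorizer then looks each action up in a vocabulary to obtain integer indices.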
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
ByteTensorizer
(text_column, lower=True, max_seq_len=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, is_input=True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Turn characters into a sequence of int8 bytes. One character will have one or more bytes, depending on its encoding.
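For instance, assuming UTF-8 as the encoding (an assumption; the encoding is not stated here), the character-to-byte conversion might look like this hypothetical helper:

```python
from typing import List, Optional


def text_to_bytes(text: str, lower: bool = True, max_seq_len: Optional[int] = None) -> List[int]:
    """Encode text as UTF-8 and return the byte values, optionally truncated."""
    if lower:
        text = text.lower()
    data = list(text.encode("utf-8"))
    if max_seq_len is not None:
        # Truncating by bytes may cut a multi-byte character in half.
        data = data[:max_seq_len]
    return data


ascii_bytes = text_to_bytes("Hi")   # two ASCII characters, one byte each
accent_bytes = text_to_bytes("é")   # one character, two bytes in UTF-8
```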
-
NUM
= 256¶
-
PAD_BYTE
= 0¶
-
UNK_BYTE
= 0¶
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
ByteTokenTensorizer
(text_column, tokenizer=None, max_seq_len=None, max_byte_len=15, offset_for_non_padding=0, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, is_input=True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Turn words into 2-dimensional tensors of int8 bytes. Words are padded to max_byte_len. Also computes sequence lengths (1-D tensor) and token lengths (2-D tensor). 0 is the pad byte.
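A minimal sketch of the per-row output, assuming UTF-8 encoding and ignoring `offset_for_non_padding` and BOS/EOS handling; `words_to_byte_matrix` is a hypothetical name used only for this illustration:

```python
from typing import List, Tuple


def words_to_byte_matrix(
    words: List[str], max_byte_len: int = 15, pad_byte: int = 0
) -> Tuple[List[List[int]], int, List[int]]:
    """Return (byte rows padded to max_byte_len, number of words, bytes per word)."""
    rows: List[List[int]] = []
    token_lens: List[int] = []
    for word in words:
        data = list(word.encode("utf-8"))[:max_byte_len]
        token_lens.append(len(data))
        rows.append(data + [pad_byte] * (max_byte_len - len(data)))
    return rows, len(words), token_lens


byte_rows, seq_len, token_lens = words_to_byte_matrix(["hi", "é"], max_byte_len=3)
```

Stacking these per-row results over a batch gives the 1-D sequence lengths and 2-D token lengths mentioned above.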
-
NUM_BYTES
= 256¶
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
CharacterTokenTensorizer
(max_char_length: int = 20, **kwargs)[source]¶ Bases:
pytext.data.tensorizers.TokenTensorizer
Turn words into 2-dimensional tensors of ints based on their ASCII values. Words are padded to the maximum word length (also capped at max_char_length). Sequence lengths are the length of each token, 0 for the pad token.
-
initialize
(from_scratch=True)¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
    # set up variables here
    ...
    try:
        # start reading through data source
        while True:
            # row has type Dict[str, types.DataType]
            row = yield
            # update any variables, vocabularies, etc.
            ...
    except GeneratorExit:
        # finalize your initialization, set instance variables, etc.
        ...
See WordTokenizer.initialize for a more concrete example.
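To make the coroutine pattern concrete, here is a small self-contained sketch. `ToyVocabTensorizer` is a hypothetical class written for this illustration (it is not a PyText class); it builds a vocabulary from a text column in a single pass and finalizes it when the generator is closed.

```python
from collections import Counter


class ToyVocabTensorizer:
    """Hypothetical tensorizer that learns a vocab via the initialize coroutine."""

    def __init__(self, column: str):
        self.column = column
        self.vocab = None

    def initialize(self):
        # Set up variables here.
        counts = Counter()
        try:
            while True:
                # Each row is sent in by the caller; nothing is stored besides counts.
                row = yield
                counts.update(row[self.column].lower().split())
        except GeneratorExit:
            # Finalize: most frequent tokens get the lowest indices after UNK and PAD.
            self.vocab = {"UNK": 0, "PAD": 1}
            for token, _ in counts.most_common():
                self.vocab[token] = len(self.vocab)


tensorizer = ToyVocabTensorizer("text")
initializer = tensorizer.initialize()
next(initializer)  # advance the coroutine to its first `yield`
for row in [{"text": "a b a"}, {"text": "b a c"}]:
    initializer.send(row)
initializer.close()  # raises GeneratorExit inside the coroutine
```

Because the caller drives the loop, several tensorizers can consume the same stream of rows without the data source being read more than once.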
-
-
class
pytext.data.tensorizers.
CharacterVocabTokenTensorizer
(text_column, tokenizer=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, max_seq_len=None, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Turn words into 2-dimensional tensors of ints based on the char vocab. Words are padded to the maximum word length (also capped at max_char_length). Sequence lengths are the length of each token.
The difference from pytext.data.tensorizers.CharacterTokenTensorizer is that CharacterTokenTensorizer uses the ASCII value and does not require building a vocab. Here we tensorize based on the vocab.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
classmethod
from_config
(config: pytext.data.tensorizers.CharacterVocabTokenTensorizer.Config)[source]¶
-
initialize
(vocab_builder=None, from_scratch=True)[source]¶ Build vocabulary based on training corpus.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.tensorizers.
CharacterVocabTokenTensorizerScriptImpl
(add_bos_token: bool, add_eos_token: bool, use_eos_token_for_bos: bool, max_seq_len: int, vocab: pytext.data.utils.Vocabulary, tokenizer: Optional[pytext.data.tokenizers.tokenizer.Tokenizer])[source]¶ Bases:
pytext.data.tensorizers.TensorizerScriptImpl
-
forward
(inputs: pytext.torchscript.utils.ScriptBatchInput) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
get_tokens_by_index
(tokens: Optional[List[List[List[str]]]], index: int) → Optional[List[str]][source]¶
-
numberize
(char_tokens: List[List[str]], char_tokens_lengths: List[int]) → Tuple[List[List[int]], List[int]][source]¶ This function receives the outputs of tokenize(), or is called directly from the PyText Tensorizer's numberize().
Override this function to make it TorchScriptable; for example, you need to declare concrete input arguments with type hints.
-
tensorize
(tokens: List[List[List[int]]], tokens_lengths: List[List[int]]) → Tuple[torch.Tensor, torch.Tensor][source]¶ This function receives a list (e.g., a batch) of outputs from numberize(), pads them, and converts them to output tensors.
Override this function to make it TorchScriptable; for example, you need to declare concrete input arguments with type hints.
-
tokenize
(row_text: Optional[str] = None, row_pre_tokenized: Optional[List[str]] = None) → Tuple[List[List[str]], List[int]][source]¶ This function receives the inputs from clients. There are usually two possible inputs: (1) a row of texts, List[str], or (2) a row of pre-processed tokens, List[List[str]].
Override this function to make it TorchScriptable; for example, you need to declare concrete input arguments with type hints.
-
-
class
pytext.data.tensorizers.
Float1DListTensorizer
(config: pytext.data.tensorizers.Float1DListTensorizer.Config, **kwargs)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Tensorizes a 1D list of floats (List[float]).
TODO: Although very similar, ‘FloatListTensorizer’ currently does not support this vanilla case of tensorizing List[float]. If ‘FloatListTensorizer’ accommodates this case in the future, this separate tensorizer will no longer be needed.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
classmethod
from_config
(config: pytext.data.tensorizers.Float1DListTensorizer.Config, **kwargs)[source]¶
-
initialize
(from_scratch=True)[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
    # set up variables here
    ...
    try:
        # start reading through data source
        while True:
            # row has type Dict[str, types.DataType]
            row = yield
            # update any variables, vocabularies, etc.
            ...
    except GeneratorExit:
        # finalize your initialization, set instance variables, etc.
        ...
See WordTokenizer.initialize for a more concrete example.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.tensorizers.
FloatListSeqTensorizer
(column: str, error_check: bool, dim: Optional[int], pad_token: float = -1.0, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize numeric labels.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.tensorizers.
FloatListTensorizer
(column: str, error_check: bool, dim: Optional[int], normalize: bool, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize numeric labels.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
()[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
    # set up variables here
    ...
    try:
        # start reading through data source
        while True:
            # row has type Dict[str, types.DataType]
            row = yield
            # update any variables, vocabularies, etc.
            ...
    except GeneratorExit:
        # finalize your initialization, set instance variables, etc.
        ...
See WordTokenizer.initialize for a more concrete example.
-
-
class
pytext.data.tensorizers.
FloatTensorizer
(column: str, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
A tensorizer for reading in scalars from the data.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
GazetteerTensorizer
(text_column: str = 'text', dict_column: str = 'dict', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Create 3 tensors for dict features.
- idx: index of feature in token order.
- weights: weight of feature in token order.
- lens: number of features per token.
For each input token, there will be the same number of idx and weights entries. (equal to the max number of features any token has in this row). The values in lens will tell how many of these features are actually used per token.
Input format for the dict column is json and should be a list of dictionaries containing the “features” and their weight for each relevant “tokenIdx”. Example:
    text: "Order coffee from Starbucks please"
    dict: [
        {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
        {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}}
    ]
If we assume this vocab:
    vocab = {
        UNK: 0,
        PAD: 1,
        "drink/beverage": 2,
        "music/song": 3,
        "store/coffee_shop": 4
    }
this example will result in these tensors:
    idx = [1, 1, 2, 3, 1, 1, 4, 1, 1, 1]
    weights = [0.0, 0.0, 0.8, 0.2, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    lens = [1, 2, 1, 1, 1]
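The example above can be reproduced with a short sketch. `gazetteer_features` is a hypothetical helper, not the tensorizer's actual code; it assumes whitespace tokenization and that a token without features is represented by a single PAD entry with weight 0.0, which is what the `lens` values in the example suggest.

```python
import json
from typing import Dict, List, Tuple


def gazetteer_features(
    tokens: List[str], dict_json: str, vocab: Dict[str, int]
) -> Tuple[List[int], List[float], List[int]]:
    """Build flat idx/weights lists (padded per token) and per-token lens."""
    per_token = {entry["tokenIdx"]: entry["features"] for entry in json.loads(dict_json)}
    pad = (vocab["PAD"], 0.0)
    rows = []
    for position in range(len(tokens)):
        features = per_token.get(position, {})
        if features:
            rows.append([(vocab.get(name, vocab["UNK"]), weight) for name, weight in features.items()])
        else:
            # Assumption: a token with no features still gets one PAD feature.
            rows.append([pad])
    lens = [len(row) for row in rows]
    width = max(lens)
    idx: List[int] = []
    weights: List[float] = []
    for row in rows:
        padded = row + [pad] * (width - len(row))
        idx.extend(feature_idx for feature_idx, _ in padded)
        weights.extend(weight for _, weight in padded)
    return idx, weights, lens


vocab = {"UNK": 0, "PAD": 1, "drink/beverage": 2, "music/song": 3, "store/coffee_shop": 4}
dict_column = json.dumps([
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}},
])
idx, weights, lens = gazetteer_features(
    "Order coffee from Starbucks please".split(), dict_column, vocab
)
```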
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(from_scratch=True)[source]¶ Look through the dataset for all dict features to create vocab.
-
class
pytext.data.tensorizers.
Integer1DListTensorizer
(config: pytext.data.tensorizers.Integer1DListTensorizer.Config, **kwargs)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Tensorizes the 1d list of integers – List[int]
-
SPAN_PAD_IDX
= 0¶
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
classmethod
from_config
(config: pytext.data.tensorizers.Integer1DListTensorizer.Config, **kwargs)[source]¶
-
initialize
(from_scratch=True)[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
    # set up variables here
    ...
    try:
        # start reading through data source
        while True:
            # row has type Dict[str, types.DataType]
            row = yield
            # update any variables, vocabularies, etc.
            ...
    except GeneratorExit:
        # finalize your initialization, set instance variables, etc.
        ...
See WordTokenizer.initialize for a more concrete example.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.tensorizers.
LabelListRankTensorizer
(*args, pad_missing: bool = False, **kwargs)[source]¶ Bases:
pytext.data.tensorizers.LabelTensorizer
LabelListRankTensorizer takes a list of a single array with [[labelA, rankA], [labelB, rankB], …] as input and generates a tuple of tensors (label_idx, list_length). Example:
    Input: ["["weather","1"]","["business","1"]"]
    Output of size len(vocab) {"timer", "weather", "business"} => [0, 1, 1]
This would suggest both labels are of equal rank.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
LabelListTensorizer
(*args, pad_missing: bool = False, **kwargs)[source]¶ Bases:
pytext.data.tensorizers.LabelTensorizer
LabelListTensorizer takes a list of labels as input and generates a tuple of tensors (label_idx, list_length).
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
LabelTensorizer
(label_column: str = 'label', allow_unknown: bool = False, pad_in_vocab: bool = False, label_vocab: Optional[List[str]] = None, label_vocab_file: Optional[str] = None, is_input: bool = False, add_labels: Optional[List[str]] = None)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize labels. Label can be used as either input or target.
NB: if the labels are used as targets for binary classification with a loss such as cosine distance, the order of the label_vocab does matter, and it should be [negative_class, positive_class].
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
MetricTensorizer
(names: List[str], indexes: List[int], is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
A tensorizer that uses other tensorizers’ numberized data. Used mostly for metric reporting.
-
class
pytext.data.tensorizers.
NtokensTensorizer
(names: List[str], indexes: List[int], is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.MetricTensorizer
A tensorizer that references another tensorizer’s numberized data to calculate the number of tokens. Used for calculating tokens per second.
-
class
pytext.data.tensorizers.
NumericLabelTensorizer
(label_column: str = 'label', rescale_range: Optional[List[float]] = None, is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize numeric labels.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
SeqTokenTensorizer
(column: str = 'text_seq', tokenizer=None, add_bos_token: bool = False, add_eos_token: bool = False, use_eos_token_for_bos: bool = False, add_bol_token: bool = False, add_eol_token: bool = False, use_eol_token_for_bol: bool = False, max_seq_len=None, vocab=None, is_input: bool = True, max_turn=50)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Tensorize a sequence of sentences. The input is a list of strings, like this one:
["where do you wanna meet?", "MPK"]
If we assume this vocab:
    vocab = {
        UNK: 0, PAD: 1,
        'where': 2, 'do': 3, 'you': 4, 'wanna': 5, 'meet?': 6, 'mpk': 7
    }
this example will result in these tensors:
    idx = [[2, 3, 4, 5, 6], [7, 1, 1, 1, 1]]
    sentence_len = [5, 1]
    seq_len = [2]
If you’re using BOS, EOS, BOL and EOL, the vocab will look like this:
    vocab = {
        UNK: 0, PAD: 1, BOS: 2, EOS: 3, BOL: 4, EOL: 5,
        'where': 6, 'do': 7, 'you': 8, 'wanna': 9, 'meet?': 10, 'mpk': 11
    }
this example will result in these tensors:
    idx = [
        [2, 4, 3, 1, 1, 1, 1],
        [2, 6, 7, 8, 9, 10, 3],
        [2, 11, 3, 1, 1, 1, 1],
        [2, 5, 3, 1, 1, 1, 1]
    ]
    sentence_len = [3, 7, 3, 3]
    seq_len = [4]
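The first example (without BOS, EOS, BOL and EOL) can be reproduced with the following sketch. `tensorize_sentences` is a hypothetical helper that assumes lowercasing and whitespace tokenization; it returns plain lists rather than tensors and handles a single row, so `seq_len` is a scalar here.

```python
from typing import Dict, List, Tuple


def tensorize_sentences(
    sentences: List[str], vocab: Dict[str, int]
) -> Tuple[List[List[int]], List[int], int]:
    """Numberize each sentence, pad to the longest one, and report lengths."""
    rows = [
        [vocab.get(token, vocab["UNK"]) for token in sentence.lower().split()]
        for sentence in sentences
    ]
    sentence_len = [len(row) for row in rows]
    width = max(sentence_len)
    idx = [row + [vocab["PAD"]] * (width - len(row)) for row in rows]
    return idx, sentence_len, len(rows)


vocab = {"UNK": 0, "PAD": 1, "where": 2, "do": 3, "you": 4, "wanna": 5, "meet?": 6, "mpk": 7}
idx, sentence_len, seq_len = tensorize_sentences(["where do you wanna meet?", "MPK"], vocab)
```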
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
SlotLabelTensorizer
(slot_column: str = 'slots', text_column: str = 'text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, allow_unknown: bool = False, is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize word/slot labels.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(from_scratch=True)[source]¶ Look through the dataset for all labels and create a vocab map for them.
-
-
class
pytext.data.tensorizers.
SlotLabelTensorizerExpansible
(slot_column: str = 'slots', text_column: str = 'text', tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, allow_unknown: bool = False, is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.SlotLabelTensorizer
Create a base SlotLabelTensorizer to support selecting different types in ModelInput.
-
class
pytext.data.tensorizers.
SoftLabelTensorizer
(label_column: str = 'label', allow_unknown: bool = False, pad_in_vocab: bool = False, label_vocab: Optional[List[str]] = None, probs_column: str = 'target_probs', logits_column: str = 'target_logits', labels_column: str = 'target_labels', label_vocab_file: Optional[str] = None, is_input: bool = False)[source]¶ Bases:
pytext.data.tensorizers.LabelTensorizer
Handles numberizing labels for knowledge distillation. This still requires the same label column as LabelTensorizer for the “true” label, but also processes soft “probabilistic” labels generated from a teacher model, via three new columns.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
String2DListTensorizer
(column, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(from_scratch=True)[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
    # set up variables here
    ...
    try:
        # start reading through data source
        while True:
            # row has type Dict[str, types.DataType]
            row = yield
            # update any variables, vocabularies, etc.
            ...
    except GeneratorExit:
        # finalize your initialization, set instance variables, etc.
        ...
See WordTokenizer.initialize for a more concrete example.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.tensorizers.
String2DListTensorizerScriptImpl
(vocab: pytext.data.utils.Vocabulary)[source]¶ Bases:
pytext.data.tensorizers.TensorizerScriptImpl
-
forward
(inputs: List[List[List[str]]]) → Tuple[torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
numberize
(tokens: List[List[str]]) → Tuple[List[List[int]], List[int], int][source]¶ This function receives the outputs of tokenize(), or is called directly from the PyText Tensorizer's numberize().
Override this function to make it TorchScriptable; for example, you need to declare concrete input arguments with type hints.
-
tensorize
(tokens_3d: List[List[List[int]]], seq_lens_2d: List[List[int]], seq_lens_1d: List[int]) → Tuple[torch.Tensor, torch.Tensor][source]¶ This function receives a list (e.g., a batch) of outputs from numberize(), pads them, and converts them to output tensors.
Override this function to make it TorchScriptable; for example, you need to declare concrete input arguments with type hints.
-
-
class
pytext.data.tensorizers.
Tensorizer
(is_input: bool = True)[source]¶ Bases:
pytext.config.component.Component
A tensorizer is a component that converts batches of pytext.data.type.DataType instances to tensors. These tensors will eventually be inputs to the model; the model is aware of its tensorizers and can arrange the tensors they create to fit its inputs.
Tensorizers have an initialize function. This function allows the tensorizer to read through the training dataset to build up any data that it needs for creating the model. Commonly this is valuable for things like inferring a vocabulary from the training set, or learning the entire set of training labels, or slot labels, etc.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(from_scratch=True)[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
    # set up variables here
    ...
    try:
        # start reading through data source
        while True:
            # row has type Dict[str, types.DataType]
            row = yield
            # update any variables, vocabularies, etc.
            ...
    except GeneratorExit:
        # finalize your initialization, set instance variables, etc.
        ...
See WordTokenizer.initialize for a more concrete example.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.tensorizers.
TensorizerScriptImpl
[source]¶ Bases:
torch.nn.modules.module.Module
-
get_tokens_by_index
(tokens: Optional[List[List[List[str]]]], index: int) → Optional[List[List[str]]][source]¶
-
numberize
(*args, **kwargs)[source]¶ This function receives the outputs of tokenize(), or is called directly from the PyText Tensorizer's numberize().
Override this function to make it TorchScriptable; for example, you need to declare concrete input arguments with type hints.
-
set_padding_control
(dimension: str, padding_control: Optional[List[int]])[source]¶ This function is called to set a padding style:
- None: no padding.
- List: the first element is 0; the sequence length is rounded up to the smallest list element larger than the input length.
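The rounding rule for the list case might be sketched as follows. `padded_length` is a hypothetical helper; it assumes the list is sorted in ascending order, treats an element equal to the input length as large enough, and falls back to the raw length when the input exceeds every list element. None of these edge cases are specified in the description above.

```python
from typing import List, Optional


def padded_length(seq_len: int, padding_control: Optional[List[int]]) -> int:
    """Return the length a sequence would be padded to under padding_control."""
    if padding_control is None:
        # No padding control: keep the natural length.
        return seq_len
    for bound in padding_control:
        if bound >= seq_len:
            return bound
    # Assumption: longer inputs are left at their own length.
    return seq_len


lengths = [padded_length(n, [0, 4, 8, 16]) for n in (3, 5, 8, 20)]
```

Rounding to a small set of fixed lengths limits the number of distinct tensor shapes the model sees.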
-
tensorize
(*args, **kwargs)[source]¶ This function receives a list (e.g., a batch) of outputs from numberize(), pads them, and converts them to output tensors.
Override this function to make it TorchScriptable; for example, you need to declare concrete input arguments with type hints.
-
tensorize_wrapper
(*args, **kwargs)[source]¶ This function receives a list (e.g., a batch) of outputs from numberize(), pads them, and converts them to output tensors.
It is called in the PyText Tensorizer at training time. This function is not TorchScriptable because it depends on cuda.device().
-
tokenize
(*args, **kwargs)[source]¶ This function receives the inputs from clients. There are usually two possible inputs: (1) a row of texts, List[str], or (2) a row of pre-processed tokens, List[List[str]].
Override this function to make it TorchScriptable; for example, you need to declare concrete input arguments with type hints.
-
-
class
pytext.data.tensorizers.
TokenTensorizer
(text_column, tokenizer=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, max_seq_len=None, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Convert text to a list of tokens. Do this based on a tokenizer configuration, and build a vocabulary for numberization. Finally, pad the batch to create a square tensor of the correct size.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
UidTensorizer
(uid_column: str = 'uid', allow_unknown: bool = True, is_input: bool = True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
Numberize user IDs which can be either strings or tensors.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
-
class
pytext.data.tensorizers.
VocabConfig
(**kwargs)[source]¶ Bases:
pytext.config.component.Component.Config
-
build_from_data
= True¶ Whether to add tokens from training data to vocab.
-
min_counts
= 0¶ Filter out tokens in the training data whose count is smaller than min_counts.
-
size_from_data
= 0¶ Add size_from_data most frequent tokens in training data to vocab (if this is 0, add all tokens from training data).
-
vocab_files
= []¶
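As a rough illustration of how `min_counts` and `size_from_data` could interact, the sketch below selects tokens from a token stream. The function name `select_tokens` is hypothetical, and the order in which the two filters are applied is an assumption, not something the documentation above states.

```python
from collections import Counter
from typing import Iterable, List


def select_tokens(
    tokens: Iterable[str], min_counts: int = 0, size_from_data: int = 0
) -> List[str]:
    """Hypothetical helper mirroring the VocabConfig options."""
    counts = Counter(tokens)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [tok for tok, n in counts.most_common() if n >= min_counts]
    # size_from_data == 0 means "keep every token that passed the filter".
    if size_from_data > 0:
        ranked = ranked[:size_from_data]
    return ranked
```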
-
-
class
pytext.data.tensorizers.
VocabFileConfig
(**kwargs)[source]¶ Bases:
pytext.config.component.Component.Config
-
filepath
= ''¶ File containing tokens to add to vocab (first whitespace-separated entry per line)
-
lowercase_tokens
= False¶ Whether to lowercase each of the tokens in the file
-
size_limit
= 0¶ The max number of tokens to add to vocab
-
skip_header_line
= False¶ Whether to skip the first line of the file (e.g. if it is a header line)
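The four options above could be applied to a vocab file roughly as follows. This is a sketch only: `read_vocab_tokens` is a hypothetical name, it reads from any iterable of lines rather than from `filepath`, and treating `size_limit = 0` as "no limit" is an assumption.

```python
import io
from typing import Iterable, List


def read_vocab_tokens(
    lines: Iterable[str],
    lowercase_tokens: bool = False,
    size_limit: int = 0,
    skip_header_line: bool = False,
) -> List[str]:
    """Hypothetical reader: first whitespace-separated entry of each line."""
    tokens: List[str] = []
    for line_no, line in enumerate(lines):
        if skip_header_line and line_no == 0:
            continue
        # Assumption: size_limit == 0 means no limit.
        if size_limit and len(tokens) >= size_limit:
            break
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        token = parts[0]
        tokens.append(token.lower() if lowercase_tokens else token)
    return tokens


sample = io.StringIO("token count\nHello 10\nWorld 5\n")
vocab_tokens = read_vocab_tokens(sample, lowercase_tokens=True, skip_header_line=True)
```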
-
-
pytext.data.tensorizers.
initialize_tensorizers
(tensorizers, data_source, from_scratch=True)[source]¶ A utility function to stream a data source to the initialize functions of a dict of tensorizers.
-
pytext.data.tensorizers.
lookup_tokens
(text: str = None, pre_tokenized: List[pytext.data.tokenizers.tokenizer.Token] = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, vocab: pytext.data.utils.Vocabulary = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: str = '__PAD__', use_eos_token_for_bos: bool = False, max_seq_len: int = 1073741824)[source]¶
-
pytext.data.tensorizers.
tokenize
(text: str = None, pre_tokenized: List[pytext.data.tokenizers.tokenizer.Token] = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, bos_token: Optional[str] = None, eos_token: Optional[str] = None, pad_token: str = '__PAD__', use_eos_token_for_bos: bool = False, max_seq_len: int = 1073741824)[source]¶
pytext.data.token_tensorizer module¶
-
class
pytext.data.token_tensorizer.
ScriptBasedTokenTensorizer
(text_column, tokenizer=None, add_bos_token=False, add_eos_token=False, use_eos_token_for_bos=False, max_seq_len=None, vocab_config=None, vocab=None, vocab_file_delimiter=' ', is_input=True)[source]¶ Bases:
pytext.data.tensorizers.Tensorizer
An implementation of TokenTensorizer that uses a TorchScript module in the background and is hence torchscriptifiable.
Note that unlike the original TokenTensorizer, this version cannot deal with arbitrarily nested lists of tokens.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
classmethod
from_config
(config: pytext.data.token_tensorizer.ScriptBasedTokenTensorizer.Config)[source]¶
-
initialize
(vocab_builder=None, from_scratch=True)[source]¶ Build vocabulary based on training corpus.
-
numberize
(row)[source]¶ Tokenize and look up in vocabulary.
A few notable things:
1) We’re using the non-torchscriptified tokenizer here. This allows us to use non-torchscriptifiable tokenizers if we don’t intend to torchscriptify this module.
2) When using the ScriptImpl to do the lookup, it takes care of the BOS / EOS handling there. Hence we don’t need to do that with the tokenizer.
3) The tokenize function from tensorizer.py returns a tuple of (tokens, start_indices, end_indices), while the ScriptImpl expects a list of (token, start_idx, end_idx) tuples, so we need to transpose (zip) these.
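The conversion mentioned in point 3 is a plain transpose. A minimal sketch with made-up data (the variable names are illustrative, not PyText's):

```python
# Tuple of three parallel lists, as described for tokenize().
tokens = ["hello", "world"]
start_indices = [0, 6]
end_indices = [5, 11]
tokenized = (tokens, start_indices, end_indices)

# zip(*...) turns it into one (token, start_idx, end_idx) tuple per token.
per_token = list(zip(*tokenized))
```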
-
prepare_input
(row)[source]¶ Tokenize, look up in vocabulary, return tokenized_texts in raw text
Similarly to the above function, tokenization is done with the original and not the torchscriptified tokenizer.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.token_tensorizer.
TokenTensorizerScriptImpl
(add_bos_token: bool, add_eos_token: bool, use_eos_token_for_bos: bool, max_seq_len: int, vocab: pytext.data.utils.Vocabulary, tokenizer: Optional[pytext.data.tokenizers.tokenizer.Tokenizer])[source]¶ Bases:
pytext.data.tensorizers.TensorizerScriptImpl
-
forward
(inputs: pytext.torchscript.utils.ScriptBatchInput) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the
Module
instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
-
get_tokens_by_index
(tokens: Optional[List[List[List[str]]]], index: int) → Optional[List[str]][source]¶
-
numberize
(text_tokens: List[Tuple[str, int, int]]) → Tuple[List[int], int, List[Tuple[int, int]]][source]¶ This function receives the outputs from tokenize(), or is called directly from the PyText Tensorizer function numberize().
Override this function to be TorchScriptable, i.e. declare concrete input arguments with type hints.
-
tensorize
(tokens_2d: List[List[int]], seq_lens_1d: List[int], positions_2d: List[List[Tuple[int, int]]]) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor][source]¶ This function receives a list (e.g. a batch) of outputs from numberize(), pads them, and converts them to output tensors.
Override this function to be TorchScriptable, i.e. declare concrete input arguments with type hints.
-
tokenize
(row_text: Optional[str], row_pre_tokenized: Optional[List[str]]) → List[Tuple[str, int, int]][source]¶ This function receives the inputs from clients. There are usually two possible inputs: 1) a row of texts: List[str]; 2) a row of pre-processed tokens: List[List[str]].
Override this function to be TorchScriptable, i.e. declare concrete input arguments with type hints.
-
pytext.data.utils module¶
-
class
pytext.data.utils.
VocabBuilder
(delimiter=' ')[source]¶ Bases:
object
Helper class for aggregating and building Vocabulary objects.
-
class
pytext.data.utils.
Vocabulary
(vocab_list: List[str], counts: List[T] = None, replacements: Optional[Dict[str, str]] = None, unk_token: str = '__UNKNOWN__', pad_token: str = '__PAD__', bos_token: str = '__BEGIN_OF_SENTENCE__', eos_token: str = '__END_OF_SENTENCE__', mask_token: str = '__MASK__')[source]¶ Bases:
object
A mapping from indices to vocab elements.
-
pytext.data.utils.
align_target_label
(targets: List[float], labels: List[str], label_vocab: Dict[str, int]) → List[float][source]¶ Given targets that are ordered according to labels, align the targets to match the order of label_vocab.
-
pytext.data.utils.
align_target_labels
(targets_list: List[List[float]], labels_list: List[List[str]], label_vocab: Dict[str, int]) → List[List[float]][source]¶ Given targets_list that are ordered according to labels_list, align the targets to match the order of label_vocab.
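The alignment both functions describe can be illustrated as below. The helper names `align_one` and `align_many` are hypothetical, and the sketch assumes that `label_vocab` maps each label to its position in the output and that every label appears in it.

```python
from typing import Dict, List


def align_one(
    targets: List[float], labels: List[str], label_vocab: Dict[str, int]
) -> List[float]:
    """Reorder targets so position i holds the target of the label with index i."""
    aligned = [0.0] * len(label_vocab)
    for target, label in zip(targets, labels):
        aligned[label_vocab[label]] = target
    return aligned


def align_many(
    targets_list: List[List[float]],
    labels_list: List[List[str]],
    label_vocab: Dict[str, int],
) -> List[List[float]]:
    """Apply align_one to each (targets, labels) pair."""
    return [
        align_one(targets, labels, label_vocab)
        for targets, labels in zip(targets_list, labels_list)
    ]
```

For example, targets `[0.1, 0.9]` given in label order `["b", "a"]` would come back as `[0.9, 0.1]` for a vocab of `{"a": 0, "b": 1}`.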
-
pytext.data.utils.
pad
(nested_lists, pad_token, pad_shape=None)[source]¶ Pad the input lists with the pad token. If pad_shape is provided, pad to that shape, otherwise infer the input shape and pad out to a square tensor shape.
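To make the two modes concrete (explicit `pad_shape` versus an inferred shape), here is a small recursive sketch. It is not the library's implementation: `infer_pad_shape` and `pad_nested` are hypothetical names, and the sketch assumes all leaves sit at the same nesting depth.

```python
from typing import Any, List, Optional


def infer_pad_shape(nested: Any) -> List[int]:
    """Return the maximum length found at each nesting level."""
    if not isinstance(nested, (list, tuple)):
        return []
    child_shapes = [infer_pad_shape(item) for item in nested]
    depth = max((len(shape) for shape in child_shapes), default=0)
    inner = [
        max(shape[level] for shape in child_shapes if len(shape) > level)
        for level in range(depth)
    ]
    return [len(nested)] + inner


def pad_nested(nested: Any, pad_token: Any, pad_shape: Optional[List[int]] = None) -> Any:
    """Pad nested lists with pad_token out to pad_shape (inferred if None)."""
    if pad_shape is None:
        pad_shape = infer_pad_shape(nested)
    if not pad_shape:
        return nested  # reached a leaf value
    size, rest = pad_shape[0], list(pad_shape[1:])
    padded = [pad_nested(item, pad_token, rest) for item in nested]
    # Append fully padded fillers until this level reaches the target size.
    for _ in range(size - len(padded)):
        padded.append(pad_nested([], pad_token, rest) if rest else pad_token)
    return padded
```

Under these assumptions, `pad_nested([[1, 2, 3], [4]], 0)` yields `[[1, 2, 3], [4, 0, 0]]`.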
pytext.data.xlm_constants module¶
pytext.data.xlm_dictionary module¶
pytext.data.xlm_tensorizer module¶
-
class
pytext.data.xlm_tensorizer.
XLMTensorizer
(columns: List[str] = ['text'], vocab: pytext.data.utils.Vocabulary = None, tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer = None, max_seq_len: int = 256, language_column: str = 'language', lang2id: Dict[str, int] = {'ar': 0, 'bg': 1, 'de': 2, 'el': 3, 'en': 4, 'es': 5, 'fr': 6, 'hi': 7, 'ru': 8, 'sw': 9, 'th': 10, 'tr': 11, 'ur': 12, 'vi': 13, 'zh': 14}, use_language_embeddings: bool = True, has_language_in_data: bool = False)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizerBase
Tensorizer for Cross-lingual LM tasks. Works for single sentence as well as sentence pair.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
numberize
(row: Dict[KT, VT]) → Tuple[Any, ...][source]¶ This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.
-
tensorizer_script_impl
= None¶
-
-
class
pytext.data.xlm_tensorizer.
XLMTensorizerScriptImpl
(tokenizer: pytext.data.tokenizers.tokenizer.Tokenizer, vocab: pytext.data.utils.Vocabulary, max_seq_len: int, language_vocab: List[str], default_language: str)[source]¶ Bases:
pytext.data.bert_tensorizer.BERTTensorizerBaseScriptImpl
-
forward
(inputs: pytext.torchscript.utils.ScriptBatchInput) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor][source]¶ Wire up tokenize(), numberize() and tensorize() functions for data processing.
-
numberize
(per_sentence_tokens: List[List[Tuple[str, int, int]]], per_sentence_languages: List[int]) → Tuple[List[int], List[int], int, List[int]][source]¶ This function contains logic for converting tokens into ids based on the specified vocab. It also outputs, for each instance, the vectors needed to run the actual model.
Parameters: - per_sentence_tokens – list of tokens per sentence level in one row, each token represented by token string, start and end indices.
Returns: - tokens: List[int], a list of token ids, concatenating the token ids of all sentences.
- segment_labels: List[int], denotes which sentence each token belongs to.
- seq_len: int, tokens length.
- positions: List[int], token positions.
-
Module contents¶
-
class
pytext.data.
AlternatingRandomizedBatchSampler
(unnormalized_iterator_probs: Dict[str, float], second_unnormalized_iterator_probs: Dict[str, float])[source]¶ Bases:
pytext.data.batch_sampler.RandomizedBatchSampler
This sampler takes in a dictionary of iterators and returns batches, alternating between the keys and probabilities specified by unnormalized_iterator_probs and second_unnormalized_iterator_probs. This is used, for example, in XLM pre-training, where we alternate between MLM and TLM batches.
-
class
pytext.data.
Batcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16)[source]¶ Bases:
pytext.config.component.Component
Batcher designed to batch rows of data, before padding.
-
class
pytext.data.
BatchIterator
(batches, processor, include_input=True, include_target=True, include_context=True, is_train=True, num_batches=0)[source]¶ Bases:
object
BatchIterator is a wrapper of TorchText.Iterator that provides flexibility to map batched data to a tuple of (input, target, context), along with additional steps such as dealing with distributed training.
Parameters: - batches (Iterator[TorchText.Batch]) – iterator of TorchText.Batch, which shuffles/batches the data in __iter__ and returns a batch of data in __next__
- processor – function to run after getting batched data from TorchText.Iterator; the function should define a way to map the data into (input, target, context)
- include_input (bool) – if input data should be returned, default is true
- include_target (bool) – if target data should be returned, default is true
- include_context (bool) – if context data should be returned, default is true
- is_train (bool) – if the batch data is for training
- num_batches (int) – total batches to generate. This param is for distributed training: PyTorch’s distributed training backend requires all parallel workers to have the same number of batches, so we work around this limitation by adding dummy batches at the end
-
class
pytext.data.
Data
(data_source: pytext.data.sources.data_source.DataSource, tensorizers: Dict[str, pytext.data.tensorizers.Tensorizer], batcher: pytext.data.data.Batcher = None, sort_key: Optional[str] = None, in_memory: Optional[bool] = True, init_tensorizers: Optional[bool] = True, init_tensorizers_from_scratch: Optional[bool] = True)[source]¶ Bases:
pytext.config.component.Component
Data is an abstraction that handles all of the following:
- Initialize model metadata parameters
- Create batches of tensors for model training or prediction
It can accomplish these in any way it needs to. The base implementation utilizes pytext.data.sources.DataSource, and sends batches to pytext.data.tensorizers.Tensorizer to create tensors.
The tensorizers dict passed to the initializer should be considered something like a signature for the model. Each batch should be a dictionary with the same keys as the tensorizers dict, and values should be tensors arranged in the way specified by that tensorizer. The tensorizers dict doubles as a simple baseline implementation of that same signature, but subclasses of Data can override the implementation using other methods. This value is how the model specifies what inputs it’s looking for.
-
batches
(stage: pytext.common.constants.Stage, data_source=None, load_early=False)[source]¶ Create batches of tensors to pass to model train_batch. This function yields dictionaries that mirror the tensorizers dict passed to __init__, ie. the keys will be the same, and the tensors will be the shape expected from the respective tensorizers.
stage is used to determine which data source is used to create batches. If data_source is provided, it is used instead of the configured data source; this allows setting a different data_source for testing a model.
Passing in load_early = True disables loading all data in memory and using PoolingBatcher, so that we get the first batch as quickly as possible.
-
class
pytext.data.
DataHandler
(raw_columns: List[str], labels: Dict[str, pytext.fields.field.Field], features: Dict[str, pytext.fields.field.Field], featurizer: pytext.data.featurizer.featurizer.Featurizer, extra_fields: Dict[str, pytext.fields.field.Field] = None, text_feature_name: str = 'word_feat', shuffle: bool = True, sort_within_batch: bool = True, train_path: str = 'train.tsv', eval_path: str = 'eval.tsv', test_path: str = 'test.tsv', train_batch_size: int = 128, eval_batch_size: int = 128, test_batch_size: int = 128, max_seq_len: int = -1, pass_index: bool = True, column_mapping: Dict[str, str] = None, **kwargs)[source]¶ Bases:
pytext.config.component.Component
DataHandler is the central place to prepare data for model training/testing. The class is responsible for:
- Defining the pipeline that processes data and generates batches of tensors to be consumed by the model. Each batch is an (input, target, extra_data) tuple, in which input can be fed directly into the model.
- Initializing global context, such as building the vocab and loading pretrained embeddings; storing the context as metadata; and providing functions to serialize/deserialize the metadata.
The data processing pipeline contains the following steps:
- Read data from file into a list of raw data examples.
- Convert each row of raw data to a TorchText Example. This logic happens in the process_row function and will:
- Invoke the featurizer, which contains data processing steps to apply at both training and inference time, e.g. tokenization.
- Use the raw data and the results from the featurizer to do any preprocessing.
- Generate a TorchText.Dataset that contains the list of Examples. The Dataset also has a list of TorchText.Field, which defines how to do padding and numericalization while batching data.
- Return a BatchIterator which will give a tuple of (input, target, context) tensors for each iteration. By default the tensors have a 1:1 mapping to the TorchText.Field fields, but this behavior can be overridden by the _input_from_batch, _target_from_batch, and _context_from_batch functions.
-
raw_columns
¶ columns to read from data source. The order should match the data stored in that file.
Type: List[str]
-
featurizer
¶ perform data preprocessing that should be shared between training and inference
Type: Featurizer
-
features
¶ a dict of name -> field that is used to process data as model input
Type: Dict[str, Field]
-
labels
¶ a dict of name -> field that is used to process data as training target
Type: Dict[str, Field]
-
extra_fields
¶ fields that process any extra data used neither as model input nor target. This is None by default
Type: Dict[str, Field]
-
text_feature_name
¶ name of the text field, used to define the default sort key of data
Type: str
-
shuffle
¶ if the dataset should be shuffled, true by default
Type: bool
-
sort_within_batch
¶ if data within same batch should be sorted, true by default
Type: bool
-
train_path
¶ path of training data file
Type: str
-
eval_path
¶ path of evaluation data file
Type: str
-
test_path
¶ path of test data file
Type: str
-
train_batch_size
¶ training batch size, 128 by default
Type: int
-
eval_batch_size
¶ evaluation batch size, 128 by default
Type: int
-
test_batch_size
¶ test batch size, 128 by default
Type: int
-
max_seq_len
¶ maximum length of tokens to keep in sequence
Type: int
-
pass_index
¶ if the original index of data in the batch should be passed along to downstream steps, default is true
Type: bool
-
gen_dataset
(data: Iterable[Dict[str, Any]], include_label_fields: bool = True, shard_range: Tuple[int, int] = None) → torchtext.legacy.data.dataset.Dataset[source]¶ Generate a torchtext Dataset from raw in-memory data.
Returns: dataset (TorchText.Dataset)
-
gen_dataset_from_path
(path: str, rank: int = 0, world_size: int = 1, include_label_fields: bool = True, use_cache: bool = True) → torchtext.legacy.data.dataset.Dataset[source]¶ Generate a dataset from file.
Returns: dataset (TorchText.Dataset)
-
get_test_iter_from_path
(test_path: str, batch_size: int) → pytext.data.data_handler.BatchIterator[source]¶
-
get_test_iter_from_raw_data
(test_data: List[Dict[str, Any]], batch_size: int) → pytext.data.data_handler.BatchIterator[source]¶
-
get_train_iter_from_path
(train_path: str, batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]¶ Generate data batch iterator for training data. See _get_train_iter() for details
Parameters: - train_path (str) – file path of training data
- batch_size (int) – batch size
- rank (int) – used for distributed training, the rank of the current GPU; don’t set it to anything but 0 for non-distributed training
- world_size (int) – used for distributed training, total number of GPUs
-
get_train_iter_from_raw_data
(train_data: List[Dict[str, Any]], batch_size: int, rank: int = 0, world_size: int = 1) → pytext.data.data_handler.BatchIterator[source]¶
-
init_feature_metadata
(train_data: torchtext.legacy.data.dataset.Dataset, eval_data: torchtext.legacy.data.dataset.Dataset, test_data: torchtext.legacy.data.dataset.Dataset)[source]¶
-
init_metadata_from_path
(train_path, eval_path, test_path)[source]¶ Initialize metadata using data from file
-
init_target_metadata
(train_data: torchtext.legacy.data.dataset.Dataset, eval_data: torchtext.legacy.data.dataset.Dataset, test_data: torchtext.legacy.data.dataset.Dataset)[source]¶
-
load_metadata
(metadata: pytext.data.data_handler.CommonMetadata)[source]¶ Load previously saved metadata
-
load_vocab
(vocab_file, vocab_size, lowercase_tokens: bool = False)[source]¶ Loads items into a set from a file containing one item per line. Items are added to the set from top of the file to bottom. So, the items in the file should be ordered by a preference (if any), e.g., it makes sense to order tokens in descending order of frequency in corpus.
Parameters: - vocab_file (str) – vocab file to load
- vocab_size (int) – maximum tokens to load, will only load the first n if the actual vocab size is larger than this parameter
- lowercase_tokens (bool) – if the tokens should be lowercased
-
preprocess
(data: Iterable[Dict[str, Any]])[source]¶ Preprocess the raw data to create TorchText.Example. This is the second step in the whole processing pipeline.
Returns: data (Generator[Dict[str, Any]])
-
preprocess_row
(row_data: Dict[str, Any]) → Dict[str, Any][source]¶ preprocess steps for a single input row, sub class should override it
-
read_from_file
(file_name: str, columns_to_use: Union[Dict[str, int], List[str]]) → Generator[Dict[KT, VT], None, None][source]¶ Read data from a CSV file. The input file is required to have tab-separated columns.
Parameters: - file_name (str) – csv file name
- columns_to_use (Union[Dict[str, int], List[str]]) – either a list of column names or a dict of column name -> column index in the file
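The two accepted forms of `columns_to_use` can be illustrated with the standard `csv` module. This is a sketch under assumptions, not the method's actual code: `read_tsv_rows` is a hypothetical name, it takes an open text handle instead of `file_name`, and it assumes a list of names maps to columns by position.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO, Union


def read_tsv_rows(
    handle: TextIO, columns_to_use: Union[Dict[str, int], List[str]]
) -> Iterator[Dict[str, str]]:
    """Yield one dict per tab-separated line, keyed by column name."""
    if isinstance(columns_to_use, dict):
        index = dict(columns_to_use)
    else:
        # Assumption: names in a list correspond to columns 0, 1, 2, ...
        index = {name: i for i, name in enumerate(columns_to_use)}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        yield {name: fields[i] for name, i in index.items() if i < len(fields)}


sample_tsv = "greeting\thello there\textra\nfarewell\tsee you\textra\n"
rows_by_list = list(read_tsv_rows(io.StringIO(sample_tsv), ["label", "text"]))
rows_by_dict = list(read_tsv_rows(io.StringIO(sample_tsv), {"text": 1}))
```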
-
class
pytext.data.
DisjointMultitaskData
(data_dict: Dict[str, pytext.data.data.Data], samplers: Dict[pytext.common.constants.Stage, pytext.data.batch_sampler.BaseBatchSampler], test_key: str = None, task_key: str = 'task_name')[source]¶ Bases:
pytext.data.data.Data
Wrapper for doing multitask training using multiple data objects. Takes a dictionary of data objects, does round robin over their iterators using BatchSampler.
Parameters: - config (Config) – Configuration object of type DisjointMultitaskData.Config.
- data_dict (Dict[str, Data]) – Data objects to do roundrobin over.
- *args (type) – Extra arguments to be passed down to sub data handlers.
- **kwargs (type) – Extra arguments to be passed down to sub data handlers.
-
data_dict
¶ Data handlers to do roundrobin over.
Type: type
-
class
pytext.data.
DisjointMultitaskDataHandler
(config: pytext.data.disjoint_multitask_data_handler.DisjointMultitaskDataHandler.Config, data_handlers: Dict[str, pytext.data.data_handler.DataHandler], target_task_name: Optional[str] = None, *args, **kwargs)[source]¶ Bases:
pytext.data.data_handler.DataHandler
Wrapper for doing multitask training using multiple data handlers. Takes a dictionary of data handlers, does round robin over their iterators using RoundRobinBatchIterator.
Parameters: - config (Config) – Configuration object of type DisjointMultitaskDataHandler.Config.
- data_handlers (Dict[str, DataHandler]) – Data handlers to do roundrobin over.
- target_task_name (Optional[str]) – Used to select best epoch, and set batch_per_epoch.
- *args (type) – Extra arguments to be passed down to sub data handlers.
- **kwargs (type) – Extra arguments to be passed down to sub data handlers.
-
data_handlers
¶ Data handlers to do roundrobin over.
Type: type
-
target_task_name
¶ Used to select best epoch, and set batch_per_epoch.
Type: type
-
upsample
¶ If True, keep cycling over each iterator in round-robin. Iterators with fewer batches will get more passes. If False, we do a single pass over each iterator, and the ones which run out will sit idle. This is used for evaluation. Default True.
Type: bool
-
class
pytext.data.
DynamicPoolingBatcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=1000, num_shuffled_pools=1, scheduler_config=<pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig object>)[source]¶ Bases:
pytext.data.data.PoolingBatcher
Allows dynamic batch training. Extends the pooling batcher with a scheduler config, which specifies how the batch size should increase.
-
batchify
(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]¶ From an iterable of dicts, yield dicts of lists:
- Load num_shuffled_pools pools of data, and shuffle them.
- Load a pool (batch_size * pool_num_batches examples).
- Sort rows, if necessary.
- Shuffle the order in which the batches are returned, if necessary.
-
compute_dynamic_batch_size
(curr_epoch: int, scheduler_config: pytext.data.dynamic_pooling_batcher.BatcherSchedulerConfig, curr_steps: int) → int[source]¶
-
-
class
pytext.data.
EvalBatchSampler
[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes in a dictionary of Iterators and returns batches associated with each key in the dictionary. It guarantees that we will see each batch associated with each key exactly once in the epoch.
Example
Iterator 1: [A, B, C, D], Iterator 2: [a, b]
Output: [A, B, C, D, a, b]
-
pytext.data.
generator_iterator
(fn)[source]¶ Turn a generator into a GeneratorIterator-wrapped function. Effectively this allows iterating over a generator multiple times by recording the call arguments and calling the generator with them anew each time __iter__ is called on the returned object.
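The idea of recording call arguments and re-invoking the generator can be sketched as follows. The names `ReplayableGenerator` and `replayable` are hypothetical stand-ins; this is an illustration of the technique, not PyText's GeneratorIterator.

```python
import functools


class ReplayableGenerator:
    """Remembers a generator function and its arguments."""

    def __init__(self, fn, *args, **kwargs):
        self.fn = fn
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        # Each iteration starts a fresh generator with the recorded arguments.
        return self.fn(*self.args, **self.kwargs)


def replayable(fn):
    """Decorator: calling fn returns an object that can be iterated repeatedly."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return ReplayableGenerator(fn, *args, **kwargs)

    return wrapper


@replayable
def count_up(n):
    for i in range(n):
        yield i


numbers = count_up(3)
first_pass = list(numbers)
second_pass = list(numbers)  # a plain generator would be exhausted here
```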
-
class
pytext.data.
PoolingBatcher
(train_batch_size=16, eval_batch_size=16, test_batch_size=16, pool_num_batches=1000, num_shuffled_pools=1)[source]¶ Bases:
pytext.data.data.Batcher
Batcher that shuffles and (if requested) sorts data.
Rationale
There is a trade-off between having batches of data that are truly randomly shuffled, and batches of data that are efficiently padded. If we wanted to maximise the efficiency of padding (i.e. minimise the amount of padding that is needed), we would have to enforce that all inputs of a similar length appear in the same batch. This however would lead to a dramatic decrease in the randomness of batches. On the other end of the spectrum, if we wanted to maximise randomness, we would often end up with inputs of wildly different lengths in the same batch, which would lead to a lot of padding.
Operation
This batcher uses a multi-staged approach.
- It first loads a number of “pools” of data, and shuffles them (this is controlled by num_shuffled_pools).
- It then splits up the shuffled data sequentially into individual pools, and the examples within each pool are sorted (if requested).
- Finally, each pool is split up sequentially into batches, and yielded. If sorting was requested in step #2, the order in which the batches are yielded is randomised.
The size of a pool is expressed as a multiple of the batch size, and is controlled by pool_num_batches.
Examples
Assuming sorting is enabled, with the default settings of pool_num_batches: 1000 and num_shuffled_pools: 1, a pool of 1k * batch_size examples is loaded, sorted by length, and split up into 1k batches. These batches are then yielded in random order. Once they run out, a new pool is loaded, and the process is repeated. An advantage of this approach is that padding will be somewhat reduced. A disadvantage is that, for every epoch, the first 1k batches will be always the same (albeit in a different order).
On the other hand, specifying pool_num_batches: 1000 and num_shuffled_pools: 1000 would achieve the following: 1k * 1k * batch_size examples are loaded, and shuffled. These are then split up into pools of size 1k * batch_size, which are then sorted internally, split into individual batches, and yielded in random order. Compared to the previous example, we no longer have the problem that the first 1k batches are always the same in each epoch, but we’ve had to load in memory 1M examples.
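The multi-staged approach can be sketched as a generator over an arbitrary iterable of rows. This is a simplified illustration, not the batcher's code: `pooled_batches` and its `rng` parameter are hypothetical, and rows are plain values rather than RawExample objects.

```python
import itertools
import random
from typing import Any, Callable, Iterable, Iterator, List, Optional


def pooled_batches(
    rows: Iterable[Any],
    batch_size: int,
    pool_num_batches: int = 1000,
    num_shuffled_pools: int = 1,
    sort_key: Optional[Callable[[Any], Any]] = None,
    rng: Optional[random.Random] = None,
) -> Iterator[List[Any]]:
    if rng is None:
        rng = random.Random()
    pool_size = batch_size * pool_num_batches
    it = iter(rows)
    while True:
        # Step 1: load num_shuffled_pools pools of data and shuffle them.
        loaded = list(itertools.islice(it, pool_size * num_shuffled_pools))
        if not loaded:
            return
        rng.shuffle(loaded)
        # Step 2: split sequentially into pools; sort each pool if requested.
        for start in range(0, len(loaded), pool_size):
            pool = loaded[start : start + pool_size]
            if sort_key is not None:
                pool.sort(key=sort_key)
            # Step 3: split the pool into batches; randomise their order
            # only when sorting was requested.
            batches = [
                pool[i : i + batch_size] for i in range(0, len(pool), batch_size)
            ]
            if sort_key is not None:
                rng.shuffle(batches)
            yield from batches
```

Because at most `pool_size * num_shuffled_pools` rows are held at once, the memory trade-off described in the examples above follows directly from those two parameters.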
-
batchify
(iterable: Iterable[pytext.data.sources.data_source.RawExample], sort_key=None, stage=<Stage.TRAIN: 'Training'>)[source]¶ From an iterable of dicts, yield dicts of lists:
- Load num_shuffled_pools pools of data, and shuffle them.
- Load a pool (batch_size * pool_num_batches examples).
- Sort rows, if necessary.
- Shuffle the order in which the batches are returned, if necessary.
-
class
pytext.data.
RandomizedBatchSampler
(unnormalized_iterator_probs: Dict[str, float])[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes in a dictionary of iterators and returns batches according to the specified probabilities by unnormalized_iterator_probs. We cycle through the iterators (restarting any that “run out”) indefinitely. Set batches_per_epoch in Trainer.Config.
Example
Iterator A: [A, B, C, D], Iterator B: [a, b]
batches_per_epoch = 3, unnormalized_iterator_probs = {“A”: 0, “B”: 1} Epoch 1 = [a, b, a] Epoch 2 = [b, a, b]
Parameters: unnormalized_iterator_probs (Dict[str, float]) – Iterator sampling probabilities. The keys should be the same as the keys of the underlying iterators, and the values will be normalized to sum to 1.
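The sampling behavior in the example can be reproduced with `random.choices`, which accepts unnormalized weights. The function `randomized_batches` is a hypothetical sketch; it assumes every input is non-empty and can be iterated more than once (for example a list), so that an exhausted iterator can be restarted.

```python
import random
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def randomized_batches(
    iterables: Dict[str, Iterable[Any]],
    unnormalized_probs: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[str, Any]]:
    if rng is None:
        rng = random.Random()
    names = list(iterables)
    weights = [unnormalized_probs[name] for name in names]
    iterators = {name: iter(iterables[name]) for name in names}
    while True:  # endless stream; the caller decides the epoch length
        # random.choices treats weights as relative, so no normalization needed.
        name = rng.choices(names, weights=weights)[0]
        try:
            batch = next(iterators[name])
        except StopIteration:
            # Restart an iterator that ran out (assumes it is non-empty).
            iterators[name] = iter(iterables[name])
            batch = next(iterators[name])
        yield name, batch


stream = randomized_batches(
    {"A": ["A", "B", "C", "D"], "B": ["a", "b"]}, {"A": 0, "B": 1}
)
epoch_1 = [batch for _, batch in islice(stream, 3)]
epoch_2 = [batch for _, batch in islice(stream, 3)]
```

Taking three batches at a time plays the role of `batches_per_epoch = 3` in the example.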
-
class
pytext.data.
RoundRobinBatchSampler
(iter_to_set_epoch: Optional[str] = None)[source]¶ Bases:
pytext.data.batch_sampler.BaseBatchSampler
This sampler takes a dictionary of Iterators and returns batches in a round robin fashion until the end of one of the iterators is reached. The end is specified by iter_to_set_epoch.
If iter_to_set_epoch is set, cycle batches from each iterator until one epoch of the target iterator is fulfilled. Iterators with fewer batches than the target iterator are repeated, so they never run out.
If iter_to_set_epoch is None, cycle over batches from each iterator until the shortest iterator completes one epoch.
Example
Iterator 1: [A, B, C, D], Iterator 2: [a, b]
iter_to_set_epoch = “Iterator 1” Output: [A, a, B, b, C, a, D, b]
iter_to_set_epoch = None Output: [A, a, B, b]
Parameters: iter_to_set_epoch (Optional[str]) – Name of iterator to define epoch size. If this is not set, epoch size defaults to the length of the shortest iterator.
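A small generator reproduces both outputs from the example. It is a sketch of the described behavior rather than the sampler's implementation: `round_robin_batches` is a hypothetical name, and the inputs are assumed to be non-empty and re-iterable.

```python
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


def round_robin_batches(
    iterables: Dict[str, Iterable[Any]], iter_to_set_epoch: Optional[str] = None
) -> Iterator[Tuple[str, Any]]:
    iterators = {name: iter(it) for name, it in iterables.items()}
    while True:
        # Gather one batch per iterator before yielding, so a round that
        # hits the end of the epoch is dropped as a whole.
        this_round = []
        for name, source in iterables.items():
            try:
                batch = next(iterators[name])
            except StopIteration:
                if iter_to_set_epoch is None or name == iter_to_set_epoch:
                    return  # the epoch-defining iterator is finished
                # Other iterators are repeated so they never run out.
                iterators[name] = iter(source)
                batch = next(iterators[name])
            this_round.append((name, batch))
        yield from this_round


example = {"Iterator 1": ["A", "B", "C", "D"], "Iterator 2": ["a", "b"]}
with_target = [b for _, b in round_robin_batches(example, "Iterator 1")]
without_target = [b for _, b in round_robin_batches(example)]
```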
-
class
pytext.data.
NaturalBatchSampler
(dataset_counts: Dict[str, int])[source]¶ Bases:
pytext.data.batch_sampler.RandomizedBatchSampler
This sampler iterates over all the datasets, sampling according to the weighted number of samples in each dataset.
-
class
pytext.data.
Tensorizer
(is_input: bool = True)[source]¶ Bases:
pytext.config.component.Component
Tensorizers are components that convert batches of pytext.data.type.DataType instances to tensors. These tensors will eventually be inputs to the model, but the model is aware of the tensorizers and can arrange the tensors they create to conform to its model.
Tensorizers have an initialize function. This function allows the tensorizer to read through the training dataset to build up any data that it needs for creating the model. Commonly this is valuable for things like inferring a vocabulary from the training set, or learning the entire set of training labels, or slot labels, etc.
-
column_schema
¶ Generic types don’t pickle well pre-3.7, so we don’t actually want to store the schema as an attribute. We’re already storing all of the columns anyway, so until there’s a better solution, schema is a property.
-
initialize
(from_scratch=True)[source]¶ The initialize function is carefully designed to allow us to read through the training dataset only once, and not store it in memory. As such, it can’t itself manually iterate over the data source. Instead, the initialize function is a coroutine, which is sent row data. This should look roughly like:
# set up variables here
...
try:
    # start reading through data source
    while True:
        # row has type Dict[str, types.DataType]
        row = yield
        # update any variables, vocabularies, etc.
        ...
except GeneratorExit:
    # finalize your initialization, set instance variables, etc.
    ...
See WordTokenizer.initialize for a more concrete example.
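For readers unfamiliar with generator-based coroutines, the following standalone sketch shows how such an initializer might be driven from the outside. It does not use PyText: `vocab_initializer`, the `collected` dict, and the sample rows are all made up for illustration.

```python
from collections import Counter


def vocab_initializer(column, result):
    """Coroutine that counts whitespace-separated tokens in one column."""
    counts = Counter()
    try:
        while True:
            row = yield  # receives one row per send()
            counts.update(row[column].split())
    except GeneratorExit:
        # Raised by close(): finalize and publish the collected state.
        result.update(counts)


rows = [{"text": "hello world"}, {"text": "hello again"}]
collected = {}

init = vocab_initializer("text", collected)
next(init)  # advance to the first yield so the coroutine can accept rows
for row in rows:
    init.send(row)  # single pass; rows are never stored by the driver
init.close()  # triggers GeneratorExit inside the coroutine
```

The driver reads the data source once and forwards each row, which is what lets the initializer avoid holding the dataset in memory.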
-
tensorizer_script_impl
= None¶
-