pytext.fields package

Submodules

pytext.fields.char_field module

class pytext.fields.char_field.CharFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, max_word_length=20, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • Remaining keyword arguments – Passed to the constructor of Vocab.
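As a rough, hedged sketch of what a character-level vocabulary amounts to (plain Python, not PyText's actual implementation; the special tokens mirror the field's pad_token/unk_token defaults above):

from collections import Counter

# Hypothetical tokenized column for this field: one list of tokens per example.
examples = [["play", "that", "track"], ["on", "repeat"]]

# Count characters across all tokens; keep those meeting min_freq,
# most frequent first, after the special tokens.
counter = Counter(ch for tokens in examples for token in tokens for ch in token)
min_freq = 1
itos = ["<pad>", "<unk>"] + [c for c, n in counter.most_common() if n >= min_freq]
stoi = {c: i for i, c in enumerate(itos)}

print(stoi["<pad>"], stoi["<unk>"])  # 0 1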
dummy_model_input = tensor([[[1, 1, 1]], [[1, 1, 1]]])
numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
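A minimal, hedged sketch of what this numericalization produces for a padded character batch (the stoi mapping below is hypothetical; the real field also handles lengths and nested batching):

import torch

stoi = {"<pad>": 0, "<unk>": 1, "o": 2, "n": 3}
padded = [["o", "n", "<pad>", "<pad>"]]

# Map each character to its vocab index, falling back to the unk index,
# then build a tensor on the requested device (the CPU when device is None).
indices = [[stoi.get(ch, stoi["<unk>"]) for ch in word] for word in padded]
tensor = torch.tensor(indices, dtype=torch.long, device=None)
print(tensor)  # tensor([[2, 3, 0, 0]])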
pad(minibatch: List[List[List[str]]]) → List[List[List[str]]][source]

Example of minibatch:

[[['p', 'l', 'a', 'y', '<PAD>', '<PAD>'],
  ['t', 'h', 'a', 't', '<PAD>', '<PAD>'],
  ['t', 'r', 'a', 'c', 'k', '<PAD>'],
  ['o', 'n', '<PAD>', '<PAD>', '<PAD>', '<PAD>'],
  ['r', 'e', 'p', 'e', 'a', 't']
 ], ...
]
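A minimal sketch of the character-level padding shown above (assumed helper names, plain Python rather than PyText's actual code): every word is padded to the longest word in the batch, and every sentence to the longest sentence.

def pad_chars(minibatch, pad_token="<pad>"):
    max_word_len = max(len(word) for sent in minibatch for word in sent)
    max_sent_len = max(len(sent) for sent in minibatch)
    padded = []
    for sent in minibatch:
        # Pad each word with pad_token, then pad the sentence with all-pad words.
        sent = [word + [pad_token] * (max_word_len - len(word)) for word in sent]
        sent += [[pad_token] * max_word_len] * (max_sent_len - len(sent))
        padded.append(sent)
    return padded

batch = [[list("play"), list("that"), list("track"), list("on"), list("repeat")]]
print(pad_chars(batch)[0][3])  # ['o', 'n', '<pad>', '<pad>', '<pad>', '<pad>']

The actual field additionally receives a max_word_length argument (see the constructor above), which this sketch ignores.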

pytext.fields.contextual_token_embedding_field module

class pytext.fields.contextual_token_embedding_field.ContextualTokenEmbeddingField(**kwargs)[source]

Bases: pytext.fields.field.Field

numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
pad(minibatch: List[List[List[float]]]) → List[List[List[float]]][source]

Example of padded minibatch:

[[[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [3.1, 3.2, 3.3, 3.4, 3.5],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
]
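A hedged sketch of the zero-vector padding shown above (not PyText's exact code): shorter examples are extended with all-zero embedding vectors so every example in the batch has the same number of token embeddings.

def pad_embeddings(minibatch):
    embed_dim = len(minibatch[0][0])
    max_len = max(len(example) for example in minibatch)
    return [
        example + [[0.0] * embed_dim] * (max_len - len(example))
        for example in minibatch
    ]

batch = [
    [[0.1] * 5, [1.1] * 5, [2.1] * 5, [3.1] * 5],
    [[0.1] * 5, [1.1] * 5],
]
print(len(pad_embeddings(batch)[1]))  # 4 rows: two real vectors plus two zero vectors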

pytext.fields.dict_field module

class pytext.fields.dict_field.DictFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, left_pad=False, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • Remaining keyword arguments – Passed to the constructor of Vocab.
dummy_model_input = (tensor([[1], [1]]), tensor([[1.5000], [2.5000]]), tensor([[1], [1]]))
numericalize(arr, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
pad(minibatch: List[Tuple[List[int], List[float], List[int]]]) → Tuple[List[List[int]], List[List[float]], List[int]][source]

Pad a batch of examples using this field.

Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.
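As a hedged sketch of the generic behaviour this docstring describes (assumed names; real fields also honour fix_length, init_token and eos_token), padding to the longest example looks roughly like this:

def pad_to_longest(minibatch, pad_token="<pad>", include_lengths=True):
    max_len = max(len(example) for example in minibatch)
    lengths = [len(example) for example in minibatch]
    padded = [example + [pad_token] * (max_len - len(example)) for example in minibatch]
    return (padded, lengths) if include_lengths else padded

print(pad_to_longest([["on", "repeat"], ["play"]]))
# ([['on', 'repeat'], ['play', '<pad>']], [2, 1])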

pytext.fields.field module

class pytext.fields.field.ActionField(**kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

class pytext.fields.field.DocLabelField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.field.Field(*args, **kwargs)[source]

Bases: torchtext.legacy.data.field.Field

classmethod from_config(config)[source]
get_meta() → pytext.fields.field.FieldMeta[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
pad_length(n)[source]

Override to make the pad length a multiple of 8, to support fp16 training.
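A hedged sketch of that rounding (assuming it rounds up; fp16 kernels generally prefer dimensions that are multiples of 8):

def pad_length(n: int) -> int:
    # Round n up to the next multiple of 8 (assumed behaviour of the override).
    return n if n % 8 == 0 else n + (8 - n % 8)

print([pad_length(n) for n in (1, 7, 8, 13)])  # [8, 8, 8, 16]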

class pytext.fields.field.FieldMeta[source]

Bases: object

class pytext.fields.field.FloatField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.field.FloatVectorField(dim=0, dim_error_check=False, **kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.field.NestedField(*args, **kwargs)[source]

Bases: pytext.fields.field.Field, torchtext.legacy.data.field.NestedField

get_meta()[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
class pytext.fields.field.RawField(*args, is_target=False, **kwargs)[source]

Bases: torchtext.legacy.data.field.RawField

get_meta() → pytext.fields.field.FieldMeta[source]
class pytext.fields.field.SeqFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, pad_token='<pad_seq>', init_token=None, eos_token=None, tokenize=<function no_tokenize>, nesting_field=None, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingNestedField

dummy_model_input = tensor([[[1]], [[1]]])
class pytext.fields.field.TextFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, batch_first=True, sequential=True, pad_token='<pad>', unk_token='<unk>', init_token=None, eos_token=None, lower=False, tokenize=<function no_tokenize>, fix_length=None, pad_first=None, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

dummy_model_input = tensor([[1], [1]])
class pytext.fields.field.VocabUsingField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.Field

Base class for all fields that need to build a vocabulary.

class pytext.fields.field.VocabUsingNestedField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField, pytext.fields.field.NestedField

Base class for all nested fields that need to build a vocabulary.

class pytext.fields.field.WordLabelField(use_bio_labels, **kwargs)[source]

Bases: pytext.fields.field.Field

get_meta()[source]
pytext.fields.field.create_fields(fields_config, field_cls_dict)[source]
pytext.fields.field.create_label_fields(label_configs, label_cls_dict)[source]

pytext.fields.text_field_with_special_unk module

class pytext.fields.text_field_with_special_unk.TextFeatureFieldWithSpecialUnk(*args, unkify_func=<function unkify>, **kwargs)[source]

Bases: pytext.fields.field.TextFeatureField

build_vocab(*args, min_freq=1, **kwargs)[source]

The code is identical to torchtext.legacy.data.Field.build_vocab() up to the UNKification logic. super().build_vocab() cannot be called because the Counter object computed inside torchtext.legacy.data.Field.build_vocab() is required for UNKification, and that object cannot be recovered after the super().build_vocab() call is made.

numericalize(arr: Union[List[List[str]], Tuple[List[List[str]], List[int]]], device: Union[str, torch.device, None] = None)[source]

The code is identical to torchtext.legacy.data.Field.numericalize() except that it calls self._get_idx(x) instead of self.vocab.stoi[x] to look up the index of an item in the vocab. This is needed because torchtext does not allow custom UNKification, so the TextFeatureFieldWithSpecialUnk constructor accepts a function unkify_func() that is used to UNKify tokens instead of assigning all UNKs a single default value.
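As a hedged illustration of the unkify_func idea (the function and vocabulary below are hypothetical, not PyText's defaults): rare tokens are mapped to "shaped" UNK strings instead of a single default unk_token, and index lookup falls back to those.

def my_unkify(token: str) -> str:
    # Hypothetical UNKification: pick a shaped UNK based on the token's surface form.
    if token.isdigit():
        return "<unk>-num"
    if token[:1].isupper():
        return "<unk>-cap"
    return "<unk>"

vocab_stoi = {"<pad>": 0, "<unk>": 1, "<unk>-num": 2, "<unk>-cap": 3, "play": 4}

def get_idx(token: str) -> int:
    # Known token -> its own index; otherwise the index of its shaped UNK.
    return vocab_stoi.get(token, vocab_stoi.get(my_unkify(token), vocab_stoi["<unk>"]))

print([get_idx(t) for t in ["play", "Berlin", "42", "zzz"]])  # [4, 3, 2, 1]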

Module contents

pytext.fields.create_fields(fields_config, field_cls_dict)[source]
pytext.fields.create_label_fields(label_configs, label_cls_dict)[source]
class pytext.fields.ActionField(**kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

class pytext.fields.CharFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, max_word_length=20, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • Remaining keyword arguments – Passed to the constructor of Vocab.
dummy_model_input = tensor([[[1, 1, 1]], [[1, 1, 1]]])
numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
pad(minibatch: List[List[List[str]]]) → List[List[List[str]]][source]

Example of minibatch:

[[['p', 'l', 'a', 'y', '<PAD>', '<PAD>'],
  ['t', 'h', 'a', 't', '<PAD>', '<PAD>'],
  ['t', 'r', 'a', 'c', 'k', '<PAD>'],
  ['o', 'n', '<PAD>', '<PAD>', '<PAD>', '<PAD>'],
  ['r', 'e', 'p', 'e', 'a', 't']
 ], ...
]
class pytext.fields.ContextualTokenEmbeddingField(**kwargs)[source]

Bases: pytext.fields.field.Field

numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
pad(minibatch: List[List[List[float]]]) → List[List[List[float]]][source]

Example of padded minibatch:

[[[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [3.1, 3.2, 3.3, 3.4, 3.5],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
]
class pytext.fields.DictFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, left_pad=False, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • Remaining keyword arguments – Passed to the constructor of Vocab.
dummy_model_input = (tensor([[1], [1]]), tensor([[1.5000], [2.5000]]), tensor([[1], [1]]))
numericalize(arr, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
pad(minibatch: List[Tuple[List[int], List[float], List[int]]]) → Tuple[List[List[int]], List[List[float]], List[int]][source]

Pad a batch of examples using this field.

Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.

class pytext.fields.DocLabelField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.Field(*args, **kwargs)[source]

Bases: torchtext.legacy.data.field.Field

classmethod from_config(config)[source]
get_meta() → pytext.fields.field.FieldMeta[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
pad_length(n)[source]

Override to make the pad length a multiple of 8, to support fp16 training.

class pytext.fields.FieldMeta[source]

Bases: object

class pytext.fields.FloatField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.FloatVectorField(dim=0, dim_error_check=False, **kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.RawField(*args, is_target=False, **kwargs)[source]

Bases: torchtext.legacy.data.field.RawField

get_meta() → pytext.fields.field.FieldMeta[source]
class pytext.fields.TextFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, batch_first=True, sequential=True, pad_token='<pad>', unk_token='<unk>', init_token=None, eos_token=None, lower=False, tokenize=<function no_tokenize>, fix_length=None, pad_first=None, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

dummy_model_input = tensor([[1], [1]])
class pytext.fields.VocabUsingField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.Field

Base class for all fields that need to build a vocabulary.

class pytext.fields.WordLabelField(use_bio_labels, **kwargs)[source]

Bases: pytext.fields.field.Field

get_meta()[source]
class pytext.fields.NestedField(*args, **kwargs)[source]

Bases: pytext.fields.field.Field, torchtext.legacy.data.field.NestedField

get_meta()[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
class pytext.fields.VocabUsingNestedField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField, pytext.fields.field.NestedField

Base class for all nested fields that need to build a vocabulary.

class pytext.fields.SeqFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, pad_token='<pad_seq>', init_token=None, eos_token=None, tokenize=<function no_tokenize>, nesting_field=None, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingNestedField

dummy_model_input = tensor([[[1]], [[1]]])
class pytext.fields.TextFeatureFieldWithSpecialUnk(*args, unkify_func=<function unkify>, **kwargs)[source]

Bases: pytext.fields.field.TextFeatureField

build_vocab(*args, min_freq=1, **kwargs)[source]

The code is identical to torchtext.legacy.data.Field.build_vocab() up to the UNKification logic. super().build_vocab() cannot be called because the Counter object computed inside torchtext.legacy.data.Field.build_vocab() is required for UNKification, and that object cannot be recovered after the super().build_vocab() call is made.

numericalize(arr: Union[List[List[str]], Tuple[List[List[str]], List[int]]], device: Union[str, torch.device, None] = None)[source]

The code is identical to torchtext.legacy.data.Field.numericalize() except that it calls self._get_idx(x) instead of self.vocab.stoi[x] to look up the index of an item in the vocab. This is needed because torchtext does not allow custom UNKification, so the TextFeatureFieldWithSpecialUnk constructor accepts a function unkify_func() that is used to UNKify tokens instead of assigning all UNKs a single default value.