pytext.fields package

Submodules

pytext.fields.char_field module

class pytext.fields.char_field.CharFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, max_word_length=20, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • Remaining keyword arguments – Passed to the constructor of Vocab.
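As a rough, hedged sketch of what a character-level vocabulary amounts to (plain Python, not PyText's actual implementation; the special tokens mirror the field's pad_token/unk_token defaults above):

from collections import Counter

# Hypothetical tokenized column for this field: one list of tokens per example.
examples = [["play", "that", "track"], ["on", "repeat"]]

# Count characters across all tokens; keep those meeting min_freq,
# most frequent first, after the special tokens.
counter = Counter(ch for tokens in examples for token in tokens for ch in token)
min_freq = 1
itos = ["<pad>", "<unk>"] + [c for c, n in counter.most_common() if n >= min_freq]
stoi = {c: i for i, c in enumerate(itos)}

print(stoi["<pad>"], stoi["<unk>"])  # 0 1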
dummy_model_input = tensor([[[1, 1, 1]], [[1, 1, 1]]])
numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
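A minimal, hedged sketch of what this numericalization produces for a padded character batch (the stoi mapping below is hypothetical; the real field also handles lengths and nested batching):

import torch

stoi = {"<pad>": 0, "<unk>": 1, "o": 2, "n": 3}
padded = [["o", "n", "<pad>", "<pad>"]]

# Map each character to its vocab index, falling back to the unk index,
# then build a tensor on the requested device (the CPU when device is None).
indices = [[stoi.get(ch, stoi["<unk>"]) for ch in word] for word in padded]
tensor = torch.tensor(indices, dtype=torch.long, device=None)
print(tensor)  # tensor([[2, 3, 0, 0]])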
pad(minibatch: List[List[List[str]]]) → List[List[List[str]]][source]

Example of minibatch:

[[['p', 'l', 'a', 'y', '<PAD>', '<PAD>'],
  ['t', 'h', 'a', 't', '<PAD>', '<PAD>'],
  ['t', 'r', 'a', 'c', 'k', '<PAD>'],
  ['o', 'n', '<PAD>', '<PAD>', '<PAD>', '<PAD>'],
  ['r', 'e', 'p', 'e', 'a', 't']
 ], ...
]
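A minimal sketch of the character-level padding shown above (assumed helper names, plain Python rather than PyText's actual code): every word is padded to the longest word in the batch, and every sentence to the longest sentence.

def pad_chars(minibatch, pad_token="<pad>"):
    max_word_len = max(len(word) for sent in minibatch for word in sent)
    max_sent_len = max(len(sent) for sent in minibatch)
    padded = []
    for sent in minibatch:
        # Pad each word with pad_token, then pad the sentence with all-pad words.
        sent = [word + [pad_token] * (max_word_len - len(word)) for word in sent]
        sent += [[pad_token] * max_word_len] * (max_sent_len - len(sent))
        padded.append(sent)
    return padded

batch = [[list("play"), list("that"), list("track"), list("on"), list("repeat")]]
print(pad_chars(batch)[0][3])  # ['o', 'n', '<pad>', '<pad>', '<pad>', '<pad>']

The actual field additionally receives a max_word_length argument (see the constructor above), which this sketch ignores.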

pytext.fields.contextual_token_embedding_field module

class pytext.fields.contextual_token_embedding_field.ContextualTokenEmbeddingField(**kwargs)[source]

Bases: pytext.fields.field.Field

numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
pad(minibatch: List[List[List[float]]]) → List[List[List[float]]][source]

Example of padded minibatch:

[[[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [3.1, 3.2, 3.3, 3.4, 3.5],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
]
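A hedged sketch of the zero-vector padding shown above (not PyText's exact code): shorter examples are extended with all-zero embedding vectors so every example in the batch has the same number of token embeddings.

def pad_embeddings(minibatch):
    embed_dim = len(minibatch[0][0])
    max_len = max(len(example) for example in minibatch)
    return [
        example + [[0.0] * embed_dim] * (max_len - len(example))
        for example in minibatch
    ]

batch = [
    [[0.1] * 5, [1.1] * 5, [2.1] * 5, [3.1] * 5],
    [[0.1] * 5, [1.1] * 5],
]
print(len(pad_embeddings(batch)[1]))  # 4 rows: two real vectors plus two zero vectors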

pytext.fields.dict_field module

class pytext.fields.dict_field.DictFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, left_pad=False, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • Remaining keyword arguments – Passed to the constructor of Vocab.
dummy_model_input = (tensor([[1], [1]]), tensor([[1.5000], [2.5000]]), tensor([[1], [1]]))
numericalize(arr, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
pad(minibatch: List[Tuple[List[int], List[float], List[int]]]) → Tuple[List[List[int]], List[List[float]], List[int]][source]

Pad a batch of examples using this field.

Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.
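As a hedged sketch of the generic behaviour this docstring describes (assumed names; real fields also honour fix_length, init_token and eos_token), padding to the longest example looks roughly like this:

def pad_to_longest(minibatch, pad_token="<pad>", include_lengths=True):
    max_len = max(len(example) for example in minibatch)
    lengths = [len(example) for example in minibatch]
    padded = [example + [pad_token] * (max_len - len(example)) for example in minibatch]
    return (padded, lengths) if include_lengths else padded

print(pad_to_longest([["on", "repeat"], ["play"]]))
# ([['on', 'repeat'], ['play', '<pad>']], [2, 1])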

pytext.fields.field module

class pytext.fields.field.ActionField(**kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

class pytext.fields.field.DocLabelField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.field.Field(*args, **kwargs)[source]

Bases: torchtext.legacy.data.field.Field

classmethod from_config(config)[source]
get_meta() → pytext.fields.field.FieldMeta[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
pad_length(n)[source]

Override to make the pad length a multiple of 8, to support fp16 training.
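A hedged sketch of that rounding (assuming it rounds up; fp16 kernels generally prefer dimensions that are multiples of 8):

def pad_length(n: int) -> int:
    # Round n up to the next multiple of 8 (assumed behaviour of the override).
    return n if n % 8 == 0 else n + (8 - n % 8)

print([pad_length(n) for n in (1, 7, 8, 13)])  # [8, 8, 8, 16]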

class pytext.fields.field.FieldMeta[source]

Bases: object

class pytext.fields.field.FloatField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.field.FloatVectorField(dim=0, dim_error_check=False, **kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.field.NestedField(*args, **kwargs)[source]

Bases: pytext.fields.field.Field, torchtext.legacy.data.field.NestedField

get_meta()[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
class pytext.fields.field.RawField(*args, is_target=False, **kwargs)[source]

Bases: torchtext.legacy.data.field.RawField

get_meta() → pytext.fields.field.FieldMeta[source]
class pytext.fields.field.SeqFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, pad_token='<pad_seq>', init_token=None, eos_token=None, tokenize=<function no_tokenize>, nesting_field=None, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingNestedField

dummy_model_input = tensor([[[1]], [[1]]])
class pytext.fields.field.TextFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, batch_first=True, sequential=True, pad_token='<pad>', unk_token='<unk>', init_token=None, eos_token=None, lower=False, tokenize=<function no_tokenize>, fix_length=None, pad_first=None, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

dummy_model_input = tensor([[1], [1]])
class pytext.fields.field.VocabUsingField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.Field

Base class for all fields that need to build a vocabulary.

class pytext.fields.field.VocabUsingNestedField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField, pytext.fields.field.NestedField

Base class for all nested fields that need to build a vocabulary.

class pytext.fields.field.WordLabelField(use_bio_labels, **kwargs)[source]

Bases: pytext.fields.field.Field

get_meta()[source]
pytext.fields.field.create_fields(fields_config, field_cls_dict)[source]
pytext.fields.field.create_label_fields(label_configs, label_cls_dict)[source]

pytext.fields.text_field_with_special_unk module

class pytext.fields.text_field_with_special_unk.TextFeatureFieldWithSpecialUnk(*args, unkify_func=<function unkify>, **kwargs)[source]

Bases: pytext.fields.field.TextFeatureField

build_vocab(*args, min_freq=1, **kwargs)[source]

The code is identical to torchtext.legacy.data.Field.build_vocab() up to the UNKification logic. super().build_vocab() cannot be called because the Counter object computed inside torchtext.legacy.data.Field.build_vocab() is required for UNKification, and that object cannot be recovered after the super().build_vocab() call is made.

numericalize(arr: Union[List[List[str]], Tuple[List[List[str]], List[int]]], device: Union[str, torch.device, None] = None)[source]

The code is identical to torchtext.legacy.data.Field.numericalize() except that it calls self._get_idx(x) instead of self.vocab.stoi[x] to look up the index of an item in the vocab. This is needed because torchtext does not allow custom UNKification, so the TextFeatureFieldWithSpecialUnk constructor accepts a function unkify_func() that is used to UNKify tokens instead of assigning all UNKs a single default value.
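As a hedged illustration of the unkify_func idea (the function and vocabulary below are hypothetical, not PyText's defaults): rare tokens are mapped to "shaped" UNK strings instead of a single default unk_token, and index lookup falls back to those.

def my_unkify(token: str) -> str:
    # Hypothetical UNKification: pick a shaped UNK based on the token's surface form.
    if token.isdigit():
        return "<unk>-num"
    if token[:1].isupper():
        return "<unk>-cap"
    return "<unk>"

vocab_stoi = {"<pad>": 0, "<unk>": 1, "<unk>-num": 2, "<unk>-cap": 3, "play": 4}

def get_idx(token: str) -> int:
    # Known token -> its own index; otherwise the index of its shaped UNK.
    return vocab_stoi.get(token, vocab_stoi.get(my_unkify(token), vocab_stoi["<unk>"]))

print([get_idx(t) for t in ["play", "Berlin", "42", "zzz"]])  # [4, 3, 2, 1]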

Module contents

pytext.fields.create_fields(fields_config, field_cls_dict)[source]
pytext.fields.create_label_fields(label_configs, label_cls_dict)[source]
class pytext.fields.ActionField(**kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

class pytext.fields.CharFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, max_word_length=20, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • Remaining keyword arguments – Passed to the constructor of Vocab.
dummy_model_input = tensor([[[1, 1, 1]], [[1, 1, 1]]])
numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
pad(minibatch: List[List[List[str]]]) → List[List[List[str]]][source]

Example of minibatch:

[[['p', 'l', 'a', 'y', '<PAD>', '<PAD>'],
  ['t', 'h', 'a', 't', '<PAD>', '<PAD>'],
  ['t', 'r', 'a', 'c', 'k', '<PAD>'],
  ['o', 'n', '<PAD>', '<PAD>', '<PAD>', '<PAD>'],
  ['r', 'e', 'p', 'e', 'a', 't']
 ], ...
]
class pytext.fields.ContextualTokenEmbeddingField(**kwargs)[source]

Bases: pytext.fields.field.Field

numericalize(batch, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
pad(minibatch: List[List[List[float]]]) → List[List[List[float]]][source]

Example of padded minibatch:

[[[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [3.1, 3.2, 3.3, 3.4, 3.5],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [2.1, 2.2, 2.3, 2.4, 2.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
 [[0.1, 0.2, 0.3, 0.4, 0.5],
  [1.1, 1.2, 1.3, 1.4, 1.5],
  [0.0, 0.0, 0.0, 0.0, 0.0],
  [0.0, 0.0, 0.0, 0.0, 0.0],
 ],
]
class pytext.fields.DictFeatureField(pad_token='<pad>', unk_token='<unk>', batch_first=True, left_pad=False, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

build_vocab(*args, **kwargs)[source]

Construct the Vocab object for this field from one or more datasets.

Parameters:
  • Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
  • Remaining keyword arguments – Passed to the constructor of Vocab.
dummy_model_input = (tensor([[1], [1]]), tensor([[1.5000], [2.5000]]), tensor([[1], [1]]))
numericalize(arr, device=None)[source]

Turn a batch of examples that use this field into a Variable.

If the field has include_lengths=True, a tensor of lengths will be included in the return value.

Parameters:
  • arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or a tuple of that list and a list of lengths for each example if self.include_lengths is True.
  • device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on the CPU. Default: None.
pad(minibatch: List[Tuple[List[int], List[float], List[int]]]) → Tuple[List[List[int]], List[List[float]], List[int]][source]

Pad a batch of examples using this field.

Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.

class pytext.fields.DocLabelField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.Field(*args, **kwargs)[source]

Bases: torchtext.legacy.data.field.Field

classmethod from_config(config)[source]
get_meta() → pytext.fields.field.FieldMeta[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
pad_length(n)[source]

Override to make the pad length a multiple of 8, to support fp16 training.

class pytext.fields.FieldMeta[source]

Bases: object

class pytext.fields.FloatField(**kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.FloatVectorField(dim=0, dim_error_check=False, **kwargs)[source]

Bases: pytext.fields.field.Field

class pytext.fields.RawField(*args, is_target=False, **kwargs)[source]

Bases: torchtext.legacy.data.field.RawField

get_meta() → pytext.fields.field.FieldMeta[source]
class pytext.fields.TextFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, batch_first=True, sequential=True, pad_token='<pad>', unk_token='<unk>', init_token=None, eos_token=None, lower=False, tokenize=<function no_tokenize>, fix_length=None, pad_first=None, min_freq=1, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField

dummy_model_input = tensor([[1], [1]])
class pytext.fields.VocabUsingField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.Field

Base class for all fields that need to build a vocabulary.

class pytext.fields.WordLabelField(use_bio_labels, **kwargs)[source]

Bases: pytext.fields.field.Field

get_meta()[source]
class pytext.fields.NestedField(*args, **kwargs)[source]

Bases: pytext.fields.field.Field, torchtext.legacy.data.field.NestedField

get_meta()[source]
load_meta(metadata: pytext.fields.field.FieldMeta)[source]
class pytext.fields.VocabUsingNestedField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingField, pytext.fields.field.NestedField

Base class for all nested fields that need to build a vocabulary.

class pytext.fields.SeqFeatureField(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, pad_token='<pad_seq>', init_token=None, eos_token=None, tokenize=<function no_tokenize>, nesting_field=None, **kwargs)[source]

Bases: pytext.fields.field.VocabUsingNestedField

dummy_model_input = tensor([[[1]], [[1]]])
class pytext.fields.TextFeatureFieldWithSpecialUnk(*args, unkify_func=<function unkify>, **kwargs)[source]

Bases: pytext.fields.field.TextFeatureField

build_vocab(*args, min_freq=1, **kwargs)[source]

The code is identical to torchtext.legacy.data.Field.build_vocab() up to the UNKification logic. super().build_vocab() cannot be called because the Counter object computed inside torchtext.legacy.data.Field.build_vocab() is required for UNKification, and that object cannot be recovered after the super().build_vocab() call is made.

numericalize(arr: Union[List[List[str]], Tuple[List[List[str]], List[int]]], device: Union[str, torch.device, None] = None)[source]

The code is identical to torchtext.legacy.data.Field.numericalize() except that it calls self._get_idx(x) instead of self.vocab.stoi[x] to look up the index of an item in the vocab. This is needed because torchtext does not allow custom UNKification, so the TextFeatureFieldWithSpecialUnk constructor accepts a function unkify_func() that is used to UNKify tokens instead of assigning all UNKs a single default value.