pytext.fields package¶
Submodules¶
pytext.fields.char_field module¶
-
class
pytext.fields.char_field.
CharFeatureField
(pad_token='<pad>', unk_token='<unk>', batch_first=True, max_word_length=20, min_freq=1, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
build_vocab
(*args, **kwargs)[source]¶ Construct the Vocab object for this field from one or more datasets.
Parameters: - arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
- keyword arguments (Remaining) – Passed to the constructor of Vocab.
-
dummy_model_input
= tensor([[[1, 1, 1]], [[1, 1, 1]]])¶
-
numericalize
(batch, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
pytext.fields.contextual_token_embedding_field module¶
-
class
pytext.fields.contextual_token_embedding_field.
ContextualTokenEmbeddingField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
numericalize
(batch, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
pad
(minibatch: List[List[List[float]]]) → List[List[List[float]]][source]¶ Example of padded minibatch:
[[[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5], [2.1, 2.2, 2.3, 2.4, 2.5], [3.1, 3.2, 3.3, 3.4, 3.5], ], [[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5], [2.1, 2.2, 2.3, 2.4, 2.5], [0.0, 0.0, 0.0, 0.0, 0.0], ], [[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5], [0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0], ], ]
-
pytext.fields.dict_field module¶
-
class
pytext.fields.dict_field.
DictFeatureField
(pad_token='<pad>', unk_token='<unk>', batch_first=True, left_pad=False, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
build_vocab
(*args, **kwargs)[source]¶ Construct the Vocab object for this field from one or more datasets.
Parameters: - arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
- keyword arguments (Remaining) – Passed to the constructor of Vocab.
-
dummy_model_input
= (tensor([[1], [1]]), tensor([[1.5000], [2.5000]]), tensor([[1], [1]]))¶
-
numericalize
(arr, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
pad
(minibatch: List[Tuple[List[int], List[float], List[int]]]) → Tuple[List[List[int]], List[List[float]], List[int]][source]¶ Pad a batch of examples using this field.
Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.
-
pytext.fields.field module¶
-
class
pytext.fields.field.
DocLabelField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.field.
FloatField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.field.
FloatVectorField
(dim=0, dim_error_check=False, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.field.
NestedField
(*args, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
,torchtext.legacy.data.field.NestedField
-
class
pytext.fields.field.
RawField
(*args, is_target=False, **kwargs)[source]¶ Bases:
torchtext.legacy.data.field.RawField
-
class
pytext.fields.field.
SeqFeatureField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, pad_token='<pad_seq>', init_token=None, eos_token=None, tokenize=<function no_tokenize>, nesting_field=None, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingNestedField
-
dummy_model_input
= tensor([[[1]], [[1]]])¶
-
-
class
pytext.fields.field.
TextFeatureField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, batch_first=True, sequential=True, pad_token='<pad>', unk_token='<unk>', init_token=None, eos_token=None, lower=False, tokenize=<function no_tokenize>, fix_length=None, pad_first=None, min_freq=1, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
dummy_model_input
= tensor([[1], [1]])¶
-
-
class
pytext.fields.field.
VocabUsingField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
Base class for all fields that need to build a vocabulary.
-
class
pytext.fields.field.
VocabUsingNestedField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
,pytext.fields.field.NestedField
Base class for all nested fields that need to build a vocabulary.
-
class
pytext.fields.field.
WordLabelField
(use_bio_labels, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
pytext.fields.text_field_with_special_unk module¶
-
class
pytext.fields.text_field_with_special_unk.
TextFeatureFieldWithSpecialUnk
(*args, unkify_func=<function unkify>, **kwargs)[source]¶ Bases:
pytext.fields.field.TextFeatureField
-
build_vocab
(*args, min_freq=1, **kwargs)[source]¶ Code is exactly same as as torchtext.legacy.data.Field.build_vocab() before the UNKification logic. The reason super().build_vocab() cannot be called is because the Counter object computed in torchtext.legacy.data.Field.build_vocab() is required for UNKification and, that object cannot be recovered after super().build_vocab() call is made.
-
numericalize
(arr: Union[List[List[str]], Tuple[List[List[str]], List[int]]], device: Union[str, torch.device, None] = None)[source]¶ Code is exactly same as torchtext.legacy.data.Field.numericalize() except the call to self._get_idx(x) instead of self.vocab.stoi[x] for getting the index of an item from vocab. This is needed because torchtext doesn’t allow custom UNKification. So, TextFeatureFieldWithSpecialUnk field’s constructor accepts a function unkify_func() that can be used to UNKifying instead of assigning all UNKs a default value.
-
Module contents¶
-
class
pytext.fields.
CharFeatureField
(pad_token='<pad>', unk_token='<unk>', batch_first=True, max_word_length=20, min_freq=1, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
build_vocab
(*args, **kwargs)[source]¶ Construct the Vocab object for this field from one or more datasets.
Parameters: - arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
- keyword arguments (Remaining) – Passed to the constructor of Vocab.
-
dummy_model_input
= tensor([[[1, 1, 1]], [[1, 1, 1]]])¶
-
numericalize
(batch, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
-
class
pytext.fields.
ContextualTokenEmbeddingField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
numericalize
(batch, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
pad
(minibatch: List[List[List[float]]]) → List[List[List[float]]][source]¶ Example of padded minibatch:
[[[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5], [2.1, 2.2, 2.3, 2.4, 2.5], [3.1, 3.2, 3.3, 3.4, 3.5], ], [[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5], [2.1, 2.2, 2.3, 2.4, 2.5], [0.0, 0.0, 0.0, 0.0, 0.0], ], [[0.1, 0.2, 0.3, 0.4, 0.5], [1.1, 1.2, 1.3, 1.4, 1.5], [0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0], ], ]
-
-
class
pytext.fields.
DictFeatureField
(pad_token='<pad>', unk_token='<unk>', batch_first=True, left_pad=False, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
build_vocab
(*args, **kwargs)[source]¶ Construct the Vocab object for this field from one or more datasets.
Parameters: - arguments (Positional) – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
- keyword arguments (Remaining) – Passed to the constructor of Vocab.
-
dummy_model_input
= (tensor([[1], [1]]), tensor([[1.5000], [2.5000]]), tensor([[1], [1]]))¶
-
numericalize
(arr, device=None)[source]¶ Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
Parameters: - arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
- device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
-
pad
(minibatch: List[Tuple[List[int], List[float], List[int]]]) → Tuple[List[List[int]], List[List[float]], List[int]][source]¶ Pad a batch of examples using this field.
Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.
-
-
class
pytext.fields.
DocLabelField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.
FloatField
(**kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.
FloatVectorField
(dim=0, dim_error_check=False, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.
RawField
(*args, is_target=False, **kwargs)[source]¶ Bases:
torchtext.legacy.data.field.RawField
-
class
pytext.fields.
TextFeatureField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, batch_first=True, sequential=True, pad_token='<pad>', unk_token='<unk>', init_token=None, eos_token=None, lower=False, tokenize=<function no_tokenize>, fix_length=None, pad_first=None, min_freq=1, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
-
dummy_model_input
= tensor([[1], [1]])¶
-
-
class
pytext.fields.
VocabUsingField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
Base class for all fields that need to build a vocabulary.
-
class
pytext.fields.
WordLabelField
(use_bio_labels, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
-
class
pytext.fields.
NestedField
(*args, **kwargs)[source]¶ Bases:
pytext.fields.field.Field
,torchtext.legacy.data.field.NestedField
-
class
pytext.fields.
VocabUsingNestedField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, min_freq=1, *args, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingField
,pytext.fields.field.NestedField
Base class for all nested fields that need to build a vocabulary.
-
class
pytext.fields.
SeqFeatureField
(pretrained_embeddings_path='', embed_dim=0, embedding_init_strategy=<EmbedInitStrategy.RANDOM: 'random'>, vocab_file='', vocab_size='', vocab_from_train_data=True, vocab_from_all_data=False, vocab_from_pretrained_embeddings=False, postprocessing=None, use_vocab=True, include_lengths=True, pad_token='<pad_seq>', init_token=None, eos_token=None, tokenize=<function no_tokenize>, nesting_field=None, **kwargs)[source]¶ Bases:
pytext.fields.field.VocabUsingNestedField
-
dummy_model_input
= tensor([[[1]], [[1]]])¶
-
-
class
pytext.fields.
TextFeatureFieldWithSpecialUnk
(*args, unkify_func=<function unkify>, **kwargs)[source]¶ Bases:
pytext.fields.field.TextFeatureField
-
build_vocab
(*args, min_freq=1, **kwargs)[source]¶ Code is exactly same as as torchtext.legacy.data.Field.build_vocab() before the UNKification logic. The reason super().build_vocab() cannot be called is because the Counter object computed in torchtext.legacy.data.Field.build_vocab() is required for UNKification and, that object cannot be recovered after super().build_vocab() call is made.
-
numericalize
(arr: Union[List[List[str]], Tuple[List[List[str]], List[int]]], device: Union[str, torch.device, None] = None)[source]¶ Code is exactly same as torchtext.legacy.data.Field.numericalize() except the call to self._get_idx(x) instead of self.vocab.stoi[x] for getting the index of an item from vocab. This is needed because torchtext doesn’t allow custom UNKification. So, TextFeatureFieldWithSpecialUnk field’s constructor accepts a function unkify_func() that can be used to UNKifying instead of assigning all UNKs a default value.
-