pytext.models.embeddings package

Submodules

pytext.models.embeddings.char_embedding module

class pytext.models.embeddings.char_embedding.CharacterEmbedding(num_embeddings: int, embed_dim: int, out_channels: int, kernel_sizes: List[int], highway_layers: int, projection_dim: Optional[int], *args, **kwargs)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for character aware CNN embeddings for tokens. It uses convolution followed by max-pooling over character embeddings to obtain an embedding vector for each token.

Implementation is loosely based on https://arxiv.org/abs/1508.06615.

Parameters:
  • num_embeddings (int) – Total number of characters (vocabulary size).
  • embed_dim (int) – Size of character embeddings to be passed to convolutions.
  • out_channels (int) – Number of output channels.
  • kernel_sizes (List[int]) – Sizes of the convolution kernels.
  • highway_layers (int) – Number of highway layers applied to pooled output.
  • projection_dim (int) – If specified, size of output embedding for token, via a linear projection from convolution output.
char_embed

Character embedding table.

Type:nn.Embedding
convs

Convolution layers that operate on character embeddings.

Type:nn.ModuleList
highway_layers

Highway layers on top of convolution output.

Type:nn.Module
projection

Final linear layer to token embedding.

Type:nn.Module
embedding_dim

Dimension of the final token embedding produced.

Type:int
forward(chars: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences such that tokens are broken into character ids, produce token embedding vectors for each sentence in the batch.

Parameters:
  • chars (torch.Tensor) – Batch of sentences where each token is broken into characters. Dimension: batch size X maximum sentence length X maximum word length.
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = out_channels * len(self.convs).

Return type:

torch.Tensor
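A minimal usage sketch of the constructor and forward pass described above; all sizes below are hypothetical and chosen only for illustration:

    import torch
    from pytext.models.embeddings.char_embedding import CharacterEmbedding

    # Hypothetical sizes: 100 characters in the vocabulary, 16-dim character
    # embeddings, 20 output channels per kernel size, no highway layers,
    # no projection.
    char_emb = CharacterEmbedding(
        num_embeddings=100,
        embed_dim=16,
        out_channels=20,
        kernel_sizes=[2, 3, 4],
        highway_layers=0,
        projection_dim=None,
    )

    # chars: batch size X maximum sentence length X maximum word length.
    chars = torch.randint(0, 100, (8, 12, 10))
    token_vectors = char_emb(chars)
    # Expected shape: (8, 12, out_channels * len(kernel_sizes)) = (8, 12, 60).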

classmethod from_config(config: pytext.config.field_config.CharFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, vocab_size: Optional[int] = None)[source]

Factory method to construct an instance of CharacterEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (CharFeatConfig) – Configuration object specifying all the parameters of CharacterEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of CharacterEmbedding.

Return type:

CharacterEmbedding

class pytext.models.embeddings.char_embedding.Highway(input_dim: int, num_layers: int = 1)[source]

Bases: torch.nn.modules.module.Module

A Highway layer (https://arxiv.org/abs/1505.00387). Adapted from the AllenNLP implementation.

forward(x: torch.Tensor)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

reset_parameters()[source]
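For reference, the highway transform interpolates between a nonlinear transform of the input and the input itself using a learned gate. The sketch below is an independent illustration of that computation, not the class above; all names are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleHighwayLayer(nn.Module):
        """One highway layer: y = g * relu(W_h x + b_h) + (1 - g) * x."""

        def __init__(self, input_dim: int):
            super().__init__()
            self.transform = nn.Linear(input_dim, input_dim)
            self.gate = nn.Linear(input_dim, input_dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            g = torch.sigmoid(self.gate(x))  # gate values in (0, 1)
            h = F.relu(self.transform(x))    # candidate transform
            return g * h + (1.0 - g) * x     # interpolate with the input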

pytext.models.embeddings.contextual_token_embedding module

class pytext.models.embeddings.contextual_token_embedding.ContextualTokenEmbedding(embed_dim: int, downsample_dim: Optional[int] = None)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for providing token embeddings from a pretrained model.

forward(embedding: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.config.field_config.ContextualTokenEmbeddingConfig, *args, **kwargs)[source]

pytext.models.embeddings.dict_embedding module

class pytext.models.embeddings.dict_embedding.DictEmbedding(num_embeddings: int, embed_dim: int, pooling_type: pytext.config.module_config.PoolingType, pad_index: int = 1, unk_index: int = 0, mobile: bool = False)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for dictionary feature embeddings for tokens. Dictionary features are also known as gazetteer features. These are per-token discrete features that the module learns embeddings for. Example: for the utterance "Order coffee from Starbucks", the dictionary features could be

[
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}}
]

Thus, for a given token there can be more than one dictionary feature, each with a confidence score. The final embedding for a token is the weighted average of the dictionary embeddings followed by a pooling operation, so that the module produces an embedding vector per token.

Parameters:
  • num_embeddings (int) – Total number of dictionary features (vocabulary size).
  • embed_dim (int) – Size of embedding vector.
  • pooling_type (PoolingType) – Type of pooling for combining the dictionary feature embeddings.
pooling_type

Type of pooling for combining the dictionary feature embeddings.

Type:PoolingType
find_and_replace(tensor: torch.Tensor, find_val: int, replace_val: int) → torch.Tensor[source]

torch.where is not supported for mobile ONNX; this workaround provides a mobile-exportable equivalent of torch.where that is computationally more expensive.
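A sketch of this kind of workaround using only basic tensor arithmetic instead of torch.where; the actual implementation in PyText may differ:

    import torch

    def find_and_replace_no_where(tensor: torch.Tensor, find_val: int,
                                  replace_val: int) -> torch.Tensor:
        # Build a 0/1 mask of positions equal to find_val, then blend the two
        # values arithmetically. This avoids torch.where at the cost of extra
        # element-wise operations.
        mask = (tensor == find_val).to(tensor.dtype)
        return tensor * (1 - mask) + replace_val * mask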

forward(feats: torch.Tensor, weights: torch.Tensor, lengths: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences containing dictionary feature ids per token, produce token embedding vectors for each sentence in the batch.

Parameters:
  • feats (torch.Tensor) – Batch of sentences with dictionary feature ids. shape: [bsz, seq_len * max_feat_per_token]
  • weights (torch.Tensor) – Batch of sentences with dictionary feature weights for the dictionary features. shape: [bsz, seq_len * max_feat_per_token]
  • lengths (torch.Tensor) – Batch of sentences with the number of dictionary features per token. shape: [bsz, seq_len]
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = embed_dim passed to the constructor.

Return type:

torch.Tensor
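A shape-level usage sketch with hypothetical sizes (bsz=2, seq_len=4, max_feat_per_token=2); it assumes PoolingType.MEAN is an available pooling option:

    import torch
    from pytext.config.module_config import PoolingType
    from pytext.models.embeddings.dict_embedding import DictEmbedding

    dict_emb = DictEmbedding(num_embeddings=50, embed_dim=8,
                             pooling_type=PoolingType.MEAN)

    bsz, seq_len, max_feat_per_token = 2, 4, 2
    feats = torch.randint(2, 50, (bsz, seq_len * max_feat_per_token))  # feature ids
    weights = torch.rand(bsz, seq_len * max_feat_per_token)            # confidence scores
    lengths = torch.ones(bsz, seq_len, dtype=torch.long)               # features per token

    token_vectors = dict_emb(feats, weights, lengths)
    # Expected shape: (bsz, seq_len, embed_dim) = (2, 4, 8).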

classmethod from_config(config: pytext.config.field_config.DictFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None)[source]

Factory method to construct an instance of DictEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (DictFeatConfig) – Configuration object specifying all the parameters of DictEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of DictEmbedding.

Return type:

DictEmbedding

pytext.models.embeddings.embedding_base module

class pytext.models.embeddings.embedding_base.EmbeddingBase(embedding_dim: int)[source]

Bases: pytext.models.module.Module

Base class for token level embedding modules.

Parameters:embedding_dim (int) – Size of embedding vector.
num_emb_modules

Number of ways to embed a token.

Type:int
embedding_dim

Size of embedding vector.

Type:int
visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.
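A minimal sketch of a custom subclass; it assumes (as for the built-in modules above) that a subclass passes its output size to EmbeddingBase.__init__ and implements forward:

    import torch
    from pytext.models.embeddings.embedding_base import EmbeddingBase

    class ConstantEmbedding(EmbeddingBase):
        """Toy embedding that maps every token id to the same learned vector."""

        def __init__(self, embedding_dim: int):
            super().__init__(embedding_dim)
            self.vector = torch.nn.Parameter(torch.zeros(embedding_dim))

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: batch size X sequence length of token ids.
            return self.vector.expand(*tokens.shape, self.embedding_dim)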

pytext.models.embeddings.embedding_list module

class pytext.models.embeddings.embedding_list.EmbeddingList(embeddings: Iterable[pytext.models.embeddings.embedding_base.EmbeddingBase], concat: bool)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase, torch.nn.modules.container.ModuleList

There is more than one way to embed a token; this module provides a way to generate a list of sub-embeddings and either concatenate the embedding tensors into a single Tensor or return a tuple of Tensors that can be used by downstream modules.

Parameters:
  • embeddings (Iterable[EmbeddingBase]) – A sequence of embedding modules to embed a token.
  • concat (bool) – Whether to concatenate the embedding vectors emitted from embeddings modules.
num_emb_modules

Number of flattened embeddings in embeddings, e.g. ((e1, e2), e3) has 3 in total.

Type:int
input_start_indices

List of indices of the sub-embeddings in the embedding list.

Type:List[int]
concat

Whether to concatenate the embedding vectors emitted from embeddings modules.

Type:bool
embedding_dim

Total embedding size; can be a single int or a tuple of ints depending on the concat setting.

forward(*emb_input) → Union[torch.Tensor, Tuple[torch.Tensor]][source]

Get embeddings from all sub-embeddings and either concatenate them into one Tensor or return them in a tuple.

Parameters:*emb_input (type) – Sequence of token level embeddings to combine. The inputs should match the size of configured embeddings. Each of them is either a Tensor or a tuple of Tensors.
Returns:
If concat is True, a Tensor is returned by concatenating all embeddings. Otherwise all embeddings are returned in a tuple.
Return type:Union[torch.Tensor, Tuple[torch.Tensor]]
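A sketch that combines two of the modules documented on this page; all sizes are hypothetical:

    import torch
    from pytext.models.embeddings.char_embedding import CharacterEmbedding
    from pytext.models.embeddings.embedding_list import EmbeddingList
    from pytext.models.embeddings.word_embedding import WordEmbedding

    word_emb = WordEmbedding(num_embeddings=1000, embedding_dim=64)
    char_emb = CharacterEmbedding(num_embeddings=100, embed_dim=16,
                                  out_channels=16, kernel_sizes=[3],
                                  highway_layers=0, projection_dim=None)

    emb_list = EmbeddingList([word_emb, char_emb], concat=True)
    # In concat mode embedding_dim should be 64 + 16 = 80.

    tokens = torch.randint(0, 1000, (8, 12))    # word ids: bsz x seq_len
    chars = torch.randint(0, 100, (8, 12, 10))  # char ids: bsz x seq_len x word_len
    combined = emb_list(tokens, chars)          # expected shape: (8, 12, 80)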
visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

pytext.models.embeddings.mlp_embedding module

class pytext.models.embeddings.mlp_embedding.MLPEmbedding(embedding_dim: int = 300, embeddings_weight: Optional[torch.Tensor] = None, init_range: Optional[List[int]] = None, init_std: Optional[float] = None, mlp_layer_dims: List[int] = ())[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

An MLP embedding wrapper module around torch.nn.Embedding to add transformations for float tensors.

Parameters:
  • num_embeddings (int) – Total number of words/tokens (vocabulary size).
  • embedding_dim (int) – Size of embedding vector.
  • embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
  • init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
  • mlp_layer_dims (List[int]) – List of layer dimensions (if any) to add on top of the embedding lookup.
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.config.field_config.MLPFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None, init_from_saved_state: Optional[bool] = False)[source]

Factory method to construct an instance of MLPEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (MLPFeatConfig) – Configuration object specifying all the parameters of MLPEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of MLPEmbedding.

Return type:

MLPEmbedding

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

pytext.models.embeddings.scriptable_embedding_list module

class pytext.models.embeddings.scriptable_embedding_list.ScriptableEmbeddingList(embeddings: Iterable[pytext.models.embeddings.embedding_base.EmbeddingBase])[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

This class is a Torchscript-friendly version of pytext.models.embeddings.EmbeddingList. The main differences are that it requires input arguments to be passed in as a list of Tensors, since Torchscript does not allow variable arguments, and that it only supports concat mode, since Torchscript does not support return value variance.

class Wrapper1(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase)[source]

Bases: torch.nn.modules.module.Module

forward(xs: List[torch.Tensor])[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class Wrapper3(embedding: pytext.models.embeddings.embedding_base.EmbeddingBase)[source]

Bases: torch.nn.modules.module.Module

forward(xs: List[torch.Tensor])[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

forward(emb_input: List[List[torch.Tensor]]) → torch.Tensor[source]

Get embeddings from all sub-embeddings and concatenate them into a single Tensor (only concat mode is supported; see the class description above).

Parameters:emb_input (type) – Sequence of token level embeddings to combine. The inputs should match the size of configured embeddings. Each of them is a List of Tensors.
Returns: A Tensor obtained by concatenating all embeddings.
Return type:torch.Tensor
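A sketch of the calling convention; sizes are hypothetical, and whether a given sub-embedding scripts cleanly depends on its own implementation:

    import torch
    from pytext.models.embeddings.char_embedding import CharacterEmbedding
    from pytext.models.embeddings.scriptable_embedding_list import (
        ScriptableEmbeddingList,
    )
    from pytext.models.embeddings.word_embedding import WordEmbedding

    word_emb = WordEmbedding(num_embeddings=1000, embedding_dim=64)
    char_emb = CharacterEmbedding(num_embeddings=100, embed_dim=16,
                                  out_channels=16, kernel_sizes=[3],
                                  highway_layers=0, projection_dim=None)
    scriptable = ScriptableEmbeddingList([word_emb, char_emb])

    tokens = torch.randint(0, 1000, (8, 12))    # word ids
    chars = torch.randint(0, 100, (8, 12, 10))  # char ids

    # Torchscript forbids variable arguments, so each sub-embedding's inputs
    # are passed as a list of Tensors, and those lists are wrapped in an
    # outer list. The result is a single concatenated Tensor.
    combined = scriptable([[tokens], [chars]])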
visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

pytext.models.embeddings.word_embedding module

class pytext.models.embeddings.word_embedding.WordEmbedding(num_embeddings: int, embedding_dim: int = 300, embeddings_weight: Optional[torch.Tensor] = None, init_range: Optional[List[int]] = None, init_std: Optional[float] = None, unk_token_idx: int = 0, mlp_layer_dims: List[int] = (), padding_idx: Optional[int] = None, vocab: Optional[List[str]] = None)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

A word embedding wrapper module around torch.nn.Embedding with options to initialize the word embedding weights and add MLP layers acting on each word.

Note: Embedding weights for UNK token are always initialized to zeros.

Parameters:
  • num_embeddings (int) – Total number of words/tokens (vocabulary size).
  • embedding_dim (int) – Size of embedding vector.
  • embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
  • init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
  • unk_token_idx (int) – Index of UNK token in the word vocabulary.
  • mlp_layer_dims (List[int]) – List of layer dimensions (if any) to add on top of the embedding lookup.
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
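A minimal lookup sketch with hypothetical sizes:

    import torch
    from pytext.models.embeddings.word_embedding import WordEmbedding

    word_emb = WordEmbedding(num_embeddings=1000, embedding_dim=64)

    tokens = torch.randint(0, 1000, (8, 12))  # token ids: batch size x sequence length
    vectors = word_emb(tokens)                # expected shape: (8, 12, 64)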

freeze()[source]
classmethod from_config(config: pytext.config.field_config.WordFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None, init_from_saved_state: Optional[bool] = False)[source]

Factory method to construct an instance of WordEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (WordFeatConfig) – Configuration object specifying all the parameters of WordEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of WordEmbedding.

Return type:

type

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

pytext.models.embeddings.word_seq_embedding module

class pytext.models.embeddings.word_seq_embedding.WordSeqEmbedding(lstm_config: pytext.models.representations.bilstm.BiLSTM.Config, num_embeddings: int, word_embed_dim: int = 300, embeddings_weight: Optional[torch.Tensor] = None, init_range: Optional[List[int]] = None, init_std: Optional[float] = None, unk_token_idx: int = 0, padding_idx: Optional[int] = None, vocab: Optional[List[str]] = None)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

An embedding module that represents a sequence of sentences.

Parameters:
  • lstm_config (BiLSTM.Config) – Config of the BiLSTM layer.
  • num_embeddings (int) – Total number of words/tokens (vocabulary size).
  • word_embed_dim (int) – Size of the word embedding vectors.
  • embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
  • init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
  • unk_token_idx (int) – Index of UNK token in the word vocabulary.
forward(seq_token_idx, seq_token_count)[source]
Parameters:
  • seq_token_idx – shape [batch_size * max_seq_len * max_token_count]
  • seq_token_count – shape [batch_size * max_seq_len]
Returns:

Embedding tensor of shape [batch_size * max_seq_len * output_dim].

Return type:

torch.Tensor
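A shape-level sketch of the inputs described above, assuming the per-sequence token ids form a 3-dimensional tensor and that word_seq_emb is a WordSeqEmbedding built elsewhere (for example via from_config):

    import torch

    batch_size, max_seq_len, max_token_count = 2, 3, 6

    seq_token_idx = torch.randint(0, 1000, (batch_size, max_seq_len, max_token_count))
    seq_token_count = torch.full((batch_size, max_seq_len), max_token_count,
                                 dtype=torch.long)

    # embedding = word_seq_emb(seq_token_idx, seq_token_count)
    # Expected shape: (batch_size, max_seq_len, word_seq_emb.embedding_dim).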

freeze()[source]
classmethod from_config(config: pytext.models.embeddings.word_seq_embedding.WordSeqEmbedding.Config, tensorizer: pytext.data.tensorizers.Tensorizer = None, init_from_saved_state: Optional[bool] = False)[source]

Factory method to construct an instance of WordSeqEmbedding from the module’s config object.

Parameters:
  • config (WordSeqEmbedding.Config) – Configuration object specifying all the parameters of WordSeqEmbedding.
Returns:

An instance of WordSeqEmbedding.

Return type:

WordSeqEmbedding

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

Module contents

class pytext.models.embeddings.EmbeddingBase(embedding_dim: int)[source]

Bases: pytext.models.module.Module

Base class for token level embedding modules.

Parameters:embedding_dim (int) – Size of embedding vector.
num_emb_modules

Number of ways to embed a token.

Type:int
embedding_dim

Size of embedding vector.

Type:int
visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

class pytext.models.embeddings.EmbeddingList(embeddings: Iterable[pytext.models.embeddings.embedding_base.EmbeddingBase], concat: bool)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase, torch.nn.modules.container.ModuleList

There is more than one way to embed a token; this module provides a way to generate a list of sub-embeddings and either concatenate the embedding tensors into a single Tensor or return a tuple of Tensors that can be used by downstream modules.

Parameters:
  • embeddings (Iterable[EmbeddingBase]) – A sequence of embedding modules to embed a token.
  • concat (bool) – Whether to concatenate the embedding vectors emitted from embeddings modules.
num_emb_modules

Number of flattened embeddings in embeddings, e.g. ((e1, e2), e3) has 3 in total.

Type:int
input_start_indices

List of indices of the sub-embeddings in the embedding list.

Type:List[int]
concat

Whether to concatenate the embedding vectors emitted from embeddings modules.

Type:bool
embedding_dim

Total embedding size; can be a single int or a tuple of ints depending on the concat setting.

forward(*emb_input) → Union[torch.Tensor, Tuple[torch.Tensor]][source]

Get embeddings from all sub-embeddings and either concatenate them into one Tensor or return them in a tuple.

Parameters:*emb_input (type) – Sequence of token level embeddings to combine. The inputs should match the size of configured embeddings. Each of them is either a Tensor or a tuple of Tensors.
Returns:
If concat is True, a Tensor is returned by concatenating all embeddings. Otherwise all embeddings are returned in a tuple.
Return type:Union[torch.Tensor, Tuple[torch.Tensor]]
visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

class pytext.models.embeddings.WordEmbedding(num_embeddings: int, embedding_dim: int = 300, embeddings_weight: Optional[torch.Tensor] = None, init_range: Optional[List[int]] = None, init_std: Optional[float] = None, unk_token_idx: int = 0, mlp_layer_dims: List[int] = (), padding_idx: Optional[int] = None, vocab: Optional[List[str]] = None)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

A word embedding wrapper module around torch.nn.Embedding with options to initialize the word embedding weights and add MLP layers acting on each word.

Note: Embedding weights for UNK token are always initialized to zeros.

Parameters:
  • num_embeddings (int) – Total number of words/tokens (vocabulary size).
  • embedding_dim (int) – Size of embedding vector.
  • embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
  • init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
  • unk_token_idx (int) – Index of UNK token in the word vocabulary.
  • mlp_layer_dims (List[int]) – List of layer dimensions (if any) to add on top of the embedding lookup.
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

freeze()[source]
classmethod from_config(config: pytext.config.field_config.WordFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None, init_from_saved_state: Optional[bool] = False)[source]

Factory method to construct an instance of WordEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (WordFeatConfig) – Configuration object specifying all the parameters of WordEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of WordEmbedding.

Return type:

WordEmbedding

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

class pytext.models.embeddings.DictEmbedding(num_embeddings: int, embed_dim: int, pooling_type: pytext.config.module_config.PoolingType, pad_index: int = 1, unk_index: int = 0, mobile: bool = False)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for dictionary feature embeddings for tokens. Dictionary features are also known as gazetteer features. These are per-token discrete features that the module learns embeddings for. Example: for the utterance "Order coffee from Starbucks", the dictionary features could be

[
    {"tokenIdx": 1, "features": {"drink/beverage": 0.8, "music/song": 0.2}},
    {"tokenIdx": 3, "features": {"store/coffee_shop": 1.0}}
]

Thus, for a given token there can be more than one dictionary feature, each with a confidence score. The final embedding for a token is the weighted average of the dictionary embeddings followed by a pooling operation, so that the module produces an embedding vector per token.

Parameters:
  • num_embeddings (int) – Total number of dictionary features (vocabulary size).
  • embed_dim (int) – Size of embedding vector.
  • pooling_type (PoolingType) – Type of pooling for combining the dictionary feature embeddings.
pooling_type

Type of pooling for combining the dictionary feature embeddings.

Type:PoolingType
find_and_replace(tensor: torch.Tensor, find_val: int, replace_val: int) → torch.Tensor[source]

torch.where is not supported for mobile ONNX; this workaround provides a mobile-exportable equivalent of torch.where that is computationally more expensive.

forward(feats: torch.Tensor, weights: torch.Tensor, lengths: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences containing dictionary feature ids per token, produce token embedding vectors for each sentence in the batch.

Parameters:
  • feats (torch.Tensor) – Batch of sentences with dictionary feature ids. shape: [bsz, seq_len * max_feat_per_token]
  • weights (torch.Tensor) – Batch of sentences with dictionary feature weights for the dictionary features. shape: [bsz, seq_len * max_feat_per_token]
  • lengths (torch.Tensor) – Batch of sentences with the number of dictionary features per token. shape: [bsz, seq_len]
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = embed_dim passed to the constructor.

Return type:

torch.Tensor

classmethod from_config(config: pytext.config.field_config.DictFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, labels: Optional[pytext.data.utils.Vocabulary] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None)[source]

Factory method to construct an instance of DictEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (DictFeatConfig) – Configuration object specifying all the parameters of DictEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of DictEmbedding.

Return type:

DictEmbedding

class pytext.models.embeddings.CharacterEmbedding(num_embeddings: int, embed_dim: int, out_channels: int, kernel_sizes: List[int], highway_layers: int, projection_dim: Optional[int], *args, **kwargs)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for character aware CNN embeddings for tokens. It uses convolution followed by max-pooling over character embeddings to obtain an embedding vector for each token.

Implementation is loosely based on https://arxiv.org/abs/1508.06615.

Parameters:
  • num_embeddings (int) – Total number of characters (vocabulary size).
  • embed_dim (int) – Size of character embeddings to be passed to convolutions.
  • out_channels (int) – Number of output channels.
  • kernel_sizes (List[int]) – Sizes of the convolution kernels.
  • highway_layers (int) – Number of highway layers applied to pooled output.
  • projection_dim (int) – If specified, size of output embedding for token, via a linear projection from convolution output.
char_embed

Character embedding table.

Type:nn.Embedding
convs

Convolution layers that operate on character embeddings.

Type:nn.ModuleList
highway_layers

Highway layers on top of convolution output.

Type:nn.Module
projection

Final linear layer to token embedding.

Type:nn.Module
embedding_dim

Dimension of the final token embedding produced.

Type:int
forward(chars: torch.Tensor) → torch.Tensor[source]

Given a batch of sentences such that tokens are broken into character ids, produce token embedding vectors for each sentence in the batch.

Parameters:
  • chars (torch.Tensor) – Batch of sentences where each token is broken into characters. Dimension: batch size X maximum sentence length X maximum word length.
Returns:

Embedded batch of sentences. Dimension: batch size X maximum sentence length X token embedding size. Token embedding size = out_channels * len(self.convs).

Return type:

torch.Tensor

classmethod from_config(config: pytext.config.field_config.CharFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, vocab_size: Optional[int] = None)[source]

Factory method to construct an instance of CharacterEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (CharFeatConfig) – Configuration object specifying all the parameters of CharacterEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of CharacterEmbedding.

Return type:

CharacterEmbedding

class pytext.models.embeddings.ContextualTokenEmbedding(embed_dim: int, downsample_dim: Optional[int] = None)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

Module for providing token embeddings from a pretrained model.

forward(embedding: torch.Tensor) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.config.field_config.ContextualTokenEmbeddingConfig, *args, **kwargs)[source]
class pytext.models.embeddings.WordSeqEmbedding(lstm_config: pytext.models.representations.bilstm.BiLSTM.Config, num_embeddings: int, word_embed_dim: int = 300, embeddings_weight: Optional[torch.Tensor] = None, init_range: Optional[List[int]] = None, init_std: Optional[float] = None, unk_token_idx: int = 0, padding_idx: Optional[int] = None, vocab: Optional[List[str]] = None)[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

An embedding module that represents a sequence of sentences.

Parameters:
  • lstm_config (BiLSTM.Config) – Config of the BiLSTM layer.
  • num_embeddings (int) – Total number of words/tokens (vocabulary size).
  • word_embed_dim (int) – Size of the word embedding vectors.
  • embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
  • init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
  • unk_token_idx (int) – Index of UNK token in the word vocabulary.
forward(seq_token_idx, seq_token_count)[source]
Parameters:
  • seq_token_idx – shape [batch_size * max_seq_len * max_token_count]
  • seq_token_count – shape [batch_size * max_seq_len]
Returns:

Embedding tensor of shape [batch_size * max_seq_len * output_dim].

Return type:

torch.Tensor

freeze()[source]
classmethod from_config(config: pytext.models.embeddings.word_seq_embedding.WordSeqEmbedding.Config, tensorizer: pytext.data.tensorizers.Tensorizer = None, init_from_saved_state: Optional[bool] = False)[source]

Factory method to construct an instance of WordSeqEmbedding from the module’s config object.

Parameters:
  • config (WordSeqEmbedding.Config) – Configuration object specifying all the parameters of WordSeqEmbedding.
Returns:

An instance of WordSeqEmbedding.

Return type:

WordSeqEmbedding

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.

class pytext.models.embeddings.MLPEmbedding(embedding_dim: int = 300, embeddings_weight: Optional[torch.Tensor] = None, init_range: Optional[List[int]] = None, init_std: Optional[float] = None, mlp_layer_dims: List[int] = ())[source]

Bases: pytext.models.embeddings.embedding_base.EmbeddingBase

An MLP embedding wrapper module around torch.nn.Embedding to add transformations for float tensors.

Parameters:
  • num_embeddings (int) – Total number of words/tokens (vocabulary size).
  • embedding_dim (int) – Size of embedding vector.
  • embeddings_weight (torch.Tensor) – Pretrained weights to initialize the embedding table with.
  • init_range (List[int]) – Range of uniform distribution to initialize the weights with if embeddings_weight is None.
  • mlp_layer_dims (List[int]) – List of layer dimensions (if any) to add on top of the embedding lookup.
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_config(config: pytext.config.field_config.MLPFeatConfig, metadata: Optional[pytext.fields.field.FieldMeta] = None, tensorizer: Optional[pytext.data.tensorizers.Tensorizer] = None, init_from_saved_state: Optional[bool] = False)[source]

Factory method to construct an instance of MLPEmbedding from the module’s config object and the field’s metadata object.

Parameters:
  • config (MLPFeatConfig) – Configuration object specifying all the parameters of MLPEmbedding.
  • metadata (FieldMeta) – Object containing this field’s metadata.
Returns:

An instance of MLPEmbedding.

Return type:

MLPEmbedding

visualize(summary_writer: SummaryWriter)[source]

Overridden in subclasses to implement TensorBoard visualization of the embedding space.