pytext.models.representations.transformer package
Submodules
pytext.models.representations.transformer.multihead_attention module
-
class pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1)
Bases: torch.nn.modules.module.Module
This is a TorchScriptable implementation of fairseq's MultiheadAttention, written to build a productionized RoBERTa model. It keeps only the elements that the RoBERTa use cases of MultiheadAttention require, restructured and rewritten so that TorchScript can compile them for production use.
The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there is no need to change them.
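The default scaling of 0.125 is consistent with the usual scaled dot-product factor 1 / sqrt(head_dim). A minimal sketch of that arithmetic follows, assuming RoBERTa-base dimensions; embed_dim = 768 matches the embedding_dim default listed for Transformer on this page, while num_heads = 12 is an assumption not stated here.

```python
import math

# Assumed RoBERTa-base dimensions; num_heads = 12 is not stated on this page.
embed_dim = 768
num_heads = 12

head_dim = embed_dim // num_heads    # channels handled by each attention head
scaling = 1.0 / math.sqrt(head_dim)  # scaled dot-product factor 1 / sqrt(d_k)

print(head_dim, scaling)
```

With a different embed_dim or num_heads, the same formula would presumably give the value to pass as scaling.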
-
forward(query, key_padding_mask)
Input shape: Time x Batch x Channel.
Timesteps can be masked by supplying a T x T mask in the attn_mask argument (the signature shown here does not expose attn_mask).
Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape batch x source_length, where padding elements are indicated by 1s.
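The effect of key_padding_mask can be illustrated in plain Python. This is a sketch of the general masking technique, not the module's actual code: the helper names are hypothetical, and PAD = 1 is assumed from the padding_idx default listed for Transformer on this page.

```python
import math

PAD = 1  # assumed padding index (padding_idx default of Transformer)


def key_padding_mask(batch_tokens):
    """Return a batch x source_length mask with 1 at padding positions."""
    return [[1 if tok == PAD else 0 for tok in row] for row in batch_tokens]


def masked_softmax(scores, mask_row):
    """Softmax over one query's key scores; padded keys get zero weight.

    Assumes at least one key in the row is not padding.
    """
    masked = [float("-inf") if m else s for s, m in zip(scores, mask_row)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


tokens = [[0, 31414, 2, 1, 1],
          [0, 713, 16, 10, 2]]
mask = key_padding_mask(tokens)
weights = masked_softmax([0.5, 1.0, 0.2, 3.0, 3.0], mask[0])
print(mask)
print(weights)
```

The two padded keys in the first row carry large raw scores, yet they receive no attention weight once the mask is applied.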
-
pytext.models.representations.transformer.multihead_linear_attention module
-
class pytext.models.representations.transformer.multihead_linear_attention.MultiheadLinearAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1, compress_layer=None, bias: bool = True)
Bases: torch.nn.modules.module.Module
This is a TorchScriptable implementation of fairseq's MultiheadLinearAttention (https://arxiv.org/pdf/2006.04768.pdf), written to build a productionized Linformer model. It keeps only the elements that the RoBERTa use cases of MultiheadLinearAttention require, restructured and rewritten so that TorchScript can compile them for production use.
The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there is no need to change them.
-
forward(query, key_padding_mask)
Input shape: Time x Batch x Channel.
Timesteps can be masked by supplying a T x T mask in the attn_mask argument (the signature shown here does not expose attn_mask).
Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape batch x source_length, where padding elements are indicated by 1s.
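The Linformer paper linked above reduces attention cost by projecting keys and values along the sequence axis from length T down to a fixed length k. The sketch below shows that projection on plain lists, assuming compress_layer plays this role; this page does not describe compress_layer, and the weights and sizes here are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


T, k, d = 4, 2, 3  # sequence length, compressed length, channels (illustrative)

# Keys for one sequence: T rows of d channels.
keys = [[float(t + c) for c in range(d)] for t in range(T)]

# Hypothetical k x T compression weights: each output row averages two timesteps.
compress = [[0.5, 0.5, 0.0, 0.0],
            [0.0, 0.0, 0.5, 0.5]]

compressed_keys = matmul(compress, keys)  # k x d
print(compressed_keys)
```

Attention scores computed against compressed_keys have shape T x k rather than T x T, which is where the linear cost in sequence length comes from.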
-
-
class pytext.models.representations.transformer.multihead_linear_attention.QuantizedMultiheadLinearAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1, compress_layer=None, bias: bool = True)
Bases: pytext.models.representations.transformer.multihead_linear_attention.MultiheadLinearAttention
pytext.models.representations.transformer.positional_embedding module
-
class pytext.models.representations.transformer.positional_embedding.PositionalEmbedding(num_embeddings: int, embedding_dim: int, pad_index: Optional[int] = None)
Bases: torch.nn.modules.module.Module
This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored either by offsetting based on pad_index, or by setting pad_index to None and ensuring that the appropriate position ids are passed to the forward function.
This is a TorchScriptable implementation of fairseq's PositionalEmbedding, written to build a productionized RoBERTa model. It keeps only the elements that the RoBERTa use cases of PositionalEmbedding require, restructured and rewritten so that TorchScript can compile them for production use.
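One way to picture "offsetting based on pad_index" is the fairseq convention, in which real tokens are numbered from pad_index + 1 and padding tokens keep position pad_index. The sketch below assumes that convention; this page does not spell out the exact offset, and make_positions is a hypothetical helper name.

```python
def make_positions(tokens, pad_index):
    """Assign position ids to one token sequence, skipping padding.

    Non-padding tokens count up from pad_index + 1; padding tokens are
    mapped to pad_index so they share a single, ignorable embedding row.
    """
    positions = []
    count = 0
    for tok in tokens:
        if tok == pad_index:
            positions.append(pad_index)
        else:
            count += 1
            positions.append(pad_index + count)
    return positions


print(make_positions([0, 31414, 2, 1, 1], pad_index=1))
```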
pytext.models.representations.transformer.representation module
-
class pytext.models.representations.transformer.representation.TransformerRepresentation(config: pytext.models.representations.transformer.representation.TransformerRepresentation.Config, embed_dim: int)
Bases: pytext.models.module.Module
Representation consisting of stacked multi-head self-attention and position-wise feed-forward layers. Unlike Transformer, we assume inputs are already embedded, so this representation can be used as a drop-in replacement for other temporal representations over text inputs (e.g., BiLSTM and DeepCNNRepresentation).
-
forward(embedded_tokens: torch.Tensor, padding_mask: torch.Tensor) → torch.Tensor
Forward inputs through the transformer layers.
Parameters:
- embedded_tokens (B x T x H) – Tokens previously encoded with token, positional, and segment embeddings.
- padding_mask (B x T) – Boolean mask specifying token positions that self-attention should not operate on.
Returns: Final transformer layer state.
Return type: last_state (B x T x H)
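A B x T padding mask of the kind described above can be derived from per-example sequence lengths. This is a plain-Python sketch with a hypothetical helper name; it assumes True marks the padded positions that self-attention should skip, which is how the parameter description reads.

```python
def padding_mask_from_lengths(lengths, max_len):
    """Build a B x T boolean mask that is True at padded positions."""
    return [[pos >= n for pos in range(max_len)] for n in lengths]


# Two sequences padded to T = 3: the first is full, the second has one token.
print(padding_mask_from_lengths([3, 1], max_len=3))
```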
-
pytext.models.representations.transformer.residual_mlp module
-
class pytext.models.representations.transformer.residual_mlp.GeLU
Bases: torch.nn.modules.module.Module
Component class to wrap F.gelu.
-
forward(input)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
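For reference, the exact GELU activation is x * Phi(x), where Phi is the standard normal CDF. The scalar sketch below uses only the standard library; it illustrates the formula and is not the wrapper's implementation, which delegates to F.gelu on tensors.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for value in (-2.0, 0.0, 1.0, 3.0):
    print(value, gelu(value))
```

Large positive inputs pass through almost unchanged, while large negative inputs are pushed toward zero.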
-
-
class pytext.models.representations.transformer.residual_mlp.ResidualMLP(input_dim: int, hidden_dims: List[int], dropout: float = 0.1, activation=<class 'pytext.models.representations.transformer.residual_mlp.GeLU'>)
Bases: torch.nn.modules.module.Module
A square MLP component which can learn a bias on an input vector. It defaults to GeLU as its activation function (pass a different activation to change this) and retains a residual connection to its original input to help with gradient propagation.
Unlike pytext's MLPDecoder, it doesn't currently allow adding a LayerNorm between hidden layers.
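The "square" and "residual" properties can be sketched for a single input vector: the last layer maps back to the input dimension, and its output is added to the original input as a learned bias. This plain-Python sketch omits dropout and uses hypothetical helper names and made-up weights; the layer layout is an assumption rather than the module's actual code.

```python
import math


def gelu(x):
    """Exact GELU activation for a scalar."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def residual_mlp(x, layers):
    """Run x through the layers and add the result back onto x.

    layers is a list of (weights, bias) pairs; the final pair must map
    back to len(x) so that the residual sum is well defined.
    """
    hidden = x
    for i, (weights, bias) in enumerate(layers):
        hidden = linear(weights, bias, hidden)
        if i < len(layers) - 1:  # activation between layers, not after the last
            hidden = [gelu(h) for h in hidden]
    return [xi + hi for xi, hi in zip(x, hidden)]


x = [1.0, -2.0]
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),  # 2 -> 3
    ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.5, 0.5]),         # 3 -> 2
]
print(residual_mlp(x, layers))
```

Because the final weights are zero in this example, the output is just the input shifted by the last layer's bias, which makes the residual path easy to see.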
-
forward(input)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
pytext.models.representations.transformer.sentence_encoder module
-
class pytext.models.representations.transformer.sentence_encoder.PostEncoder(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)
Bases: pytext.models.representations.transformer.sentence_encoder.SentenceEncoder
-
forward(tokens: torch.Tensor, dense: List[torch.Tensor])
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class pytext.models.representations.transformer.sentence_encoder.SentenceEncoder(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)
Bases: torch.nn.modules.module.Module
This is a TorchScriptable implementation of fairseq's RoBERTa, written to build a productionized RoBERTa model. It keeps only the elements required to implement the RoBERTa model, restructured and rewritten so that TorchScript can compile them for production use.
This SentenceEncoder can load the public RoBERTa weights directly with load_roberta_state_dict, which translates the keys as they exist in the publicly released RoBERTa to the correct structure for this implementation. The default constructor values produce an encoder with the same size and shape as that model.
To use RoBERTa with this, download the RoBERTa public weights as roberta.weights:
>>> import torch
>>> from pytext.models.representations.transformer import SentenceEncoder
>>> encoder = SentenceEncoder()
>>> weights = torch.load("roberta.weights")
>>> encoder.load_roberta_state_dict(weights)
You will still need to preprocess inputs using fairseq and the publicly released vocabs, and then place this encoder in a model alongside, say, an MLP output layer to do classification.
-
forward(tokens)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
pytext.models.representations.transformer.sentence_encoder.check_state_keys(state, keys_regex)
Check whether keys exist in state, using full Python paths.
-
pytext.models.representations.transformer.sentence_encoder.merge_input_projection(state)
New fairseq multihead attention checkpoints split in_projections into k, v, and q projections. This function merges them back to make the checkpoint compatible.
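The merge described above can be pictured as stacking the three projection matrices back into one. The sketch below works on plain lists of rows rather than tensors. The key names (q_proj.weight, k_proj.weight, v_proj.weight, in_proj_weight), the q, k, v stacking order, and the merge_qkv helper are all assumptions for illustration, not the function's actual code.

```python
def merge_qkv(state, prefix):
    """Return a copy of state with separate q/k/v weights merged into one.

    Matrices are lists of rows, so list concatenation stacks them along
    the first dimension. The q, k, v order is an assumption.
    """
    merged = dict(state)
    q = merged.pop(prefix + "q_proj.weight")
    k = merged.pop(prefix + "k_proj.weight")
    v = merged.pop(prefix + "v_proj.weight")
    merged[prefix + "in_proj_weight"] = q + k + v
    return merged


state = {
    "attn.q_proj.weight": [[1, 0]],
    "attn.k_proj.weight": [[0, 1]],
    "attn.v_proj.weight": [[1, 1]],
    "attn.out_proj.weight": [[2, 2]],
}
print(merge_qkv(state, "attn."))
```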
-
pytext.models.representations.transformer.sentence_encoder.remove_state_keys(state, keys_regex)
Remove keys from state that match a regex.
-
pytext.models.representations.transformer.sentence_encoder.rename_component_from_root(state, old_name, new_name)
Rename keys in state, using full Python paths.
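The key-checking, key-removal, and renaming helpers above all operate on state dict keys written as dotted paths. The sketch below shows the general idea on an ordinary dict. The function names are hypothetical stand-ins, and details such as using re.search and returning a new dict instead of modifying in place are assumptions rather than the documented behaviour.

```python
import re


def has_keys(state, keys_regex):
    """Return True if any key in state matches the regex."""
    pattern = re.compile(keys_regex)
    return any(pattern.search(key) for key in state)


def drop_keys(state, keys_regex):
    """Return a copy of state without the keys that match the regex."""
    pattern = re.compile(keys_regex)
    return {k: v for k, v in state.items() if not pattern.search(k)}


def rename_root(state, old_name, new_name):
    """Return a copy of state with the leading component old_name renamed."""
    renamed = {}
    for key, value in state.items():
        if key == old_name or key.startswith(old_name + "."):
            key = new_name + key[len(old_name):]
        renamed[key] = value
    return renamed


state = {"encoder.layer.weight": 1, "encoder.layer.bias": 2, "decoder.weight": 3}
print(has_keys(state, r"^decoder\."))
print(drop_keys(state, r"^decoder\."))
print(rename_root(state, "encoder", "transformer"))
```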
pytext.models.representations.transformer.transformer module
-
class pytext.models.representations.transformer.transformer.SELFIETransformer(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)
Bases: pytext.models.representations.transformer.transformer.Transformer
-
forward(tokens: torch.Tensor, dense: List[torch.Tensor]) → List[torch.Tensor]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class pytext.models.representations.transformer.transformer.Transformer(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)
Bases: torch.nn.modules.module.Module
-
forward(tokens: torch.Tensor) → List[torch.Tensor]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class pytext.models.representations.transformer.transformer.TransformerLayer(embedding_dim: int = 768, attention: Optional[pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention] = None, residual_mlp: Optional[pytext.models.representations.transformer.residual_mlp.ResidualMLP] = None, dropout: float = 0.1)
Bases: torch.nn.modules.module.Module
-
forward(input, key_padding_mask)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
Module contents
This directory contains modules for implementing a productionized RoBERTa model. These modules implement the same Transformer components as the fairseq library, but they are distilled down to just the elements used in the final RoBERTa model, then restructured and rewritten so that TorchScript can compile them for production use.
The SentenceEncoder in particular can load model weights directly from the publicly released RoBERTa weights, translating them to the corresponding values in this implementation.
-
class pytext.models.representations.transformer.MultiheadLinearAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1, compress_layer=None, bias: bool = True)
Bases: torch.nn.modules.module.Module
This is a TorchScriptable implementation of fairseq's MultiheadLinearAttention (https://arxiv.org/pdf/2006.04768.pdf), written to build a productionized Linformer model. It keeps only the elements that the RoBERTa use cases of MultiheadLinearAttention require, restructured and rewritten so that TorchScript can compile them for production use.
The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there is no need to change them.
-
forward(query, key_padding_mask)
Input shape: Time x Batch x Channel.
Timesteps can be masked by supplying a T x T mask in the attn_mask argument (the signature shown here does not expose attn_mask).
Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape batch x source_length, where padding elements are indicated by 1s.
-
-
class pytext.models.representations.transformer.QuantizedMultiheadLinearAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1, compress_layer=None, bias: bool = True)
Bases: pytext.models.representations.transformer.multihead_linear_attention.MultiheadLinearAttention
-
class pytext.models.representations.transformer.MultiheadSelfAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1)
Bases: torch.nn.modules.module.Module
This is a TorchScriptable implementation of fairseq's MultiheadAttention, written to build a productionized RoBERTa model. It keeps only the elements that the RoBERTa use cases of MultiheadAttention require, restructured and rewritten so that TorchScript can compile them for production use.
The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there is no need to change them.
-
forward(query, key_padding_mask)
Input shape: Time x Batch x Channel.
Timesteps can be masked by supplying a T x T mask in the attn_mask argument (the signature shown here does not expose attn_mask).
Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape batch x source_length, where padding elements are indicated by 1s.
-
-
class pytext.models.representations.transformer.PositionalEmbedding(num_embeddings: int, embedding_dim: int, pad_index: Optional[int] = None)
Bases: torch.nn.modules.module.Module
This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored either by offsetting based on pad_index, or by setting pad_index to None and ensuring that the appropriate position ids are passed to the forward function.
This is a TorchScriptable implementation of fairseq's PositionalEmbedding, written to build a productionized RoBERTa model. It keeps only the elements that the RoBERTa use cases of PositionalEmbedding require, restructured and rewritten so that TorchScript can compile them for production use.
-
class pytext.models.representations.transformer.ResidualMLP(input_dim: int, hidden_dims: List[int], dropout: float = 0.1, activation=<class 'pytext.models.representations.transformer.residual_mlp.GeLU'>)
Bases: torch.nn.modules.module.Module
A square MLP component which can learn a bias on an input vector. It defaults to GeLU as its activation function (pass a different activation to change this) and retains a residual connection to its original input to help with gradient propagation.
Unlike pytext's MLPDecoder, it doesn't currently allow adding a LayerNorm between hidden layers.
-
forward(input)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class pytext.models.representations.transformer.SentenceEncoder(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)
Bases: torch.nn.modules.module.Module
This is a TorchScriptable implementation of fairseq's RoBERTa, written to build a productionized RoBERTa model. It keeps only the elements required to implement the RoBERTa model, restructured and rewritten so that TorchScript can compile them for production use.
This SentenceEncoder can load the public RoBERTa weights directly with load_roberta_state_dict, which translates the keys as they exist in the publicly released RoBERTa to the correct structure for this implementation. The default constructor values produce an encoder with the same size and shape as that model.
To use RoBERTa with this, download the RoBERTa public weights as roberta.weights:
>>> import torch
>>> from pytext.models.representations.transformer import SentenceEncoder
>>> encoder = SentenceEncoder()
>>> weights = torch.load("roberta.weights")
>>> encoder.load_roberta_state_dict(weights)
You will still need to preprocess inputs using fairseq and the publicly released vocabs, and then place this encoder in a model alongside, say, an MLP output layer to do classification.
-
forward(tokens)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class pytext.models.representations.transformer.PostEncoder(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)
Bases: pytext.models.representations.transformer.sentence_encoder.SentenceEncoder
-
forward(tokens: torch.Tensor, dense: List[torch.Tensor])
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class pytext.models.representations.transformer.SELFIETransformer(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)
Bases: pytext.models.representations.transformer.transformer.Transformer
-
forward(tokens: torch.Tensor, dense: List[torch.Tensor]) → List[torch.Tensor]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class pytext.models.representations.transformer.Transformer(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)
Bases: torch.nn.modules.module.Module
-
forward(tokens: torch.Tensor) → List[torch.Tensor]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class pytext.models.representations.transformer.TransformerLayer(embedding_dim: int = 768, attention: Optional[pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention] = None, residual_mlp: Optional[pytext.models.representations.transformer.residual_mlp.ResidualMLP] = None, dropout: float = 0.1)
Bases: torch.nn.modules.module.Module
-
forward(input, key_padding_mask)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
-
-
class pytext.models.representations.transformer.TransformerRepresentation(config: pytext.models.representations.transformer.representation.TransformerRepresentation.Config, embed_dim: int)
Bases: pytext.models.module.Module
Representation consisting of stacked multi-head self-attention and position-wise feed-forward layers. Unlike Transformer, we assume inputs are already embedded, so this representation can be used as a drop-in replacement for other temporal representations over text inputs (e.g., BiLSTM and DeepCNNRepresentation).
-
forward(embedded_tokens: torch.Tensor, padding_mask: torch.Tensor) → torch.Tensor
Forward inputs through the transformer layers.
Parameters:
- embedded_tokens (B x T x H) – Tokens previously encoded with token, positional, and segment embeddings.
- padding_mask (B x T) – Boolean mask specifying token positions that self-attention should not operate on.
Returns: Final transformer layer state.
Return type: last_state (B x T x H)
-