pytext.models.representations.transformer package

Submodules

pytext.models.representations.transformer.multihead_attention module

class pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

This is a TorchScriptable implementation of MultiheadAttention from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there’s no need to change them.

forward(query, key_padding_mask)[source]

Input shape: Time x Batch x Channel. Timesteps can be masked by supplying a T x T mask in the attn_mask argument. Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape: batch x source_length, where padding elements are indicated by 1s.

prune_multi_heads(heads: List[int])[source]
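
A hedged usage sketch of the forward contract above (the shapes follow the Time x Batch x Channel convention; that forward returns a single attended tensor of the same shape is an assumption, not something stated on this page):

>>> import torch
>>> attn = MultiheadSelfAttention(embed_dim=768, num_heads=12)
>>> query = torch.rand(20, 2, 768)                           # T x B x C
>>> key_padding_mask = torch.zeros(2, 20, dtype=torch.bool)  # B x T, 1s/True mark padding
>>> attended = attn(query, key_padding_mask)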

pytext.models.representations.transformer.multihead_linear_attention module

class pytext.models.representations.transformer.multihead_linear_attention.MultiheadLinearAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1, compress_layer=None, bias: bool = True)[source]

Bases: torch.nn.modules.module.Module

This is a TorchScriptable implementation of MultiheadLinearAttention (https://arxiv.org/pdf/2006.04768.pdf) from fairseq for the purposes of creating a productionized Linformer model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadLinearAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there’s no need to change them.

forward(query, key_padding_mask)[source]

Input shape: Time x Batch x Channel. Timesteps can be masked by supplying a T x T mask in the attn_mask argument. Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape: batch x source_length, where padding elements are indicated by 1s.

get_compressed_projection(k_input: torch.Tensor, v_input: torch.Tensor, target_length: int) → Tuple[torch.Tensor, torch.Tensor][source]
prune_multi_linear_heads(heads: List[int])[source]
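
The Linformer trick this class implements is to attend over a fixed-length compression of the keys and values rather than the full source length. A minimal, self-contained sketch of that idea (independent of this class's internals; the shapes, the shared nn.Linear compression, and the compressed length k are illustrative assumptions):

>>> import torch
>>> T, B, C, k = 128, 2, 768, 32
>>> compress = torch.nn.Linear(T, k)                 # shared projection along the time axis
>>> keys = torch.rand(T, B, C)
>>> compressed = compress(keys.permute(1, 2, 0))     # B x C x T -> B x C x k
>>> compressed = compressed.permute(2, 0, 1)         # back to k x B x C
>>> compressed.shape
torch.Size([32, 2, 768])
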
class pytext.models.representations.transformer.multihead_linear_attention.QuantizedMultiheadLinearAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1, compress_layer=None, bias: bool = True)[source]

Bases: pytext.models.representations.transformer.multihead_linear_attention.MultiheadLinearAttention

get_compressed_projection(k_input: torch.Tensor, v_input: torch.Tensor, target_length: int) → Tuple[torch.Tensor, torch.Tensor][source]

pytext.models.representations.transformer.positional_embedding module

class pytext.models.representations.transformer.positional_embedding.PositionalEmbedding(num_embeddings: int, embedding_dim: int, pad_index: Optional[int] = None)[source]

Bases: torch.nn.modules.module.Module

This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored by either offsetting based on pad_index or by setting pad_index to None and ensuring that the appropriate position ids are passed to the forward function.

This is a TorchScriptable implementation of PositionalEmbedding from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of PositionalEmbedding, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

forward(input)[source]

Input is expected to be of size [batch_size x sequence_length].

max_positions()[source]

Maximum number of supported positions.
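
A hedged usage sketch (the token ids below and the assumption that the output is a [batch_size x sequence_length x embedding_dim] tensor are illustrative):

>>> import torch
>>> pos_embed = PositionalEmbedding(num_embeddings=514, embedding_dim=768, pad_index=1)
>>> tokens = torch.tensor([[0, 9, 7, 2, 1, 1]])   # batch_size x sequence_length, 1 is padding
>>> embeddings = pos_embed(tokens)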

pytext.models.representations.transformer.positional_embedding.make_positions(tensor, pad_index: int)[source]

Replace non-padding symbols with their position numbers. Position numbers begin at pad_index+1. Padding symbols are ignored.
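
For example, with pad_index = 1, non-padding tokens receive positions starting at 2 while padding positions keep the pad index. A hand-computed illustration of this convention (standalone tensor arithmetic, not a call into the function itself):

>>> import torch
>>> tokens = torch.tensor([[9, 7, 7, 1, 1]])   # 1 is the padding index
>>> mask = tokens.ne(1).long()
>>> mask.cumsum(dim=1) * mask + 1              # positions 2, 3, 4; padding keeps index 1
tensor([[2, 3, 4, 1, 1]])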

pytext.models.representations.transformer.representation module

class pytext.models.representations.transformer.representation.TransformerRepresentation(config: pytext.models.representations.transformer.representation.TransformerRepresentation.Config, embed_dim: int)[source]

Bases: pytext.models.module.Module

Representation consisting of stacked multi-head self-attention and position-wise feed-forward layers. Unlike Transformer, we assume inputs are already embedded; this representation can therefore be used as a drop-in replacement for other temporal representations over text inputs (e.g., BiLSTM and DeepCNNRepresentation).

forward(embedded_tokens: torch.Tensor, padding_mask: torch.Tensor) → torch.Tensor[source]

Forward inputs through the transformer layers.

Parameters:
  • embedded_tokens (B x T x H) – Tokens previously encoded with token, positional, and segment embeddings.
  • padding_mask (B x T) – Boolean mask specifying token positions that self-attention should not operate on.
Returns:

Final transformer layer state.

Return type:

last_state (B x T x H)
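
A hedged usage sketch of this contract (constructing the module directly from its default Config is an assumption; within PyText it is normally created from a config):

>>> import torch
>>> rep = TransformerRepresentation(TransformerRepresentation.Config(), embed_dim=768)
>>> embedded_tokens = torch.rand(2, 20, 768)                 # B x T x H
>>> padding_mask = torch.zeros(2, 20, dtype=torch.bool)      # True marks padding positions
>>> last_state = rep(embedded_tokens, padding_mask)          # B x T x H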

pytext.models.representations.transformer.residual_mlp module

class pytext.models.representations.transformer.residual_mlp.GeLU[source]

Bases: torch.nn.modules.module.Module

Component class to wrap F.gelu.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
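
Since the class only wraps F.gelu, calling the module is equivalent to calling the functional form directly (the input values are illustrative):

>>> import torch
>>> act = GeLU()
>>> act(torch.tensor([-1.0, 0.0, 1.0]))   # same values as torch.nn.functional.gelu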

class pytext.models.representations.transformer.residual_mlp.ResidualMLP(input_dim: int, hidden_dims: List[int], dropout: float = 0.1, activation=<class 'pytext.models.representations.transformer.residual_mlp.GeLU'>)[source]

Bases: torch.nn.modules.module.Module

A square MLP component which can learn a bias on an input vector. This MLP in particular defaults to using GeLU as its activation function (this can be changed by passing a different activation function), and retains a residual connection to its original input to help with gradient propagation.

Unlike pytext’s MLPDecoder, it doesn’t currently allow adding a LayerNorm between hidden layers.
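
Because the MLP is square and adds a residual connection, the output shape matches the input shape; a hedged sketch (the dimensions are illustrative):

>>> import torch
>>> mlp = ResidualMLP(input_dim=768, hidden_dims=[3072])
>>> out = mlp(torch.rand(2, 20, 768))   # same shape as the input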

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

pytext.models.representations.transformer.sentence_encoder module

class pytext.models.representations.transformer.sentence_encoder.PostEncoder(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)[source]

Bases: pytext.models.representations.transformer.sentence_encoder.SentenceEncoder

extract_features(tokens: torch.Tensor, dense: List[torch.Tensor])[source]
forward(tokens: torch.Tensor, dense: List[torch.Tensor])[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.sentence_encoder.SentenceEncoder(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)[source]

Bases: torch.nn.modules.module.Module

This is a TorchScriptable implementation of RoBERTa from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa model, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

This SentenceEncoder can load in the public RoBERTa weights directly with load_roberta_state_dict, which will translate the keys as they exist in the publicly released RoBERTa weights to the correct structure for this implementation. The default constructor values will produce an encoder with the same size and shape as that model.

To use RoBERTa with this, download the RoBERTa public weights as roberta.weights

>>> encoder = SentenceEncoder()
>>> weights = torch.load("roberta.weights")
>>> encoder.load_roberta_state_dict(weights)

You will still need to preprocess inputs using fairseq and the publicly released vocabs, and finally place this encoder in a model alongside, say, an MLP output layer to do classification.
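
For instance, a classification head to pair with the encoder could be as simple as the following (the 768-dimensional input and two-class output are illustrative assumptions, and how the encoder's output is pooled before the head depends on the surrounding model):

>>> import torch
>>> classification_head = torch.nn.Sequential(
...     torch.nn.Linear(768, 768),
...     torch.nn.Tanh(),
...     torch.nn.Linear(768, 2),
... )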

extract_features(tokens)[source]
forward(tokens)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

load_roberta_state_dict(state_dict)[source]
pytext.models.representations.transformer.sentence_encoder.check_state_keys(state, keys_regex)[source]

Check if keys exist in state using full Python paths.

pytext.models.representations.transformer.sentence_encoder.merge_input_projection(state)[source]

Newer fairseq multihead attention checkpoints split in_projections into separate k, v, and q projections. This function merges them back to make those checkpoints compatible with this implementation.

pytext.models.representations.transformer.sentence_encoder.remove_state_keys(state, keys_regex)[source]

Remove keys from state that match a regex

pytext.models.representations.transformer.sentence_encoder.rename_component_from_root(state, old_name, new_name)[source]

Rename keys from state using full Python paths.

pytext.models.representations.transformer.sentence_encoder.rename_state_keys(state, keys_regex, replacement)[source]

Rename keys from state that match a regex; the replacement can use capture groups.
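
These helpers operate on ordinary state-dict keys with Python regexes; the kind of rewriting they perform looks like this (the key below is a made-up example, and re.sub is used only to illustrate the idea rather than the helpers' exact signatures):

>>> import re
>>> old_key = "decoder.sentence_encoder.layers.0.self_attn.out_proj.weight"
>>> re.sub(r"^decoder\.sentence_encoder\.", "transformer.", old_key)
'transformer.layers.0.self_attn.out_proj.weight'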

pytext.models.representations.transformer.sentence_encoder.translate_roberta_state_dict(state_dict)[source]

Translate the public RoBERTa weights to ones which match SentenceEncoder.

pytext.models.representations.transformer.transformer module

class pytext.models.representations.transformer.transformer.SELFIETransformer(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)[source]

Bases: pytext.models.representations.transformer.transformer.Transformer

forward(tokens: torch.Tensor, dense: List[torch.Tensor]) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.transformer.Transformer(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

forward(tokens: torch.Tensor) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.transformer.TransformerLayer(embedding_dim: int = 768, attention: Optional[pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention] = None, residual_mlp: Optional[pytext.models.representations.transformer.residual_mlp.ResidualMLP] = None, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

forward(input, key_padding_mask)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
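
The constructor signatures above are enough to assemble a small transformer explicitly; a hedged sketch (two layers and the dimensions are illustrative, and the defaults already match the public RoBERTa configuration):

>>> import torch
>>> layers = [
...     TransformerLayer(
...         embedding_dim=768,
...         attention=MultiheadSelfAttention(embed_dim=768, num_heads=12),
...         residual_mlp=ResidualMLP(input_dim=768, hidden_dims=[3072]),
...     )
...     for _ in range(2)
... ]
>>> transformer = Transformer(layers=layers)
>>> states = transformer(torch.tensor([[0, 9, 7, 2, 1]]))   # a List[Tensor] of intermediate states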

Module contents

This directory contains modules for implementing a productionized RoBERTa model. These modules implement the same Transformer components that are implemented in the fairseq library; however, they're distilled down to just the elements which are used in the final RoBERTa model, and within that are restructured and rewritten to be able to be compiled by TorchScript for production use cases.

The SentenceEncoder specifically can be used to load model weights directly from the publicly released RoBERTa weights, and it will translate these weights to the corresponding values in this implementation.

class pytext.models.representations.transformer.MultiheadLinearAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1, compress_layer=None, bias: bool = True)[source]

Bases: torch.nn.modules.module.Module

This is a TorchScriptable implementation of MultiheadLinearAttention (https://arxiv.org/pdf/2006.04768.pdf) from fairseq for the purposes of creating a productionized Linformer model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadLinearAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there’s no need to change them.

forward(query, key_padding_mask)[source]

Input shape: Time x Batch x Channel. Timesteps can be masked by supplying a T x T mask in the attn_mask argument. Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape: batch x source_length, where padding elements are indicated by 1s.

get_compressed_projection(k_input: torch.Tensor, v_input: torch.Tensor, target_length: int) → Tuple[torch.Tensor, torch.Tensor][source]
prune_multi_linear_heads(heads: List[int])[source]
class pytext.models.representations.transformer.QuantizedMultiheadLinearAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1, compress_layer=None, bias: bool = True)[source]

Bases: pytext.models.representations.transformer.multihead_linear_attention.MultiheadLinearAttention

get_compressed_projection(k_input: torch.Tensor, v_input: torch.Tensor, target_length: int) → Tuple[torch.Tensor, torch.Tensor][source]
class pytext.models.representations.transformer.MultiheadSelfAttention(embed_dim: int, num_heads: int, scaling: float = 0.125, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

This is a TorchScriptable implementation of MultiheadAttention from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of MultiheadAttention, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

The default constructor values match those required to import the public RoBERTa weights. Unless you are pretraining your own model, there’s no need to change them.

forward(query, key_padding_mask)[source]

Input shape: Time x Batch x Channel. Timesteps can be masked by supplying a T x T mask in the attn_mask argument. Padding elements can be excluded from the key by passing a binary ByteTensor (key_padding_mask) with shape: batch x source_length, where padding elements are indicated by 1s.

prune_multi_heads(heads: List[int])[source]
class pytext.models.representations.transformer.PositionalEmbedding(num_embeddings: int, embedding_dim: int, pad_index: Optional[int] = None)[source]

Bases: torch.nn.modules.module.Module

This module learns positional embeddings up to a fixed maximum size. Padding ids are ignored by either offsetting based on pad_index or by setting pad_index to None and ensuring that the appropriate position ids are passed to the forward function.

This is a TorchScriptable implementation of PositionalEmbedding from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa use cases of PositionalEmbedding, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

forward(input)[source]

Input is expected to be of size [batch_size x sequence_length].

max_positions()[source]

Maximum number of supported positions.

class pytext.models.representations.transformer.ResidualMLP(input_dim: int, hidden_dims: List[int], dropout: float = 0.1, activation=<class 'pytext.models.representations.transformer.residual_mlp.GeLU'>)[source]

Bases: torch.nn.modules.module.Module

A square MLP component which can learn a bias on an input vector. This MLP in particular defaults to using GeLU as its activation function (this can be changed by passing a different activation function), and retains a residual connection to its original input to help with gradient propagation.

Unlike pytext’s MLPDecoder, it doesn’t currently allow adding a LayerNorm between hidden layers.

forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.SentenceEncoder(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)[source]

Bases: torch.nn.modules.module.Module

This is a TorchScriptable implementation of RoBERTa from fairseq for the purposes of creating a productionized RoBERTa model. It distills just the elements which are required to implement the RoBERTa model, and within that is restructured and rewritten to be able to be compiled by TorchScript for production use cases.

This SentenceEncoder can load in the public RoBERTa weights directly with load_roberta_state_dict, which will translate the keys as they exist in the publicly released RoBERTa weights to the correct structure for this implementation. The default constructor values will produce an encoder with the same size and shape as that model.

To use RoBERTa with this, download the RoBERTa public weights as roberta.weights

>>> encoder = SentenceEncoder()
>>> weights = torch.load("roberta.weights")
>>> encoder.load_roberta_state_dict(weights)

You will still need to preprocess inputs using fairseq and the publicly released vocabs, and finally place this encoder in a model alongside, say, an MLP output layer to do classification.

extract_features(tokens)[source]
forward(tokens)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

load_roberta_state_dict(state_dict)[source]
class pytext.models.representations.transformer.PostEncoder(transformer: Optional[pytext.models.representations.transformer.transformer.Transformer] = None)[source]

Bases: pytext.models.representations.transformer.sentence_encoder.SentenceEncoder

extract_features(tokens: torch.Tensor, dense: List[torch.Tensor])[source]
forward(tokens: torch.Tensor, dense: List[torch.Tensor])[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.SELFIETransformer(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)[source]

Bases: pytext.models.representations.transformer.transformer.Transformer

forward(tokens: torch.Tensor, dense: List[torch.Tensor]) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.Transformer(vocab_size: int = 50265, embedding_dim: int = 768, padding_idx: int = 1, max_seq_len: int = 514, layers: List[pytext.models.representations.transformer.transformer.TransformerLayer] = (), dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

forward(tokens: torch.Tensor) → List[torch.Tensor][source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.TransformerLayer(embedding_dim: int = 768, attention: Optional[pytext.models.representations.transformer.multihead_attention.MultiheadSelfAttention] = None, residual_mlp: Optional[pytext.models.representations.transformer.residual_mlp.ResidualMLP] = None, dropout: float = 0.1)[source]

Bases: torch.nn.modules.module.Module

forward(input, key_padding_mask)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class pytext.models.representations.transformer.TransformerRepresentation(config: pytext.models.representations.transformer.representation.TransformerRepresentation.Config, embed_dim: int)[source]

Bases: pytext.models.module.Module

Representation consisting of stacked multi-head self-attention and position-wise feed-forward layers. Unlike Transformer, we assume inputs are already embedded; this representation can therefore be used as a drop-in replacement for other temporal representations over text inputs (e.g., BiLSTM and DeepCNNRepresentation).

forward(embedded_tokens: torch.Tensor, padding_mask: torch.Tensor) → torch.Tensor[source]

Forward inputs through the transformer layers.

Parameters:
  • embedded_tokens (B x T x H) – Tokens previously encoded with token, positional, and segment embeddings.
  • padding_mask (B x T) – Boolean mask specifying token positions that self-attention should not operate on.
Returns:

Final transformer layer state.

Return type:

last_state (B x T x H)