pytext.data.tokenizers package

Submodules

pytext.data.tokenizers.tokenizer module

class pytext.data.tokenizers.tokenizer.BERTInitialTokenizer(basic_tokenizer)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Basic initial tokenization for BERT. This runs prior to word piece tokenization and performs whitespace tokenization, along with lower-casing and accent removal if specified.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.BERTInitialTokenizer.Config)[source]
tokenize(text)[source]

Tokenizes a piece of text.
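
A minimal usage sketch, assuming the wrapped basic_tokenizer is the BasicTokenizer from pytorch_pretrained_bert (a common PyText setup; the argument and sample text here are illustrative):

    from pytorch_pretrained_bert.tokenization import BasicTokenizer
    from pytext.data.tokenizers.tokenizer import BERTInitialTokenizer

    # BasicTokenizer handles whitespace splitting, lower-casing and accent removal.
    basic = BasicTokenizer(do_lower_case=True)
    tokenizer = BERTInitialTokenizer(basic)
    tokens = tokenizer.tokenize("Héllo, World!")  # run before word piece tokenization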

class pytext.data.tokenizers.tokenizer.CppProcessorMixin[source]

Bases: object

C++ processors like SentencePiece don’t pickle well; reload them after unpickling.
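
A generic sketch of the reload-on-unpickle pattern this mixin stands for (class and method names below are illustrative, not the mixin’s exact API):

    import pickle

    class ReloadsProcessor:
        """Generic sketch: rebuild the C++-backed processor after unpickling
        instead of trying to pickle the processor object itself."""

        def __init__(self, model_path):
            self.model_path = model_path
            self._load_processor()

        def _load_processor(self):
            # Stand-in for e.g. loading a SentencePiece model from model_path.
            self.processor = "loaded processor"

        def __getstate__(self):
            return {"model_path": self.model_path}  # drop the unpicklable processor

        def __setstate__(self, state):
            self.__dict__.update(state)
            self._load_processor()  # rebuild from the saved path

    restored = pickle.loads(pickle.dumps(ReloadsProcessor("/path/to/model")))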

class pytext.data.tokenizers.tokenizer.DoNothingTokenizer[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Tokenizer that takes a list of strings and converts it to a list of Tokens. Useful in cases where the tokenizer has already been run beforehand.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.DoNothingTokenizer.Config)[source]
tokenize(tokens: Union[List[str], str]) → List[pytext.data.tokenizers.tokenizer.Token][source]
torchscriptify()[source]
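
A quick sketch of passing pre-tokenized input through it (the start/end offsets on the returned Tokens may be placeholders, since no real tokenization is performed):

    from pytext.data.tokenizers.tokenizer import DoNothingTokenizer

    tokenizer = DoNothingTokenizer()
    tokens = tokenizer.tokenize(["already", "tokenized", "text"])
    print([t.value for t in tokens])  # ['already', 'tokenized', 'text']
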
class pytext.data.tokenizers.tokenizer.GPT2BPETokenizer(bpe: fairseq.data.encoders.gpt2_bpe_utils.Encoder, lowercase: bool = False)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Tokenizer for GPT-2 and RoBERTa.

decode(sentence: str)[source]
classmethod from_config(config: pytext.data.tokenizers.tokenizer.GPT2BPETokenizer.Config)[source]
tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
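
A hedged construction sketch that builds the fairseq BPE encoder by hand from local encoder.json / vocab.bpe files (the paths are placeholders; from_config would normally build the encoder from configured paths):

    from fairseq.data.encoders.gpt2_bpe_utils import get_encoder
    from pytext.data.tokenizers.tokenizer import GPT2BPETokenizer

    bpe = get_encoder("/path/to/encoder.json", "/path/to/vocab.bpe")
    tokenizer = GPT2BPETokenizer(bpe, lowercase=False)
    tokens = tokenizer.tokenize("PyText tokenizers")  # List[Token] of BPE sub-word units
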
class pytext.data.tokenizers.tokenizer.PickleableGPT2BPEEncoder(encoder, bpe_merges, errors='replace')[source]

Bases: fairseq.data.encoders.gpt2_bpe_utils.Encoder

Fairseq’s encoder stores the regex module as a local reference on its encoders, which means they can’t be saved via pickle.dumps or torch.save. This subclass modifies the save/load logic so that the module is not stored, and restores the reference after re-inflating.
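
The save/load trick it relies on looks roughly like this (an illustration only, using the standard-library re module in place of the third-party regex module fairseq actually stores):

    import pickle
    import re

    class PickleableWrapper:
        def __init__(self):
            self.re = re  # module objects cannot be pickled

        def __getstate__(self):
            state = self.__dict__.copy()
            del state["re"]  # strip the module reference before saving
            return state

        def __setstate__(self, state):
            self.__dict__.update(state)
            self.re = re  # restore the reference after re-inflating

    restored = pickle.loads(pickle.dumps(PickleableWrapper()))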

class pytext.data.tokenizers.tokenizer.SentencePieceTokenizer(sp_model_path: str = '', max_input_text_length: Optional[int] = None)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer, pytext.data.tokenizers.tokenizer.CppProcessorMixin

Sentence piece tokenizer.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.SentencePieceTokenizer.Config)[source]
tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
torchscriptify()[source]
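
A minimal sketch, assuming a trained SentencePiece model file is available on disk (the path is a placeholder):

    from pytext.data.tokenizers.tokenizer import SentencePieceTokenizer

    tokenizer = SentencePieceTokenizer(sp_model_path="/path/to/spm.model")
    tokens = tokenizer.tokenize("the quick brown fox")
    print([t.value for t in tokens])  # sub-word pieces, e.g. ['▁the', '▁quick', ...]
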
class pytext.data.tokenizers.tokenizer.Token(value, start, end)[source]

Bases: tuple

end

Alias for field number 2

start

Alias for field number 1

value

Alias for field number 0
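
Token is a named tuple of (value, start, end), where start and end are offsets into the original text. For example:

    from pytext.data.tokenizers.tokenizer import Token

    token = Token("hello", 0, 5)
    assert token.value == "hello"              # field 0: the token text
    assert (token.start, token.end) == (0, 5)  # fields 1 and 2: offsets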

class pytext.data.tokenizers.tokenizer.Tokenizer(split_regex='\s+', lowercase=True, use_byte_offsets=False)[source]

Bases: pytext.config.component.Component

A simple regex-splitting tokenizer.

decode(sentence: str)[source]
classmethod from_config(config: pytext.data.tokenizers.tokenizer.Tokenizer.Config)[source]
tokenize(input: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
torchscriptify()[source]
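
A small usage sketch of the default whitespace-splitting behaviour (the output shown in the comment is indicative):

    from pytext.data.tokenizers.tokenizer import Tokenizer

    tokenizer = Tokenizer(split_regex=r"\s+", lowercase=True)
    for token in tokenizer.tokenize("Split THIS sentence"):
        print(token.value, token.start, token.end)
    # lower-cased values with start/end offsets into the original string,
    # e.g. "split 0 5"
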
class pytext.data.tokenizers.tokenizer.WordPieceTokenizer(wordpiece_vocab, basic_tokenizer, wordpiece_tokenizer)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Word piece tokenizer for BERT models.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.WordPieceTokenizer.Config)[source]
static load_vocab(vocab_file)[source]

Loads a vocabulary file into a dictionary.

tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
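
A hedged wiring sketch using the helper classes from pytorch_pretrained_bert (from_config would normally build these; the vocab path is a placeholder):

    from pytorch_pretrained_bert.tokenization import BasicTokenizer, WordpieceTokenizer
    from pytext.data.tokenizers.tokenizer import WordPieceTokenizer

    vocab = WordPieceTokenizer.load_vocab("/path/to/bert-vocab.txt")
    tokenizer = WordPieceTokenizer(
        wordpiece_vocab=vocab,
        basic_tokenizer=BasicTokenizer(do_lower_case=True),
        wordpiece_tokenizer=WordpieceTokenizer(vocab=vocab),
    )
    tokens = tokenizer.tokenize("unaffable")  # sub-word pieces such as 'un', '##aff', '##able'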

Module contents

class pytext.data.tokenizers.GPT2BPETokenizer(bpe: fairseq.data.encoders.gpt2_bpe_utils.Encoder, lowercase: bool = False)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Tokenizer for GPT-2 and RoBERTa.

decode(sentence: str)[source]
classmethod from_config(config: pytext.data.tokenizers.tokenizer.GPT2BPETokenizer.Config)[source]
tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
class pytext.data.tokenizers.Token(value, start, end)[source]

Bases: tuple

end

Alias for field number 2

start

Alias for field number 1

value

Alias for field number 0

class pytext.data.tokenizers.Tokenizer(split_regex='\s+', lowercase=True, use_byte_offsets=False)[source]

Bases: pytext.config.component.Component

A simple regex-splitting tokenizer.

decode(sentence: str)[source]
classmethod from_config(config: pytext.data.tokenizers.tokenizer.Tokenizer.Config)[source]
tokenize(input: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
torchscriptify()[source]
class pytext.data.tokenizers.DoNothingTokenizer[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Tokenizer that takes a list of strings and converts it to a list of Tokens. Useful in cases where the tokenizer has already been run beforehand.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.DoNothingTokenizer.Config)[source]
tokenize(tokens: Union[List[str], str]) → List[pytext.data.tokenizers.tokenizer.Token][source]
torchscriptify()[source]
class pytext.data.tokenizers.WordPieceTokenizer(wordpiece_vocab, basic_tokenizer, wordpiece_tokenizer)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer

Word piece tokenizer for BERT models.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.WordPieceTokenizer.Config)[source]
static load_vocab(vocab_file)[source]

Loads a vocabulary file into a dictionary.

tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
class pytext.data.tokenizers.CppProcessorMixin[source]

Bases: object

C++ processors like SentencePiece don’t pickle well; reload them after unpickling.

class pytext.data.tokenizers.SentencePieceTokenizer(sp_model_path: str = '', max_input_text_length: Optional[int] = None)[source]

Bases: pytext.data.tokenizers.tokenizer.Tokenizer, pytext.data.tokenizers.tokenizer.CppProcessorMixin

Sentence piece tokenizer.

classmethod from_config(config: pytext.data.tokenizers.tokenizer.SentencePieceTokenizer.Config)[source]
tokenize(input_str: str) → List[pytext.data.tokenizers.tokenizer.Token][source]
torchscriptify()[source]
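
Since these classes are re-exported at the package level, they can also be imported without referencing the tokenizer submodule:

    from pytext.data.tokenizers import Token, Tokenizer, SentencePieceTokenizer, WordPieceTokenizer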