pytext.data.tokenizers package
Submodules
pytext.data.tokenizers.tokenizer module
-
class pytext.data.tokenizers.tokenizer.BERTInitialTokenizer(basic_tokenizer)
Bases: pytext.data.tokenizers.tokenizer.Tokenizer
Basic initial tokenization for BERT. This runs before WordPiece tokenization and performs whitespace tokenization, plus lower-casing and accent removal if specified.
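The three steps named in this description (whitespace splitting, lower-casing, accent removal) can be illustrated with the standard library alone. This is a minimal sketch, not PyText's implementation: the function names `strip_accents` and `basic_tokenize` and the `lowercase` / `remove_accents` flags are hypothetical, and accent removal is assumed to mean Unicode NFD decomposition followed by dropping combining marks.

```python
import unicodedata
from typing import List


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def basic_tokenize(
    text: str, lowercase: bool = True, remove_accents: bool = True
) -> List[str]:
    """Hypothetical pre-WordPiece step: normalize, then split on whitespace."""
    if lowercase:
        text = text.lower()
    if remove_accents:
        text = strip_accents(text)
    # str.split() with no argument splits on runs of whitespace.
    return text.split()


pieces = basic_tokenize("Caf\u00e9  Ol\u00e9")
```

A word piece tokenizer would then operate on each element of `pieces` separately.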
-
class pytext.data.tokenizers.tokenizer.CppProcessorMixin
Bases: object
C++ processors such as SentencePiece don't pickle well, so reload them instead.
-
class pytext.data.tokenizers.tokenizer.DoNothingTokenizer
Bases: pytext.data.tokenizers.tokenizer.Tokenizer
Tokenizer that takes a list of strings and converts it to a list of Tokens. Useful when tokenization has already been run beforehand.
-
class pytext.data.tokenizers.tokenizer.GPT2BPETokenizer(bpe: fairseq.data.encoders.gpt2_bpe_utils.Encoder, lowercase: bool = False)
Bases: pytext.data.tokenizers.tokenizer.Tokenizer
Tokenizer for GPT-2 and RoBERTa.
-
class pytext.data.tokenizers.tokenizer.PickleableGPT2BPEEncoder(encoder, bpe_merges, errors='replace')
Bases: fairseq.data.encoders.gpt2_bpe_utils.Encoder
Fairseq's encoder stores the regex module as a local reference on its encoders, which means they can't be saved via pickle.dumps or torch.save. This class modifies the save/load logic so that the module is not stored, and restores the reference after re-inflating.
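The save/load behaviour described here follows Python's general `__getstate__` / `__setstate__` pattern. The sketch below shows that pattern under stated assumptions and is not the PyText or fairseq code: `ModuleHoldingEncoder` and `PickleableEncoder` are hypothetical classes, the attribute name `re` is invented for illustration, and the standard-library `re` module stands in for whichever regex module the real encoder holds.

```python
import pickle
import re  # stand-in for the regex module held by the real encoder


class ModuleHoldingEncoder:
    """Hypothetical encoder that keeps a module reference on the instance."""

    def __init__(self, pattern: str):
        self.re = re  # module objects cannot be pickled
        self.pattern = pattern

    def split(self, text: str):
        return self.re.findall(self.pattern, text)


class PickleableEncoder(ModuleHoldingEncoder):
    """Drops the module reference when saving and restores it when loading."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("re", None)  # do not store the module
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.re = re  # restore the reference after re-inflating


encoder = PickleableEncoder(r"\w+")
restored = pickle.loads(pickle.dumps(encoder))
```

Pickling a plain `ModuleHoldingEncoder` instance raises `TypeError`, because the instance dictionary contains a module object; the subclass avoids this by never placing the module in the saved state.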
-
class pytext.data.tokenizers.tokenizer.SentencePieceTokenizer(sp_model_path: str = '', max_input_text_length: Optional[int] = None)
Bases: pytext.data.tokenizers.tokenizer.Tokenizer, pytext.data.tokenizers.tokenizer.CppProcessorMixin
SentencePiece tokenizer.
-
class pytext.data.tokenizers.tokenizer.Token(value, start, end)
Bases: tuple
-
end: Alias for field number 2
-
start: Alias for field number 1
-
value: Alias for field number 0
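The field aliases listed above mean a Token can be read either by name or by tuple position. The following sketch uses a stand-in `Token` built with `collections.namedtuple` so that it runs without PyText installed; that the real class behaves the same way is an assumption based only on the signature and the "Alias for field number" entries.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented signature.
Token = namedtuple("Token", ["value", "start", "end"])

tok = Token("hello", 0, 5)

# Each named field is an alias for a tuple index.
by_index = (tok[0], tok[1], tok[2])
by_name = (tok.value, tok.start, tok.end)

# Being a tuple, a Token also unpacks positionally.
value, start, end = tok
```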
-
class pytext.data.tokenizers.tokenizer.Tokenizer(split_regex='\s+', lowercase=True, use_byte_offsets=False)
Bases: pytext.config.component.Component
A simple regex-splitting tokenizer.
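To make "regex-splitting" concrete, here is a minimal sketch of splitting text on a separator pattern while recording offsets. It is an illustration, not the library's code: `regex_tokenize` is a hypothetical function, the local `Token` is a stand-in, offsets are assumed to be character positions in the original string with an exclusive end, and the `use_byte_offsets` option is not modelled.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    value: str
    start: int
    end: int


def regex_tokenize(
    text: str, split_regex: str = r"\s+", lowercase: bool = True
) -> List[Token]:
    """Split text on split_regex and record where each piece came from."""
    tokens = []
    position = 0

    def emit(start: int, end: int) -> None:
        piece = text[start:end]
        tokens.append(Token(piece.lower() if lowercase else piece, start, end))

    for match in re.finditer(split_regex, text):
        if match.start() > position:  # skip empty pieces between separators
            emit(position, match.start())
        position = match.end()
    if position < len(text):  # trailing piece after the last separator
        emit(position, len(text))
    return tokens


tokens = regex_tokenize("Hello  World")
```

Offsets are computed on the original text and only the token value is lower-cased, so `text[tok.start:tok.end]` always recovers the original spelling.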
-
class pytext.data.tokenizers.tokenizer.WordPieceTokenizer(wordpiece_vocab, basic_tokenizer, wordpiece_tokenizer)
Bases: pytext.data.tokenizers.tokenizer.Tokenizer
WordPiece tokenizer for BERT models.
Module contents
-
class pytext.data.tokenizers.GPT2BPETokenizer(bpe: fairseq.data.encoders.gpt2_bpe_utils.Encoder, lowercase: bool = False)
Bases: pytext.data.tokenizers.tokenizer.Tokenizer
Tokenizer for GPT-2 and RoBERTa.
-
class pytext.data.tokenizers.Token(value, start, end)
Bases: tuple
-
end: Alias for field number 2
-
start: Alias for field number 1
-
value: Alias for field number 0
-
class pytext.data.tokenizers.Tokenizer(split_regex='\s+', lowercase=True, use_byte_offsets=False)
Bases: pytext.config.component.Component
A simple regex-splitting tokenizer.
-
class pytext.data.tokenizers.DoNothingTokenizer
Bases: pytext.data.tokenizers.tokenizer.Tokenizer
Tokenizer that takes a list of strings and converts it to a list of Tokens. Useful when tokenization has already been run beforehand.
-
class pytext.data.tokenizers.WordPieceTokenizer(wordpiece_vocab, basic_tokenizer, wordpiece_tokenizer)
Bases: pytext.data.tokenizers.tokenizer.Tokenizer
WordPiece tokenizer for BERT models.
-
class pytext.data.tokenizers.CppProcessorMixin
Bases: object
C++ processors such as SentencePiece don't pickle well, so reload them instead.
-
class pytext.data.tokenizers.SentencePieceTokenizer(sp_model_path: str = '', max_input_text_length: Optional[int] = None)
Bases: pytext.data.tokenizers.tokenizer.Tokenizer, pytext.data.tokenizers.tokenizer.CppProcessorMixin
SentencePiece tokenizer.