pytext.data.tokenizers package
Submodules
pytext.data.tokenizers.tokenizer module
class pytext.data.tokenizers.tokenizer.BERTInitialTokenizer(basic_tokenizer)[source]
    Bases: pytext.data.tokenizers.tokenizer.Tokenizer

    Basic initial tokenization for BERT. This is run prior to WordPiece; it does whitespace tokenization, plus lower-casing and accent removal if specified.
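    For illustration, a minimal sketch of what this initial pass does; the helper name is hypothetical and not part of the PyText API, and accent stripping is shown via NFD decomposition:

        import unicodedata

        def basic_bert_pretokenize(text: str, lowercase: bool = True) -> list:
            # Hypothetical sketch: whitespace split, lower-casing, and accent
            # removal, mirroring the step that runs before WordPiece.
            if lowercase:
                text = text.lower()
                # Decompose to NFD and drop combining marks (accents).
                text = "".join(
                    ch for ch in unicodedata.normalize("NFD", text)
                    if unicodedata.category(ch) != "Mn"
                )
            return text.split()

        print(basic_bert_pretokenize("Café Déjà Vu"))  # ['cafe', 'deja', 'vu']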
class pytext.data.tokenizers.tokenizer.CppProcessorMixin[source]
    Bases: object

    C++ processors like SentencePiece don't pickle well; this mixin reloads them after unpickling.
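    The general pattern is to drop the C++ handle from the pickled state and rebuild it when the object is restored. A self-contained sketch of that idea (the class and attribute names are illustrative, not PyText's):

        import pickle

        class CppReloadMixin:
            # Illustrative: exclude the unpicklable C++ object from the state.
            def __getstate__(self):
                state = self.__dict__.copy()
                state.pop("processor", None)
                return state

            # Rebuild the C++ object from the surviving state (e.g. a model path).
            def __setstate__(self, state):
                self.__dict__.update(state)
                self._load_processor()

        class FakeSpTokenizer(CppReloadMixin):
            def __init__(self, model_path):
                self.model_path = model_path
                self._load_processor()

            def _load_processor(self):
                # Stand-in for sentencepiece.SentencePieceProcessor(model_path).
                self.processor = object()

        tok = pickle.loads(pickle.dumps(FakeSpTokenizer("model.sp")))
        assert tok.processor is not None  # rebuilt after the round trip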
class pytext.data.tokenizers.tokenizer.DoNothingTokenizer[source]
    Bases: pytext.data.tokenizers.tokenizer.Tokenizer

    Tokenizer that takes a list of strings and converts it to a list of Tokens. Useful in cases where tokenization has been run beforehand.
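    Conceptually, the pass-through amounts to wrapping each pre-tokenized string in a Token; since no character offsets can be recovered here, the placeholder offsets below are an assumption for illustration, not the class's actual implementation:

        from pytext.data.tokenizers.tokenizer import Token

        def passthrough(pieces):
            # Illustrative sketch: wrap pre-tokenized strings as Tokens
            # with dummy offsets, since none can be recovered.
            return [Token(p, -1, -1) for p in pieces]

        print(passthrough(["hello", "world"]))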
class pytext.data.tokenizers.tokenizer.GPT2BPETokenizer(bpe: fairseq.data.encoders.gpt2_bpe_utils.Encoder, lowercase: bool = False)[source]
    Bases: pytext.data.tokenizers.tokenizer.Tokenizer

    Tokenizer for GPT-2 and RoBERTa.
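    A hedged usage sketch: the constructor takes a fairseq BPE Encoder, which can typically be built from the standard GPT-2 encoder.json and vocab.bpe release files (the file paths here are illustrative):

        from fairseq.data.encoders.gpt2_bpe_utils import get_encoder
        from pytext.data.tokenizers.tokenizer import GPT2BPETokenizer

        # Assumes encoder.json and vocab.bpe have been downloaded locally.
        bpe = get_encoder("encoder.json", "vocab.bpe")
        tokenizer = GPT2BPETokenizer(bpe, lowercase=False)
        print([t.value for t in tokenizer.tokenize("PyText tokenizers")])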
class pytext.data.tokenizers.tokenizer.PickleableGPT2BPEEncoder(encoder, bpe_merges, errors='replace')[source]
    Bases: fairseq.data.encoders.gpt2_bpe_utils.Encoder

    Fairseq's encoder stores the regex module as a local reference on its encoders, which means they can't be saved via pickle.dumps or torch.save. This subclass modifies the save/load logic so the module is not stored, and restores the reference after re-inflating.
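    The underlying problem and fix can be shown in isolation: module objects cannot be pickled, so the reference is excluded from the state and re-established on load. A self-contained sketch using the standard re module as a stand-in:

        import pickle
        import re

        class RegexHolder:
            def __init__(self):
                self.re = re  # module stored as an attribute, like fairseq's encoder

            def __getstate__(self):
                state = self.__dict__.copy()
                del state["re"]  # modules cannot be pickled
                return state

            def __setstate__(self, state):
                self.__dict__.update(state)
                self.re = re  # restore the reference after re-inflating

            def words(self, text):
                return self.re.findall(r"\w+", text)

        obj = pickle.loads(pickle.dumps(RegexHolder()))
        print(obj.words("survives a pickle round trip"))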
class pytext.data.tokenizers.tokenizer.SentencePieceTokenizer(sp_model_path: str = '', max_input_text_length: Optional[int] = None)[source]
    Bases: pytext.data.tokenizers.tokenizer.Tokenizer, pytext.data.tokenizers.tokenizer.CppProcessorMixin

    SentencePiece tokenizer.
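    For context, a hedged sketch of the underlying SentencePiece call this tokenizer wraps; it assumes the sentencepiece package is installed and a trained model file exists at the illustrative path:

        import sentencepiece as spm

        sp = spm.SentencePieceProcessor()
        sp.Load("/path/to/model.sp")  # illustrative path
        print(sp.EncodeAsPieces("hello world"))  # e.g. ['▁hello', '▁world'], model-dependent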
class pytext.data.tokenizers.tokenizer.Token(value, start, end)[source]
    Bases: tuple

    end: Alias for field number 2
    start: Alias for field number 1
    value: Alias for field number 0
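    Since Token is a namedtuple, fields can be read by name or by position, and instances unpack like any tuple (assumes pytext is installed):

        from pytext.data.tokenizers.tokenizer import Token

        tok = Token(value="hello", start=0, end=5)
        assert tok.value == tok[0] == "hello"
        assert tok.start == tok[1] == 0
        assert tok.end == tok[2] == 5
        value, start, end = tok  # plain tuple unpacking also works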
class
pytext.data.tokenizers.tokenizer.
Tokenizer
(split_regex='\s+', lowercase=True, use_byte_offsets=False)[source]¶ Bases:
pytext.config.component.Component
A simple regex-splitting tokenizer.
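    To make the behavior concrete, a minimal stand-alone sketch of regex-splitting tokenization with character offsets; the function is illustrative, not PyText's implementation (use_byte_offsets would switch the offsets from characters to bytes):

        import re
        from collections import namedtuple

        Token = namedtuple("Token", ["value", "start", "end"])

        def regex_tokenize(text, split_regex=r"\s+", lowercase=True):
            # Everything between separator matches becomes a Token that
            # carries its character offsets into the original string.
            tokens, pos = [], 0
            for m in re.finditer(split_regex, text):
                if m.start() > pos:
                    v = text[pos:m.start()]
                    tokens.append(Token(v.lower() if lowercase else v, pos, m.start()))
                pos = m.end()
            if pos < len(text):
                v = text[pos:]
                tokens.append(Token(v.lower() if lowercase else v, pos, len(text)))
            return tokens

        print(regex_tokenize("Hello  World"))
        # [Token(value='hello', start=0, end=5), Token(value='world', start=7, end=12)]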
class pytext.data.tokenizers.tokenizer.WordPieceTokenizer(wordpiece_vocab, basic_tokenizer, wordpiece_tokenizer)[source]
    Bases: pytext.data.tokenizers.tokenizer.Tokenizer

    WordPiece tokenizer for BERT models.
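    WordPiece splits each word by greedy longest-match-first lookup against the subword vocabulary, with continuation pieces prefixed by "##". A self-contained sketch of that algorithm (the tiny vocabulary is made up for the example):

        def wordpiece_split(word, vocab, unk="[UNK]"):
            # Greedy longest-match-first: repeatedly take the longest vocab
            # entry matching a prefix of the remaining word.
            pieces, start = [], 0
            while start < len(word):
                end, piece = len(word), None
                while start < end:
                    candidate = word[start:end]
                    if start > 0:
                        candidate = "##" + candidate
                    if candidate in vocab:
                        piece = candidate
                        break
                    end -= 1
                if piece is None:
                    return [unk]  # no prefix matched: whole word is unknown
                pieces.append(piece)
                start = end
            return pieces

        vocab = {"un", "##aff", "##able"}
        print(wordpiece_split("unaffable", vocab))  # ['un', '##aff', '##able']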
Module contents
class pytext.data.tokenizers.GPT2BPETokenizer(bpe: fairseq.data.encoders.gpt2_bpe_utils.Encoder, lowercase: bool = False)[source]
    Bases: pytext.data.tokenizers.tokenizer.Tokenizer

    Tokenizer for GPT-2 and RoBERTa.
class pytext.data.tokenizers.Token(value, start, end)[source]
    Bases: tuple

    end: Alias for field number 2
    start: Alias for field number 1
    value: Alias for field number 0
class
pytext.data.tokenizers.
Tokenizer
(split_regex='\s+', lowercase=True, use_byte_offsets=False)[source]¶ Bases:
pytext.config.component.Component
A simple regex-splitting tokenizer.
class pytext.data.tokenizers.DoNothingTokenizer[source]
    Bases: pytext.data.tokenizers.tokenizer.Tokenizer

    Tokenizer that takes a list of strings and converts it to a list of Tokens. Useful in cases where tokenization has been run beforehand.
class pytext.data.tokenizers.WordPieceTokenizer(wordpiece_vocab, basic_tokenizer, wordpiece_tokenizer)[source]
    Bases: pytext.data.tokenizers.tokenizer.Tokenizer

    WordPiece tokenizer for BERT models.
class pytext.data.tokenizers.CppProcessorMixin[source]
    Bases: object

    C++ processors like SentencePiece don't pickle well; this mixin reloads them after unpickling.
class pytext.data.tokenizers.SentencePieceTokenizer(sp_model_path: str = '', max_input_text_length: Optional[int] = None)[source]
    Bases: pytext.data.tokenizers.tokenizer.Tokenizer, pytext.data.tokenizers.tokenizer.CppProcessorMixin

    SentencePiece tokenizer.