pytext.torchscript.tokenizer package¶
Submodules¶
pytext.torchscript.tokenizer.bpe module¶
class pytext.torchscript.tokenizer.bpe.ScriptBPE(vocab: Dict[str, int], eow: str = '_EOW')[source]¶
Bases: torch.jit.ScriptModule
Byte-pair encoding implementation in TorchScript.
vocab_file should be a newline-separated file-like object, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore cannot contain whitespace (as matched by the Python regex \s). The vocab file should be sorted by token importance; tokens are merged in this priority order, and the actual score values are irrelevant.
eow_token should be a string that is appended to the last character of each word and to each end-of-word token; it is used at every step of the merge process and returned at the end. Set it to be consistent with the EOW marker used when you generated your ScriptBPE vocab file.
>>> import io
>>> vocab_file = io.StringIO('''
hello_EOW 20
world_EOW 18
th 17
is_EOW 16
bpe_EOW 15
! 14
h 13
t 6
s_EOW 2
i -1
ii -2
''')
>>> bpe = ScriptBPE.from_vocab_file(vocab_file)
>>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
>>> bpe.tokenize(["iiiis"])
["ii", "i", "is_EOW"]
classmethod from_vocab_file(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]¶
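The merge process described above can be sketched in plain Python. This is a hedged approximation, not the actual TorchScript implementation: it assumes the vocab maps each token to its priority rank (its line number in the vocab file, with lower ranks merging first) and greedily collapses the highest-priority contiguous run of pieces. It does reproduce the doctest examples above.

```python
from typing import Dict, List

def bpe_token(word: str, vocab: Dict[str, int], eow: str = "_EOW") -> List[str]:
    """Greedy BPE sketch: merge the contiguous run of pieces whose
    concatenation has the lowest (best) rank in the vocab, repeating
    until no run is in the vocab. Assumes a non-empty word."""
    # Start from single characters, with the end-of-word marker on the last one.
    parts = list(word[:-1]) + [word[-1] + eow]
    while True:
        best_rank, best_span = None, None
        # Consider every contiguous run of two or more current pieces.
        for i in range(len(parts) - 1):
            for j in range(i + 2, len(parts) + 1):
                candidate = "".join(parts[i:j])
                rank = vocab.get(candidate)
                if rank is not None and (best_rank is None or rank < best_rank):
                    best_rank, best_span = rank, (i, j)
        if best_span is None:
            return parts
        i, j = best_span
        parts[i:j] = ["".join(parts[i:j])]

def tokenize(words: List[str], vocab: Dict[str, int]) -> List[str]:
    return [piece for word in words for piece in bpe_token(word, vocab)]
```

With the doctest's vocab loaded as `{token: line_number}`, `tokenize(["this"], vocab)` yields `["th", "is_EOW"]`: the higher-priority `th` merges before `is_EOW`.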
pytext.torchscript.tokenizer.tokenizer module¶
class pytext.torchscript.tokenizer.tokenizer.ScriptBPETokenizer(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]¶
Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase
class pytext.torchscript.tokenizer.tokenizer.ScriptDoNothingTokenizer(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶
Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase
class pytext.torchscript.tokenizer.tokenizer.ScriptTextTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶
Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
input_type() → pytext.torchscript.utils.ScriptInputType[source]¶
Determine the TorchScript module input type. There are currently four types:
1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
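The four input shapes can be illustrated with hypothetical batches; the example strings below are made up and serve only to show the nesting each type expects.

```python
from typing import List

# text: one string per row.
text_batch: List[str] = ["hello world", "this is bpe"]

# tokens: one token list per row (one text each).
tokens_batch: List[List[str]] = [["hello", "world"], ["this", "is", "bpe"]]

# multi_text: several texts per row (e.g. a query/context pair).
multi_text_batch: List[List[str]] = [["query one", "context one"],
                                     ["query two", "context two"]]

# multi_tokens: several token lists per row, one per text.
multi_tokens_batch: List[List[List[str]]] = [
    [["query", "one"], ["context", "one"]],
    [["query", "two"], ["context", "two"]],
]
```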
class
pytext.torchscript.tokenizer.tokenizer.
ScriptTokenTokenizerBase
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
input_type() → pytext.torchscript.utils.ScriptInputType[source]¶
Determine the TorchScript module input type. There are currently four types:
1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
class
pytext.torchscript.tokenizer.tokenizer.
ScriptTokenizerBase
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
torch.jit.ScriptModule
input_type() → pytext.torchscript.utils.ScriptInputType[source]¶
Determine the TorchScript module input type. There are currently four types:
1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
Module contents¶
class pytext.torchscript.tokenizer.ScriptBPE(vocab: Dict[str, int], eow: str = '_EOW')[source]¶
Bases: torch.jit.ScriptModule
Byte-pair encoding implementation in TorchScript.
vocab_file should be a newline-separated file-like object, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore cannot contain whitespace (as matched by the Python regex \s). The vocab file should be sorted by token importance; tokens are merged in this priority order, and the actual score values are irrelevant.
eow_token should be a string that is appended to the last character of each word and to each end-of-word token; it is used at every step of the merge process and returned at the end. Set it to be consistent with the EOW marker used when you generated your ScriptBPE vocab file.
>>> import io
>>> vocab_file = io.StringIO('''
hello_EOW 20
world_EOW 18
th 17
is_EOW 16
bpe_EOW 15
! 14
h 13
t 6
s_EOW 2
i -1
ii -2
''')
>>> bpe = ScriptBPE.from_vocab_file(vocab_file)
>>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
>>> bpe.tokenize(["iiiis"])
["ii", "i", "is_EOW"]
classmethod from_vocab_file(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]¶
class pytext.torchscript.tokenizer.ScriptBPETokenizer(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]¶
Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase
class pytext.torchscript.tokenizer.ScriptDoNothingTokenizer(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶
Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase
class pytext.torchscript.tokenizer.ScriptTextTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶
Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
input_type() → pytext.torchscript.utils.ScriptInputType[source]¶
Determine the TorchScript module input type. There are currently four types:
1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
class
pytext.torchscript.tokenizer.
ScriptTokenTokenizerBase
(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]¶ Bases:
pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
input_type() → pytext.torchscript.utils.ScriptInputType[source]¶
Determine the TorchScript module input type. There are currently four types:
1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]