pytext.torchscript.tokenizer package
Submodules
pytext.torchscript.tokenizer.bpe module
class pytext.torchscript.tokenizer.bpe.ScriptBPE(vocab: Dict[str, int], eow: str = '_EOW')[source]
    Bases: torch.jit._script.ScriptModule

    Byte-pair encoding implementation in TorchScript.

    vocab_file should be a file-like object separated by newlines, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore can't contain whitespace (anything matching the python regex \s). The vocab file should be sorted by the importance of each token: merges are applied in this priority order, and the actual count values are irrelevant.

    eow_token should be a string appended to the last character of each word; it marks end-of-word pieces during merging and in the returned tokens. Set it to be consistent with the EOW marker used when your ScriptBPE vocab file was generated.

    >>> import io
    >>> vocab_file = io.StringIO('''
    ... hello_EOW 20
    ... world_EOW 18
    ... th 17
    ... is_EOW 16
    ... bpe_EOW 15
    ... ! 14
    ... h 13
    ... t 6
    ... s_EOW 2
    ... i -1
    ... ii -2
    ... ''')
    >>> bpe = ScriptBPE.from_vocab_file(vocab_file)
    >>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
    ["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
    >>> bpe.tokenize(["iiiis"])
    ["ii", "i", "is_EOW"]
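The greedy merge described above can be sketched in plain Python. This is an illustrative reimplementation under stated assumptions (priority equals line position in the vocab file, and the lowest-ranked mergeable pair is merged first), not pytext's actual TorchScript code; `bpe_token` and `VOCAB` are hypothetical names, and the real implementation may differ, e.g. in how pieces that never appear in the vocab are handled.

```python
# Priority = position in the vocab file (earlier line = higher priority);
# the counts in the file are ignored, per the docstring above.
VOCAB_LINES = ["hello_EOW", "world_EOW", "th", "is_EOW", "bpe_EOW",
               "!", "h", "t", "s_EOW", "i", "ii"]
VOCAB = {tok: rank for rank, tok in enumerate(VOCAB_LINES)}

def bpe_token(word: str, vocab: dict, eow: str = "_EOW") -> list:
    """Greedily merge adjacent pieces, best-ranked (earliest) merge first."""
    # Start from single characters, with eow appended to the last one.
    parts = list(word[:-1]) + [word[-1] + eow]
    while len(parts) > 1:
        # Find the adjacent pair whose merged form has the best vocab rank.
        best_i, best_rank = -1, len(vocab)
        for i in range(len(parts) - 1):
            rank = vocab.get(parts[i] + parts[i + 1], len(vocab))
            if rank < best_rank:
                best_i, best_rank = i, rank
        if best_i < 0:
            break  # no mergeable pair remains
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return parts
```

On the docstring's vocab this reproduces, for example, `bpe_token("iiiis", VOCAB)` → `["ii", "i", "is_EOW"]` and `bpe_token("this", VOCAB)` → `["th", "is_EOW"]`.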
    classmethod from_vocab_file(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]
pytext.torchscript.tokenizer.tokenizer module
class pytext.torchscript.tokenizer.tokenizer.ScriptBPETokenizer(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]
    Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
class pytext.torchscript.tokenizer.tokenizer.ScriptDoNothingTokenizer[source]
    Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
class pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase[source]
    Bases: torch.jit._script.ScriptModule
class pytext.torchscript.tokenizer.tokenizer.ScriptWordTokenizer(lowercase=True)[source]
    Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
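As a rough sketch of what a whitespace word tokenizer with a `lowercase` flag does: the snippet below is an assumption-laden illustration, not the actual ScriptWordTokenizer (which, as a ScriptTokenizerBase subclass, may also return character offsets alongside each token); `word_tokenize` is a hypothetical name.

```python
import re

def word_tokenize(text: str, lowercase: bool = True) -> list:
    """Hypothetical sketch: split on runs of whitespace, optionally lowercasing."""
    if lowercase:
        text = text.lower()
    # \S+ matches maximal runs of non-whitespace characters.
    return re.findall(r"\S+", text)
```

For example, `word_tokenize("Hello  World!")` yields `["hello", "world!"]`, while passing `lowercase=False` preserves the original casing.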
Module contents
class pytext.torchscript.tokenizer.ScriptBPE(vocab: Dict[str, int], eow: str = '_EOW')[source]
    Bases: torch.jit._script.ScriptModule

    Byte-pair encoding implementation in TorchScript.
    classmethod from_vocab_file(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]
class pytext.torchscript.tokenizer.ScriptBPETokenizer(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]
    Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
class pytext.torchscript.tokenizer.ScriptDoNothingTokenizer[source]
    Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
class pytext.torchscript.tokenizer.ScriptTokenizerBase[source]
    Bases: torch.jit._script.ScriptModule
class pytext.torchscript.tokenizer.ScriptWordTokenizer(lowercase=True)[source]
    Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase