pytext.torchscript.tokenizer package

Submodules

pytext.torchscript.tokenizer.bpe module

class pytext.torchscript.tokenizer.bpe.ScriptBPE(vocab: Dict[str, int], eow: str = '_EOW')[source]

Bases: torch.jit._script.ScriptModule

Byte-pair encoding implementation in TorchScript.

vocab_file should be a newline-delimited file-like object, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore cannot contain whitespace (anything matching the Python regex \s). The vocab file should be sorted by token importance: merges are applied in that priority order, and the actual count values are otherwise ignored.

eow_token should be a string appended to the last character of each word; the resulting end-of-word token is tracked through every merge step and returned in the output. Set it to match the EOW marker used when your ScriptBPE vocab file was generated.

>>> import io
>>> vocab_file = io.StringIO('''
hello_EOW 20
world_EOW 18
th  17
is_EOW 16
bpe_EOW 15
! 14
h 13
t 6
s_EOW 2
i -1
ii -2
''')
>>> bpe = ScriptBPE.from_vocab_file(vocab_file)
>>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
>>> bpe.tokenize(["iiiis"])
["ii", "i", "is_EOW"]

classmethod from_vocab_file(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]

classmethod from_vocab_filename(vocab_filename: str) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]

static load_vocab(file: io.IOBase) → Dict[str, int][source]
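
A minimal loading sketch, assuming a vocab file on disk in the word-count format described above (the filename bpe_vocab.txt is hypothetical, and the tokens produced depend entirely on that vocab's merge priorities):

>>> from pytext.torchscript.tokenizer.bpe import ScriptBPE
>>> bpe = ScriptBPE.from_vocab_filename("bpe_vocab.txt")  # opens and parses the file
>>> bpe.tokenize(["some", "words"])  # each word's final subword carries the _EOW marker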

pytext.torchscript.tokenizer.tokenizer module

class pytext.torchscript.tokenizer.tokenizer.ScriptBPETokenizer(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
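
A hedged usage sketch: ScriptBPETokenizer wraps a ScriptBPE and exposes the tokenize(text) interface of ScriptTokenizerBase. Judging from the built-in subclasses, that interface yields (token, start, end) tuples; BPE subwords have no spans in the raw string, so the indices are assumed here to be placeholders:

>>> from pytext.torchscript.tokenizer.tokenizer import ScriptBPETokenizer
>>> bpe_tokenizer = ScriptBPETokenizer(bpe)  # bpe: a ScriptBPE, e.g. from the example above
>>> bpe_tokenizer.tokenize("hello world")  # assumed: [('hello_EOW', -1, -1), ('world_EOW', -1, -1)]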

class pytext.torchscript.tokenizer.tokenizer.ScriptDoNothingTokenizer[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
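
ScriptDoNothingTokenizer is the identity tokenizer: a reasonable assumption, given the name, is that it returns the raw input as a single token, which is useful when text has already been tokenized upstream:

>>> from pytext.torchscript.tokenizer.tokenizer import ScriptDoNothingTokenizer
>>> ScriptDoNothingTokenizer().tokenize("pre-tokenized")  # assumed: [('pre-tokenized', -1, -1)]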

class pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase[source]

Bases: torch.jit._script.ScriptModule
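
ScriptTokenizerBase only fixes the TorchScript interface shared by the tokenizers above. The sketch below shows how a custom subclass might be written against it; it assumes the contract suggested by the built-in subclasses, tokenize(str) -> List[Tuple[str, int, int]], and is illustrative rather than an API guarantee:

import torch
from typing import List, Tuple

from pytext.torchscript.tokenizer.tokenizer import ScriptTokenizerBase

class ScriptCommaTokenizer(ScriptTokenizerBase):
    """Hypothetical subclass: splits on commas and reports character spans."""

    @torch.jit.script_method
    def tokenize(self, text: str) -> List[Tuple[str, int, int]]:
        tokens: List[Tuple[str, int, int]] = []
        start = 0
        for piece in text.split(","):
            end = start + len(piece)
            if len(piece) > 0:
                tokens.append((piece, start, end))
            start = end + 1  # skip past the comma
        return tokens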

class pytext.torchscript.tokenizer.tokenizer.ScriptWordTokenizer(lowercase=True)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase
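
ScriptWordTokenizer presumably splits on whitespace (as its name suggests), lowercasing by default per the lowercase=True argument. A hedged usage sketch, assuming the (token, start, end) return convention noted above:

>>> from pytext.torchscript.tokenizer.tokenizer import ScriptWordTokenizer
>>> word_tokenizer = ScriptWordTokenizer(lowercase=True)
>>> word_tokenizer.tokenize("Hello World")  # assumed: [('hello', 0, 5), ('world', 6, 11)]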

Module contents

class pytext.torchscript.tokenizer.ScriptBPE(vocab: Dict[str, int], eow: str = '_EOW')[source]

Bases: torch.jit._script.ScriptModule

Byte-pair encoding implementation in TorchScript.

vocab_file should be a newline-delimited file-like object, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore cannot contain whitespace (anything matching the Python regex \s). The vocab file should be sorted by token importance: merges are applied in that priority order, and the actual count values are otherwise ignored.

eow_token should be a string appended to the last character of each word; the resulting end-of-word token is tracked through every merge step and returned in the output. Set it to match the EOW marker used when your ScriptBPE vocab file was generated.

>>> import io
>>> vocab_file = io.StringIO('''
hello_EOW 20
world_EOW 18
th  17
is_EOW 16
bpe_EOW 15
! 14
h 13
t 6
s_EOW 2
i -1
ii -2
''')
>>> bpe = ScriptBPE.from_vocab_file(vocab_file)
>>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
>>> bpe.tokenize(["iiiis"])
["ii", "i", "is_EOW"]

classmethod from_vocab_file(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]

classmethod from_vocab_filename(vocab_filename: str) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]

static load_vocab(file: io.IOBase) → Dict[str, int][source]

class pytext.torchscript.tokenizer.ScriptBPETokenizer(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase

class pytext.torchscript.tokenizer.ScriptDoNothingTokenizer[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase

class pytext.torchscript.tokenizer.ScriptTokenizerBase[source]

Bases: torch.jit._script.ScriptModule

class pytext.torchscript.tokenizer.ScriptWordTokenizer(lowercase=True)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase