pytext.torchscript.tokenizer package

Submodules

pytext.torchscript.tokenizer.bpe module

class pytext.torchscript.tokenizer.bpe.ScriptBPE(vocab: Dict[str, int], eow: str = '_EOW')[source]

Bases: torch.jit.ScriptModule

Byte-pair encoding implementation in TorchScript.

vocab_file should be a file-like object with one entry per line, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore cannot contain whitespace (anything matched by the Python regex \s). The vocab file should be sorted by token importance, and tokens will be merged in this priority order; the actual count values are irrelevant.

eow_token should be a string that is appended to the word's last character to form the initial end-of-word token; that token is carried through each step of the merge process and returned at the end. Set it to match the end-of-word marker that was used when your ScriptBPE vocab file was generated.

>>> import io
>>> vocab_file = io.StringIO('''
hello_EOW 20
world_EOW 18
th  17
is_EOW 16
bpe_EOW 15
! 14
h 13
t 6
s_EOW 2
i -1
ii -2
''')
>>> bpe = ScriptBPE.from_vocab_file(vocab_file)
>>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
>>> bpe.tokenize(["iiiis"])
["ii", "i", "is_EOW"]
classmethod from_vocab_file(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]
classmethod from_vocab_filename(vocab_filename: str) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]
static load_vocab(file: io.IOBase) → Dict[str, int][source]
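The vocab file format described above could be parsed along these lines. This is a hedged sketch under the stated assumptions (counts ignored, priority taken from line order); `load_vocab_sketch` is a hypothetical name and the real load_vocab may assign priorities differently.

```python
import io
from typing import Dict


def load_vocab_sketch(file: io.IOBase) -> Dict[str, int]:
    """Parse 'word count' lines into a word -> priority mapping.

    Per the class docstring, only the ordering of lines matters;
    the count values themselves are ignored.
    """
    vocab: Dict[str, int] = {}
    for line in file:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word = fields[0]
        if word not in vocab:
            # Earlier lines get a lower number, i.e. a higher merge priority.
            vocab[word] = len(vocab)
    return vocab
```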

pytext.torchscript.tokenizer.tokenizer module

class pytext.torchscript.tokenizer.tokenizer.ScriptBPETokenizer(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase

class pytext.torchscript.tokenizer.tokenizer.ScriptDoNothingTokenizer(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase

class pytext.torchscript.tokenizer.tokenizer.ScriptTextTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase

input_type() → pytext.torchscript.utils.ScriptInputType[source]

Determine the TorchScript module input type. There are currently four types:

1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]

class pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase

input_type() → pytext.torchscript.utils.ScriptInputType[source]

Determine the TorchScript module input type. There are currently four types:

1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]

class pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: torch.jit.ScriptModule

input_type() → pytext.torchscript.utils.ScriptInputType[source]

Determine the TorchScript module input type. There are currently four types:

1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]
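The four input shapes can be illustrated with plain Python lists (the batch contents below are illustrative data only):

```python
from typing import List

# Each example is a batch of size 2, one of the four input types:
text_batch: List[str] = ["first text", "second text"]
tokens_batch: List[List[str]] = [["first", "text"], ["second", "text"]]
multi_text_batch: List[List[str]] = [["text a", "text b"], ["text c", "text d"]]
multi_tokens_batch: List[List[List[str]]] = [
    [["text", "a"], ["text", "b"]],
    [["text", "c"], ["text", "d"]],
]
```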

Module contents

class pytext.torchscript.tokenizer.ScriptBPE(vocab: Dict[str, int], eow: str = '_EOW')[source]

Bases: torch.jit.ScriptModule

Byte-pair encoding implementation in TorchScript.

vocab_file should be a file-like object with one entry per line, where each line consists of a word and a count separated by whitespace. Words in the vocab therefore cannot contain whitespace (anything matched by the Python regex \s). The vocab file should be sorted by token importance, and tokens will be merged in this priority order; the actual count values are irrelevant.

eow_token should be a string that is appended to the word's last character to form the initial end-of-word token; that token is carried through each step of the merge process and returned at the end. Set it to match the end-of-word marker that was used when your ScriptBPE vocab file was generated.

>>> import io
>>> vocab_file = io.StringIO('''
hello_EOW 20
world_EOW 18
th  17
is_EOW 16
bpe_EOW 15
! 14
h 13
t 6
s_EOW 2
i -1
ii -2
''')
>>> bpe = ScriptBPE.from_vocab_file(vocab_file)
>>> bpe.tokenize(["hello", "world", "this", "is", "bpe"])
["hello_EOW", "world_EOW", "th", "is_EOW", "is_EOW", "bpe_EOW"]
>>> bpe.tokenize(["iiiis"])
["ii", "i", "is_EOW"]
classmethod from_vocab_file(vocab_file: io.IOBase) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]
classmethod from_vocab_filename(vocab_filename: str) → pytext.torchscript.tokenizer.bpe.ScriptBPE[source]
static load_vocab(file: io.IOBase) → Dict[str, int][source]
class pytext.torchscript.tokenizer.ScriptBPETokenizer(bpe: pytext.torchscript.tokenizer.bpe.ScriptBPE)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase

class pytext.torchscript.tokenizer.ScriptDoNothingTokenizer(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenTokenizerBase

class pytext.torchscript.tokenizer.ScriptTextTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase

input_type() → pytext.torchscript.utils.ScriptInputType[source]

Determine the TorchScript module input type. There are currently four types:

1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]

class pytext.torchscript.tokenizer.ScriptTokenTokenizerBase(optimize=None, _qualified_name=None, _compilation_unit=None, _cpp_module=None)[source]

Bases: pytext.torchscript.tokenizer.tokenizer.ScriptTokenizerBase

input_type() → pytext.torchscript.utils.ScriptInputType[source]

Determine the TorchScript module input type. There are currently four types:

1) text: batch with a single text in each row, List[str]
2) tokens: batch with a list of tokens from a single text in each row, List[List[str]]
3) multi_text: batch with multiple texts in each row, List[List[str]]
4) multi_tokens: batch with multiple lists of tokens from multiple texts in each row, List[List[List[str]]]