Tokenizer.Config

Component: Tokenizer

class Tokenizer.Config[source]

Bases: Component.Config

All Attributes (including base classes)

split_regex: str = '\\s+'
A regular expression for the tokenizer to split on. Tokens are the segments of text between matches of the regular expression. Each token's start index is inclusive and its end index is exclusive (the end index points at the first character of the matched split region that follows the token).
lowercase: bool = True
Whether token values should be lowercased.
use_byte_offsets: bool = False
Whether to report token offsets as UTF-8 byte offsets rather than character offsets.
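
The splitting semantics described above can be sketched as follows. This is a minimal illustration of the configured behavior, not PyText's actual Tokenizer implementation; the function name and return shape are assumptions for the example.

```python
import re

def tokenize(text, split_regex=r"\s+", lowercase=True, use_byte_offsets=False):
    """Hypothetical sketch: split `text` on `split_regex` and return
    (token, start, end) triples with the offset semantics described above."""
    tokens = []
    start = 0
    for m in re.finditer(split_regex, text):
        if m.start() > start:
            # Token spans the unmatched region: start inclusive,
            # end exclusive (first character of the split match).
            tokens.append((text[start:m.start()], start, m.start()))
        start = m.end()
    if start < len(text):
        tokens.append((text[start:], start, len(text)))
    if lowercase:
        tokens = [(t.lower(), s, e) for t, s, e in tokens]
    if use_byte_offsets:
        # Convert character offsets to UTF-8 byte offsets.
        tokens = [
            (t,
             len(text[:s].encode("utf-8")),
             len(text[:e].encode("utf-8")))
            for t, s, e in tokens
        ]
    return tokens
```

For example, `tokenize("Hello  World")` yields `[("hello", 0, 5), ("world", 7, 12)]`: the run of whitespace at indices 5–7 is the split match, so the first token ends at 5 and the second begins at 7.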
Subclasses
  • BERTInitialTokenizer.Config

Default JSON

{
    "split_regex": "\\s+",
    "lowercase": true,
    "use_byte_offsets": false
}
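
The default JSON above can be parsed with the standard library and maps one-to-one onto the attributes listed earlier; a quick sanity check (using only `json`, not PyText's config loader):

```python
import json

# The Default JSON shown above; a raw string keeps the escaped
# backslash in "\\s+" intact.
default_json = r"""
{
    "split_regex": "\\s+",
    "lowercase": true,
    "use_byte_offsets": false
}
"""

config = json.loads(default_json)
# JSON keys correspond directly to the Config attributes.
assert config["split_regex"] == r"\s+"
assert config["lowercase"] is True
assert config["use_byte_offsets"] is False
```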