Tokenizer.ConfigΒΆ
Component: Tokenizer
class Tokenizer.Config
[source] Bases: Component.Config
All Attributes (including base classes)
- split_regex: str = '\\s+'
- A regular expression for the tokenizer to split on. Tokens are the segments between regular expression matches. Each token's start index is inclusive of the unmatched region, and its end index is exclusive, landing on the first character of the matched split region.
- lowercase: bool = True
- Whether token values should be lowercased.
- use_byte_offsets: bool = False
- Whether to report token offsets as UTF-8 byte offsets instead of character offsets.
- Subclasses
BERTInitialTokenizer.Config
Default JSON
{
"split_regex": "\\s+",
"lowercase": true,
"use_byte_offsets": false
}
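The offset semantics described above can be sketched with a minimal regex-split tokenizer. This is an illustration of the documented behavior, not the library's actual implementation; the function name and the (value, start, end) tuple output are assumptions.

```python
import re

def tokenize(text, split_regex=r"\s+", lowercase=True):
    """Split `text` on `split_regex`; tokens are the segments between matches.

    Returns (value, start, end) tuples where `start` is inclusive and
    `end` is exclusive (the index of the first matched split character).
    Offsets here are character offsets; `use_byte_offsets` is not modeled.
    """
    tokens = []
    start = 0
    for match in re.finditer(split_regex, text):
        if match.start() > start:  # skip empty segments between adjacent matches
            value = text[start:match.start()]
            tokens.append((value.lower() if lowercase else value,
                           start, match.start()))
        start = match.end()
    if start < len(text):  # trailing segment after the last match
        value = text[start:]
        tokens.append((value.lower() if lowercase else value,
                       start, len(text)))
    return tokens
```

For example, tokenizing "Hello  World" with the defaults yields ("hello", 0, 5) and ("world", 7, 12): the end offset 5 points at the first whitespace character of the split region, consistent with the exclusive-end convention above.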