Tokenizer.Config

Component: Tokenizer

class Tokenizer.Config[source]

Bases: Component.Config

All Attributes (including base classes)

split_regex: str = '\\s+'
A regular expression for the tokenizer to split on. Tokens are the segments of text between matches of the regular expression. Each token's start index is inclusive and its end index is exclusive (the end index points at the first character of the matched split region that follows the token).
lowercase: bool = True
Whether token values should be lowercased.
use_byte_offsets: bool = False
Whether to report token offsets as UTF-8 byte offsets rather than character offsets.
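
The splitting semantics described above can be sketched as follows. This is a minimal illustration of the configured behavior, not PyText's actual Tokenizer implementation; the function name and return shape are assumptions for the example.

```python
import re

def tokenize(text, split_regex=r"\s+", lowercase=True, use_byte_offsets=False):
    """Hypothetical sketch: split `text` on `split_regex` and return
    (token, start, end) triples with the offset semantics described above."""
    tokens = []
    start = 0
    for m in re.finditer(split_regex, text):
        if m.start() > start:
            # Token spans the unmatched region: start inclusive,
            # end exclusive (first character of the split match).
            tokens.append((text[start:m.start()], start, m.start()))
        start = m.end()
    if start < len(text):
        tokens.append((text[start:], start, len(text)))
    if lowercase:
        tokens = [(t.lower(), s, e) for t, s, e in tokens]
    if use_byte_offsets:
        # Convert character offsets to UTF-8 byte offsets.
        tokens = [
            (t,
             len(text[:s].encode("utf-8")),
             len(text[:e].encode("utf-8")))
            for t, s, e in tokens
        ]
    return tokens
```

For example, `tokenize("Hello  World")` yields `[("hello", 0, 5), ("world", 7, 12)]`: the run of whitespace at indices 5–7 is the split match, so the first token ends at 5 and the second begins at 7.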
Subclasses
  • BERTInitialTokenizer.Config

Default JSON

{
    "split_regex": "\\s+",
    "lowercase": true,
    "use_byte_offsets": false
}
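
The default JSON above can be parsed with the standard library and maps one-to-one onto the attributes listed earlier; a quick sanity check (using only `json`, not PyText's config loader):

```python
import json

# The Default JSON shown above; a raw string keeps the escaped
# backslash in "\\s+" intact.
default_json = r"""
{
    "split_regex": "\\s+",
    "lowercase": true,
    "use_byte_offsets": false
}
"""

config = json.loads(default_json)
# JSON keys correspond directly to the Config attributes.
assert config["split_regex"] == r"\s+"
assert config["lowercase"] is True
assert config["use_byte_offsets"] is False
```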