ByteTokenTensorizer.Config

Component: ByteTokenTensorizer

class ByteTokenTensorizer.Config

Bases: Tensorizer.Config

All Attributes (including base classes)

is_input: bool = True
column: str = 'text'
The name of the text column to parse from the data source.
tokenizer: Tokenizer.Config = Tokenizer.Config()
The tokenizer to use to split input text into tokens.
max_seq_len: Optional[int] = None
The maximum sequence length, in tokens, for the input text; None means no limit.
max_byte_len: int = 15
The max byte length for a token.
offset_for_non_padding: int = 0
Offset added to every non-padding byte value.
add_bos_token: bool = False
Whether to prepend a beginning-of-sequence token to each example.
add_eos_token: bool = False
Whether to append an end-of-sequence token to each example.
use_eos_token_for_bos: bool = False
Whether to use the EOS token in place of a dedicated BOS token.
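The attributes above control a token-then-byte encoding: text is split into tokens, each token is converted to its UTF-8 bytes, truncated or zero-padded to `max_byte_len`, and optionally offset. The sketch below illustrates that behavior with a hypothetical standalone function (`byte_encode` is not part of PyText's API); it assumes the default tokenizer settings shown on this page (`split_regex` of `\s+`, lowercasing on) and pads with byte value 0.

```python
import re


def byte_encode(text, max_byte_len=15, offset_for_non_padding=0,
                max_seq_len=None, lowercase=True):
    """Illustrative sketch of byte-token encoding; not PyText's implementation."""
    if lowercase:
        text = text.lower()
    # Default tokenizer behavior: split on runs of whitespace.
    tokens = re.split(r"\s+", text.strip())
    if max_seq_len is not None:
        tokens = tokens[:max_seq_len]
    rows = []
    for tok in tokens:
        # UTF-8 bytes of the token, truncated to max_byte_len.
        byte_vals = list(tok.encode("utf-8"))[:max_byte_len]
        # offset_for_non_padding shifts real bytes away from the pad value.
        byte_vals = [b + offset_for_non_padding for b in byte_vals]
        # Zero-pad every token to a fixed width of max_byte_len.
        byte_vals += [0] * (max_byte_len - len(byte_vals))
        rows.append(byte_vals)
    return rows


# "Hi there" -> two tokens, each a fixed-width row of byte values.
print(byte_encode("Hi there", max_byte_len=4))
```

With `max_byte_len=4`, "hi" pads to `[104, 105, 0, 0]` while "there" truncates to its first four bytes, which is why a nonzero `offset_for_non_padding` can be useful: it keeps real zero-valued bytes distinguishable from padding.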

Default JSON

{
    "is_input": true,
    "column": "text",
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\s+",
            "lowercase": true,
            "use_byte_offsets": false
        }
    },
    "max_seq_len": null,
    "max_byte_len": 15,
    "offset_for_non_padding": 0,
    "add_bos_token": false,
    "add_eos_token": false,
    "use_eos_token_for_bos": false
}
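One way to build a customized config dict is to start from the default JSON above and shallow-merge overrides on top; this is plain Python and a simplification of PyText's own config parsing, and the override values chosen here are arbitrary examples.

```python
import json

# The default JSON shown above, verbatim.
defaults = json.loads("""
{
    "is_input": true,
    "column": "text",
    "tokenizer": {
        "Tokenizer": {
            "split_regex": "\\\\s+",
            "lowercase": true,
            "use_byte_offsets": false
        }
    },
    "max_seq_len": null,
    "max_byte_len": 15,
    "offset_for_non_padding": 0,
    "add_bos_token": false,
    "add_eos_token": false,
    "use_eos_token_for_bos": false
}
""")

# Hypothetical overrides for a longer per-token byte budget with EOS markers.
overrides = {"max_byte_len": 32, "add_eos_token": True}

# Shallow merge: override keys win, everything else keeps its default.
config = {**defaults, **overrides}
print(config["max_byte_len"], config["add_eos_token"])
```

Note the merge is shallow: to change a nested field such as the tokenizer's `lowercase`, you would need to replace or recursively merge the `tokenizer` sub-dict as well.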