pytext.data.featurizer package

Submodules

pytext.data.featurizer.featurizer module

class pytext.data.featurizer.featurizer.Featurizer(config, feature_config: pytext.config.field_config.FeatureConfig)

Bases: pytext.config.component.Component

Featurizer performs the data preprocessing that must be shared between training and inference, namely tokenization and gazetteer feature alignment.

This is an interface: subclasses must implement featurize() so that the featurizer can be used with the appropriate data handler.

featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord

featurize_batch(input_record_list: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord]

Featurize a batch of instances/examples.

classmethod from_config(config, feature_config: pytext.config.field_config.FeatureConfig)

get_sentence_markers(locale=None)

class pytext.data.featurizer.featurizer.InputRecord

Bases: tuple

Input data contract between Featurizer and DataHandler.

locale

Alias for field number 2

raw_gazetteer_feats

Alias for field number 1

raw_text

Alias for field number 0
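
The field order listed above can be illustrated with a stand-in named tuple. This is a minimal sketch that does not import PyText: `InputRecordSketch` is a hypothetical mirror of InputRecord, and the field types and sample values are assumptions, since this page documents only the field names and positions.

```python
from typing import NamedTuple


class InputRecordSketch(NamedTuple):
    """Hypothetical stand-in that mirrors the documented field order."""

    raw_text: str             # field number 0
    raw_gazetteer_feats: str  # field number 1 (type is an assumption)
    locale: str               # field number 2 (type is an assumption)


record = InputRecordSketch("play some jazz", "", "en_US")

# Named access and positional access refer to the same fields.
assert record.raw_text == record[0]
assert record.raw_gazetteer_feats == record[1]
assert record.locale == record[2]
```

Because the record is a tuple, it can also be unpacked positionally, which is why the field numbers matter to any code that does not use the field names.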

class pytext.data.featurizer.featurizer.OutputRecord

Bases: tuple

Output data contract between Featurizer and DataHandler.

characters

Alias for field number 5

contextual_token_embedding

Alias for field number 6

dense_feats

Alias for field number 7

gazetteer_feat_lengths

Alias for field number 3

gazetteer_feat_weights

Alias for field number 4

gazetteer_feats

Alias for field number 2

token_ranges

Alias for field number 1

tokens

Alias for field number 0
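
The entries above are listed alphabetically, so the positional order is easy to misread. The sketch below restates the fields in index order using a hypothetical stand-in, `OutputRecordSketch`. It does not import PyText, and the type annotations and `None` defaults are assumptions made for illustration only.

```python
from typing import List, NamedTuple, Optional, Tuple


class OutputRecordSketch(NamedTuple):
    """Hypothetical stand-in that mirrors the documented field order."""

    tokens: List[str]                                               # field 0
    token_ranges: Optional[List[Tuple[int, int]]] = None            # field 1
    gazetteer_feats: Optional[List[str]] = None                     # field 2
    gazetteer_feat_lengths: Optional[List[int]] = None              # field 3
    gazetteer_feat_weights: Optional[List[float]] = None            # field 4
    characters: Optional[List[List[str]]] = None                    # field 5
    contextual_token_embedding: Optional[List[List[float]]] = None  # field 6
    dense_feats: Optional[List[float]] = None                       # field 7


out = OutputRecordSketch(tokens=["play", "jazz"], token_ranges=[(0, 4), (5, 9)])

# Positional unpacking follows the field numbers, not the alphabetical listing.
tokens, token_ranges, *rest = out
```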

pytext.data.featurizer.simple_featurizer module

class pytext.data.featurizer.simple_featurizer.SimpleFeaturizer(config, feature_config: pytext.config.field_config.FeatureConfig)

Bases: pytext.data.featurizer.featurizer.Featurizer

Simple featurizer for basic tokenization and gazetteer feature alignment.

featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord

Featurize one instance/example only.

featurize_batch(input_records: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord]

Featurize a batch of instances/examples.

get_sentence_markers(locale=None)

tokenize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord

Tokenize one instance/example only.

tokenize_batch(input_records: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord]
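
The featurize() and featurize_batch() contract described for this module can be sketched without PyText installed. Everything in the block below is hypothetical: the two record classes are reduced stand-ins for InputRecord and OutputRecord, the default values are assumptions, and `WhitespaceFeaturizerSketch` uses plain whitespace splitting, which is not claimed to match how SimpleFeaturizer tokenizes.

```python
import re
from typing import List, NamedTuple, Optional, Sequence, Tuple


class InputRecordSketch(NamedTuple):
    """Reduced stand-in for InputRecord (defaults are assumptions)."""

    raw_text: str
    raw_gazetteer_feats: str = ""
    locale: str = ""


class OutputRecordSketch(NamedTuple):
    """Reduced stand-in for OutputRecord, keeping only the first two fields."""

    tokens: List[str]
    token_ranges: Optional[List[Tuple[int, int]]] = None


class WhitespaceFeaturizerSketch:
    """Hypothetical featurizer that splits raw_text on whitespace."""

    def featurize(self, input_record: InputRecordSketch) -> OutputRecordSketch:
        tokens: List[str] = []
        ranges: List[Tuple[int, int]] = []
        # Each run of non-whitespace characters becomes one token; the
        # (start, end) character offsets are kept alongside it.
        for match in re.finditer(r"\S+", input_record.raw_text):
            tokens.append(match.group())
            ranges.append((match.start(), match.end()))
        return OutputRecordSketch(tokens=tokens, token_ranges=ranges)

    def featurize_batch(
        self, input_records: Sequence[InputRecordSketch]
    ) -> List[OutputRecordSketch]:
        # The batch version applies featurize() to each record in order.
        return [self.featurize(record) for record in input_records]


featurizer = WhitespaceFeaturizerSketch()
outputs = featurizer.featurize_batch(
    [InputRecordSketch("Play some Jazz"), InputRecordSketch("stop")]
)
```

Running the same featurize() code path at training and inference time is the point of the interface: both stages then see identical tokens and offsets for the same raw text.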

Module contents

class pytext.data.featurizer.Featurizer(config, feature_config: pytext.config.field_config.FeatureConfig)

Bases: pytext.config.component.Component

Featurizer performs the data preprocessing that must be shared between training and inference, namely tokenization and gazetteer feature alignment.

This is an interface: subclasses must implement featurize() so that the featurizer can be used with the appropriate data handler.

featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord

featurize_batch(input_record_list: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord]

Featurize a batch of instances/examples.

classmethod from_config(config, feature_config: pytext.config.field_config.FeatureConfig)

get_sentence_markers(locale=None)

class pytext.data.featurizer.InputRecord

Bases: tuple

Input data contract between Featurizer and DataHandler.

locale

Alias for field number 2

raw_gazetteer_feats

Alias for field number 1

raw_text

Alias for field number 0

class pytext.data.featurizer.OutputRecord

Bases: tuple

Output data contract between Featurizer and DataHandler.

characters

Alias for field number 5

contextual_token_embedding

Alias for field number 6

dense_feats

Alias for field number 7

gazetteer_feat_lengths

Alias for field number 3

gazetteer_feat_weights

Alias for field number 4

gazetteer_feats

Alias for field number 2

token_ranges

Alias for field number 1

tokens

Alias for field number 0

class pytext.data.featurizer.SimpleFeaturizer(config, feature_config: pytext.config.field_config.FeatureConfig)

Bases: pytext.data.featurizer.featurizer.Featurizer

Simple featurizer for basic tokenization and gazetteer feature alignment.

featurize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord

Featurize one instance/example only.

featurize_batch(input_records: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord]

Featurize a batch of instances/examples.

get_sentence_markers(locale=None)

tokenize(input_record: pytext.data.featurizer.featurizer.InputRecord) → pytext.data.featurizer.featurizer.OutputRecord

Tokenize one instance/example only.

tokenize_batch(input_records: Sequence[pytext.data.featurizer.featurizer.InputRecord]) → Sequence[pytext.data.featurizer.featurizer.OutputRecord]