pytext.data.sources package

Submodules

pytext.data.sources.conllu module

class pytext.data.sources.conllu.CoNLLUNERDataSource(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs)[source]

Bases: pytext.data.sources.conllu.CoNLLUPOSDataSource

Reads data separated by empty lines, where each line holds a word and its label. This data source supports datasets for NER tasks.
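
A minimal sketch of reading this format, not the actual pytext implementation: blank lines end a sentence, and each sentence yields parallel word and label lists. The function name `read_ner` and the `"text"`/`"label"` keys are illustrative assumptions.

```python
def read_ner(lines, delimiter="\t"):
    """Parse empty-line-separated (word<delimiter>label) NER data.

    A blank line closes the current sentence; each sentence is yielded
    as a dict of parallel word and label lists.
    """
    words, labels = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if words:
                yield {"text": words, "label": labels}
                words, labels = [], []
        else:
            word, label = line.split(delimiter)
            words.append(word)
            labels.append(label)
    if words:  # flush a final sentence with no trailing blank line
        yield {"text": words, "label": labels}
```
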

class pytext.data.sources.conllu.CoNLLUNERFile(file, delim, lang)[source]

Bases: object

class pytext.data.sources.conllu.CoNLLUPOSDataSource(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

DataSource which loads data from a CoNLL-U file.

classmethod from_config(config: pytext.data.sources.conllu.CoNLLUPOSDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]
raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

pytext.data.sources.data_source module

class pytext.data.sources.data_source.DataSource(schema: Dict[str, Type[CT_co]])[source]

Bases: pytext.config.component.Component

Data sources are simple components that stream data from somewhere using Python’s iteration interface. Each data source should expose three iterators, “train”, “test”, and “eval”. Each of these should be iterable any number of times, and iterating over it should yield dictionaries whose values are deserialized Python types.

In short, these data sources are an interface for reading through datasets in a Pythonic way, with Pythonic types, abstracting away the format in which they are stored.

eval = <pytext.data.sources.data_source.GeneratorIterator object>[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>[source]
train = <pytext.data.sources.data_source.GeneratorIterator object>[source]
class pytext.data.sources.data_source.GeneratorIterator(generator, *args, **kwargs)[source]

Bases: object

Create an object which can be iterated over multiple times from a generator call. Each iteration will call the generator and allow iterating over it. This is unsafe to use on generators which have side effects, such as file readers; it’s up to the callers to safely manage these scenarios.
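
The core idea can be sketched in a few lines, independently of pytext: store the generator function and its arguments, and call it afresh on every iteration. The class name `ReiterableGenerator` is a hypothetical stand-in, not the real GeneratorIterator.

```python
class ReiterableGenerator:
    """Sketch of the GeneratorIterator pattern: each iteration calls the
    generator function again, so the object can be iterated repeatedly."""

    def __init__(self, generator_fn, *args, **kwargs):
        self.generator_fn = generator_fn
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        # A fresh generator per iteration; a bare generator would be
        # exhausted after the first pass.
        return iter(self.generator_fn(*self.args, **self.kwargs))


def count_up(n):
    yield from range(n)


rows = ReiterableGenerator(count_up, 3)
first, second = list(rows), list(rows)  # both passes see all items
```

As the docstring warns, this only works safely when calling the generator again has no side effects, e.g. it would re-read a file from wherever its cursor happens to be.
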

class pytext.data.sources.data_source.GeneratorMethodProperty(generator)[source]

Bases: object

Identify a generator method as a property. This will allow instances to iterate over the property multiple times, and not consume the generator. It accomplishes this by wrapping the generator and creating multiple generator instances if iterated over multiple times.
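
A descriptor in this spirit might look like the sketch below; the names `generator_property` and `Numbers` are illustrative, and this is not the pytext implementation.

```python
class generator_property:
    """Descriptor sketch: wraps a generator method so the attribute can
    be iterated over repeatedly, creating a fresh generator each time
    rather than consuming one shared generator."""

    def __init__(self, fn):
        self.fn = fn

    def __get__(self, obj, objtype=None):
        fn, bound_to = self.fn, obj

        class _Reiterable:
            def __iter__(self):
                # Call the wrapped generator method anew per iteration.
                return fn(bound_to)

        return _Reiterable()


class Numbers:
    @generator_property
    def train(self):
        yield from (1, 2, 3)
```
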

class pytext.data.sources.data_source.RawExample[source]

Bases: dict

A wrapper class for a single example row with a dict interface. It exists for any logic we want row objects to have that plain dicts don’t provide.

class pytext.data.sources.data_source.RootDataSource(schema: Dict[str, Type[CT_co]], column_mapping: Dict[str, str] = ())[source]

Bases: pytext.data.sources.data_source.DataSource

A data source which actually loads data from a location. This data source needs to be responsible for converting types based on a schema, because it should be the only part of the system that actually needs to understand details about the underlying storage system.

RootDataSource presents a simpler abstraction than DataSource where the rows are automatically converted to the right DataTypes.

A RootDataSource should implement raw_train_data_generator, raw_test_data_generator, and raw_eval_data_generator. These functions should yield dictionaries of raw objects which the loading system can convert using the schema loading functions.
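
The contract can be illustrated without pytext itself. The hypothetical `InMemorySource` below shows the shape of the three generators: each yields one dict per example, with values in their raw (string) form, which the schema’s loaders (e.g. load_int, load_json) would then convert.

```python
class InMemorySource:
    """Illustrative RootDataSource-style class (not a real pytext
    subclass): raw_*_data_generator methods yield one dict per example,
    keyed by field name, with raw string values."""

    def __init__(self, train_rows):
        self.train_rows = train_rows

    def raw_train_data_generator(self):
        for text, label in self.train_rows:
            yield {"text": text, "label": label}

    def raw_eval_data_generator(self):
        return iter(())  # no eval data in this sketch

    def raw_test_data_generator(self):
        return iter(())  # no test data in this sketch


src = InMemorySource([("hello world", "greeting")])
row = next(src.raw_train_data_generator())
```
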

DATA_SOURCE_TYPES = {<class 'str'>: <function load_text>, typing.Any: <function load_text>, typing.List[pytext.utils.data.Slot]: <function load_slots>, typing.List[int]: <function load_json>, typing.List[str]: <function load_json>, typing.List[typing.Dict[str, typing.Dict[str, float]]]: <function load_json>, typing.List[float]: <function load_float_list>, ~JSONString: <function load_json_string>, <class 'float'>: <function load_float>, <class 'int'>: <function load_int>}
eval = <pytext.data.sources.data_source.GeneratorIterator object>[source]
load(value, schema_type)[source]
raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

classmethod register_type(type)[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>[source]
train = <pytext.data.sources.data_source.GeneratorIterator object>[source]
class pytext.data.sources.data_source.RowShardedDataSource(data_source: pytext.data.sources.data_source.DataSource, rank=0, world_size=1)[source]

Bases: pytext.data.sources.data_source.ShardedDataSource

Shards a given datasource by row.
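
Row sharding reduces to a simple modulo filter; a sketch under the assumption that each worker keeps every world_size-th row starting at its rank (function name `shard_rows` is hypothetical):

```python
def shard_rows(rows, rank=0, world_size=1):
    """Row-sharding sketch: worker `rank` of `world_size` keeps every
    world_size-th row starting at its rank, so workers see disjoint,
    near-equal slices of the data."""
    for i, row in enumerate(rows):
        if i % world_size == rank:
            yield row
```
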

train = <pytext.data.sources.data_source.GeneratorIterator object>[source]
train_unsharded = <pytext.data.sources.data_source.GeneratorIterator object>[source]
class pytext.data.sources.data_source.SafeFileWrapper(*args, **kwargs)[source]

Bases: object

A simple wrapper class for files which allows file descriptors to be managed with normal Python ref counts. Without it, creating a file in a from_config produces a warning along the lines of “ResourceWarning: self._file is acquired but not always released”. This is because the file is opened outside a context manager (with statement). We want to do it this way because it lets us pass a file object to the DataSource rather than a filename, which gives far more flexibility and testability; passing filenames is one of the paths towards pain.

However, we don’t have a clear resource-management system set up for configuration. from_config functions are the tool we have for objects to specify how they should be created from a configuration, which generally should only happen from the command line; in, e.g., a notebook you should build the objects with constructors directly. If building from constructors, you can just open a file and pass it, but from_config needs to create a file object from a configured filename. Python files don’t close automatically, so you also need a system that closes them when the Python interpreter shuts down; if you don’t, it will print a resource warning at runtime as the interpreter manually closes the file handles. (Modern OSs are fairly tolerant of open file handles, so it’s hard to justify exactly why Python is so strict about this; one reason you might actually care is that a writable file handle might not have flushed properly when the C runtime exits, though Python doesn’t distinguish between writable and non-writable file handles.)

This class is a wrapper that creates a system for (sort of) safely closing the file handles before the runtime exits, by closing the file when the object’s destructor is called. Although the Python standard doesn’t guarantee when destructors are called, CPython is reference counted and so, as an implementation detail, calls a destructor whenever the last reference to an object is removed. This generally happens to all objects created during program execution, as long as there are no reference cycles (I don’t know off-hand whether cycle collection runs before shutdown, and in any case such cycles would have to include objects the runtime itself holds pointers to, which you’d have to work hard at and wouldn’t do accidentally). This isn’t true for other Python implementations like PyPy or Jython, which use generational garbage collection and so don’t always call destructors before the system shuts down; but again, this is only really relevant for writable files.

An alternative implementation would be to build a resource-management system into PyText: a function we use for opening system resources that registers them, so we can make sure they are all closed before system shutdown. That would probably be the technically right solution, but I didn’t think of it first, and it would also take a bit longer to implement.

If you are seeing resource warnings on your system, please file a github issue.

class pytext.data.sources.data_source.ShardedDataSource(schema: Dict[str, Type[CT_co]])[source]

Bases: pytext.data.sources.data_source.DataSource

Base class for sharded data sources.

pytext.data.sources.data_source.generator_property

alias of pytext.data.sources.data_source.GeneratorMethodProperty

pytext.data.sources.data_source.load_float(f)[source]
pytext.data.sources.data_source.load_float_list(s)[source]
pytext.data.sources.data_source.load_int(x)[source]
pytext.data.sources.data_source.load_json(s)[source]
pytext.data.sources.data_source.load_json_string(s)[source]
pytext.data.sources.data_source.load_slots(s)[source]
pytext.data.sources.data_source.load_text(s)[source]

pytext.data.sources.dense_retrieval module

class pytext.data.sources.dense_retrieval.DenseRetrievalDataSource(schema, train_filename=None, test_filename=None, eval_filename=None, num_negative_ctxs=1, use_title=True, use_cache=False)[source]

Bases: pytext.data.sources.data_source.DataSource

Data source for DPR (https://github.com/facebookresearch/DPR).

Expects multiline JSON for lazy loading and improved memory usage. The original DPR files can be converted to multiline JSON using jq -c .[]
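
A pure-Python equivalent of that jq invocation, sketched here under the assumption that the input is a single top-level JSON array (function name `to_multiline_json` is hypothetical):

```python
import json


def to_multiline_json(in_path, out_path):
    """Turn a top-level JSON array into one compact JSON object per
    line (JSON Lines), so the file can later be read lazily line by
    line instead of being parsed whole into memory."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for item in json.load(fin):
            fout.write(json.dumps(item) + "\n")
```
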

DEFAULT_SCHEMA = {'negative_ctxs': typing.List[str], 'positive_ctx': <class 'str'>, 'question': <class 'str'>}
eval = <pytext.data.sources.data_source.GeneratorIterator object>
classmethod from_config(config: pytext.data.sources.dense_retrieval.DenseRetrievalDataSource.Config, schema={'negative_ctxs': typing.List[str], 'positive_ctx': <class 'str'>, 'question': <class 'str'>})[source]
process_file(fname, is_train)[source]
read_file(fname)[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>
train = <pytext.data.sources.data_source.GeneratorIterator object>
pytext.data.sources.dense_retrieval.combine_title_text_id(ctx, use_title)[source]

pytext.data.sources.pandas module

class pytext.data.sources.pandas.PandasDataSource(train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

DataSource which loads data from a pandas DataFrame.

Inputs:

train_df: DataFrame for training

eval_df: DataFrame for evaluation

test_df: DataFrame for test

schema: same as the base DataSource; defines the list of output values with their types

column_mapping: maps the column names in the DataFrame to the names defined in the schema
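
A sketch of the per-row iteration this source performs, written without pandas itself: a dict of columns stands in for the DataFrame, and each row is yielded as a dict keyed by (optionally remapped) column name. Names here are illustrative, not the real implementation.

```python
def raw_generator(columns, column_mapping=None):
    """Yield one dict per row from a dict-of-lists stand-in for a
    DataFrame, renaming columns via column_mapping as RootDataSource
    would before schema conversion."""
    column_mapping = column_mapping or {}
    n_rows = min((len(v) for v in columns.values()), default=0)
    for i in range(n_rows):
        yield {column_mapping.get(name, name): values[i]
               for name, values in columns.items()}
```
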

classmethod from_config(config: pytext.data.sources.pandas.PandasDataSource.Config, schema: Dict[str, Type[CT_co]])[source]
raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

static raw_generator(df: Optional[pandas.core.frame.DataFrame])[source]
raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

class pytext.data.sources.pandas.SessionPandasDataSource(schema: Dict[str, Type[CT_co]], id_col: str, train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, column_mapping: Dict[str, str] = ())[source]

Bases: pytext.data.sources.pandas.PandasDataSource, pytext.data.sources.session.SessionDataSource

pytext.data.sources.session module

class pytext.data.sources.session.SessionDataSource(id_col, **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

Data source for session-based data. The input data is organized in sessions, and each session may have multiple rows. The first column is always the session id. Raw input rows are consolidated by session id and returned as one session per example.

merge_session(session)[source]
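
The consolidation step can be sketched with itertools.groupby, assuming (as the description implies) that rows of a session are contiguous in the input; `merge_sessions` and its output keys are illustrative names, not the real merge_session API.

```python
from itertools import groupby


def merge_sessions(rows):
    """Group contiguous rows sharing a session id (first column) into
    one example per session."""
    for session_id, group in groupby(rows, key=lambda r: r[0]):
        yield {"session_id": session_id,
               "rows": [r[1:] for r in group]}  # drop the id column
```
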

pytext.data.sources.squad module

class pytext.data.sources.squad.SquadDataSource(train_filename=None, test_filename=None, eval_filename=None, ignore_impossible=True, max_character_length=1048576, min_overlap=0.1, delimiter='\t', quoted=False, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]

Bases: pytext.data.sources.data_source.DataSource

Downloads data from https://rajpurkar.github.io/SQuAD-explorer/. Returns tuples of (doc, question, answer, answer_start, has_answer).
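
A sketch of walking the standard SQuAD JSON layout into flat per-question rows, not the actual process_squad_json implementation; the `parse_squad` name and the exact output keys are assumptions (`is_impossible` only exists in SQuAD 2.0-style files, so it is defaulted to False here).

```python
def parse_squad(squad_dict):
    """Flatten the nested SQuAD JSON (data -> paragraphs -> qas) into
    one dict per question, pairing each question with its context."""
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            doc = paragraph["context"]
            for qa in paragraph["qas"]:
                yield {
                    "doc": doc,
                    "question": qa["question"],
                    "answers": [a["text"] for a in qa["answers"]],
                    "answer_starts": [a["answer_start"] for a in qa["answers"]],
                    # SQuAD 1.1 files have no is_impossible field
                    "has_answer": str(not qa.get("is_impossible", False)),
                }
```
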

DEFAULT_SCHEMA = {'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>}
eval = <pytext.data.sources.data_source.GeneratorIterator object>
classmethod from_config(config: pytext.data.sources.squad.SquadDataSource.Config, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]
process_file(fname)[source]
process_squad_json(fname)[source]
process_squad_tsv(fname)[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>
train = <pytext.data.sources.data_source.GeneratorIterator object>
class pytext.data.sources.squad.SquadDataSourceForKD(**kwargs)[source]

Bases: pytext.data.sources.squad.SquadDataSource

Squad-like data along with soft labels (logits). Returns tuples of (doc, question, answer, answer_start, has_answer, start_logits, end_logits, has_answer_logits, pad_mask, segment_labels).

process_squad_tsv(fname)[source]

pytext.data.sources.tsv module

class pytext.data.sources.tsv.BlockShardedTSV(file, field_names=None, delimiter='\t', quoted=False, block_id=0, num_blocks=1, drop_incomplete_rows=False)[source]

Bases: object

Takes a TSV file, splits it into N pieces (by byte location), and returns an iterator over one of the pieces. The pieces are equal in byte size, not in number of rows, so care is needed when using this for distributed training; otherwise the number of batches may differ between workers.
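
The byte-range arithmetic behind such block sharding can be sketched as follows (function name `block_bounds` is hypothetical); a real reader would then seek to the block start and skip forward to the next newline so it begins on a row boundary.

```python
def block_bounds(file_size, block_id, num_blocks):
    """Split `file_size` bytes into `num_blocks` near-equal byte
    ranges and return the half-open [start, end) range for block
    `block_id`. Ranges are equal by bytes, not by row count."""
    start = file_size * block_id // num_blocks
    end = file_size * (block_id + 1) // num_blocks
    return start, end
```
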

class pytext.data.sources.tsv.BlockShardedTSVDataSource(rank=0, world_size=1, **kwargs)[source]

Bases: pytext.data.sources.tsv.TSVDataSource, pytext.data.sources.data_source.ShardedDataSource

train_unsharded = <pytext.data.sources.data_source.GeneratorIterator object>
class pytext.data.sources.tsv.MultilingualTSVDataSource(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', data_source_languages={'eval': ['en'], 'test': ['en'], 'train': ['en']}, language_columns=['language'], **kwargs)[source]

Bases: pytext.data.sources.tsv.TSVDataSource

Data source for multilingual data. The input data can have multiple text fields, and each field can be in the same language or in different languages. The data_source_languages dict contains the language information for each text field, and it should match the number of language identifiers specified in language_columns.

eval = <pytext.data.sources.data_source.GeneratorIterator object>
test = <pytext.data.sources.data_source.GeneratorIterator object>
train = <pytext.data.sources.data_source.GeneratorIterator object>
class pytext.data.sources.tsv.SessionTSVDataSource(train_file=None, test_file=None, eval_file=None, field_names=None, **kwargs)[source]

Bases: pytext.data.sources.tsv.TSVDataSource, pytext.data.sources.session.SessionDataSource

class pytext.data.sources.tsv.TSV(file, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False)[source]

Bases: object

class pytext.data.sources.tsv.TSVDataSource(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False, **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

DataSource which loads data from TSV sources. Uses Python’s csv library.
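
The underlying csv usage can be sketched directly: tab-delimited rows are parsed by the standard library and zipped with the configured field names. The variable names here are illustrative.

```python
import csv
import io

# In-memory stand-in for a TSV file: one example per line,
# columns separated by tabs.
field_names = ["label", "text"]
data = io.StringIO("positive\tgreat movie\nnegative\tterrible\n")

reader = csv.reader(data, delimiter="\t")
rows = [dict(zip(field_names, row)) for row in reader]
```
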

classmethod from_config(config: pytext.data.sources.tsv.TSVDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]
raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

Module contents

class pytext.data.sources.DataSource(schema: Dict[str, Type[CT_co]])[source]

Bases: pytext.config.component.Component

Data sources are simple components that stream data from somewhere using Python’s iteration interface. Each data source should expose three iterators, “train”, “test”, and “eval”. Each of these should be iterable any number of times, and iterating over it should yield dictionaries whose values are deserialized Python types.

In short, these data sources are an interface for reading through datasets in a Pythonic way, with Pythonic types, abstracting away the format in which they are stored.

eval = <pytext.data.sources.data_source.GeneratorIterator object>[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>[source]
train = <pytext.data.sources.data_source.GeneratorIterator object>[source]
class pytext.data.sources.RawExample[source]

Bases: dict

A wrapper class for a single example row with a dict interface. It exists for any logic we want row objects to have that plain dicts don’t provide.

class pytext.data.sources.SquadDataSource(train_filename=None, test_filename=None, eval_filename=None, ignore_impossible=True, max_character_length=1048576, min_overlap=0.1, delimiter='\t', quoted=False, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]

Bases: pytext.data.sources.data_source.DataSource

Downloads data from https://rajpurkar.github.io/SQuAD-explorer/. Returns tuples of (doc, question, answer, answer_start, has_answer).

DEFAULT_SCHEMA = {'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>}
eval = <pytext.data.sources.data_source.GeneratorIterator object>
classmethod from_config(config: pytext.data.sources.squad.SquadDataSource.Config, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]
process_file(fname)[source]
process_squad_json(fname)[source]
process_squad_tsv(fname)[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>
train = <pytext.data.sources.data_source.GeneratorIterator object>
class pytext.data.sources.TSVDataSource(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False, **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

DataSource which loads data from TSV sources. Uses Python’s csv library.

classmethod from_config(config: pytext.data.sources.tsv.TSVDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]
raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

class pytext.data.sources.PandasDataSource(train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, **kwargs)[source]

Bases: pytext.data.sources.data_source.RootDataSource

DataSource which loads data from a pandas DataFrame.

Inputs:

train_df: DataFrame for training

eval_df: DataFrame for evaluation

test_df: DataFrame for test

schema: same as the base DataSource; defines the list of output values with their types

column_mapping: maps the column names in the DataFrame to the names defined in the schema

classmethod from_config(config: pytext.data.sources.pandas.PandasDataSource.Config, schema: Dict[str, Type[CT_co]])[source]
raw_eval_data_generator()[source]

Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

static raw_generator(df: Optional[pandas.core.frame.DataFrame])[source]
raw_test_data_generator()[source]

Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

raw_train_data_generator()[source]

Returns a generator that yields the TRAIN data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.

class pytext.data.sources.CoNLLUNERDataSource(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs)[source]

Bases: pytext.data.sources.conllu.CoNLLUPOSDataSource

Reads data separated by empty lines, where each line holds a word and its label. This data source supports datasets for NER tasks.

class pytext.data.sources.DenseRetrievalDataSource(schema, train_filename=None, test_filename=None, eval_filename=None, num_negative_ctxs=1, use_title=True, use_cache=False)[source]

Bases: pytext.data.sources.data_source.DataSource

Data source for DPR (https://github.com/facebookresearch/DPR).

Expects multiline JSON for lazy loading and improved memory usage. The original DPR files can be converted to multiline JSON using jq -c .[]

DEFAULT_SCHEMA = {'negative_ctxs': typing.List[str], 'positive_ctx': <class 'str'>, 'question': <class 'str'>}
eval = <pytext.data.sources.data_source.GeneratorIterator object>
classmethod from_config(config: pytext.data.sources.dense_retrieval.DenseRetrievalDataSource.Config, schema={'negative_ctxs': typing.List[str], 'positive_ctx': <class 'str'>, 'question': <class 'str'>})[source]
process_file(fname, is_train)[source]
read_file(fname)[source]
test = <pytext.data.sources.data_source.GeneratorIterator object>
train = <pytext.data.sources.data_source.GeneratorIterator object>