pytext.data.sources package¶
Submodules¶
pytext.data.sources.conllu module¶
-
class
pytext.data.sources.conllu.
CoNLLUNERDataSource
(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs)[source]¶ Bases:
pytext.data.sources.conllu.CoNLLUPOSDataSource
Reads empty-line-separated data (word label). This data source supports datasets for NER tasks.
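A minimal sketch of parsing this empty-line-separated word/label format (the data and field layout here are illustrative, not taken from pytext):

```python
# Blank lines delimit sentences; each non-blank line is "word label".
raw = "EU B-ORG\nrejects O\n\nPeter B-PER\n"

sentences, current = [], []
for line in raw.splitlines():
    if line.strip():
        word, label = line.split()
        current.append((word, label))
    elif current:
        sentences.append(current)
        current = []
if current:
    sentences.append(current)

# sentences -> [[("EU", "B-ORG"), ("rejects", "O")], [("Peter", "B-PER")]]
```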
-
class
pytext.data.sources.conllu.
CoNLLUPOSDataSource
(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
DataSource which loads data from a CoNLL-U file.
-
classmethod
from_config
(config: pytext.data.sources.conllu.CoNLLUPOSDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]¶
-
raw_eval_data_generator
()[source]¶ Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
pytext.data.sources.data_source module¶
-
class
pytext.data.sources.data_source.
DataSource
(schema: Dict[str, Type[CT_co]])[source]¶ Bases:
pytext.config.component.Component
Data sources are simple components that stream data from somewhere using Python’s iteration interface. Each data source should expose three iterators, “train”, “test”, and “eval”. Each of these should be iterable any number of times, and iterating over it should yield dictionaries whose values are deserialized Python types.
Simply, these data sources exist as an interface to read through datasets in a pythonic way, with pythonic types, and abstract away the form that they are stored in.
-
class
pytext.data.sources.data_source.
GeneratorIterator
(generator, *args, **kwargs)[source]¶ Bases:
object
Create an object which can be iterated over multiple times from a generator call. Each iteration will call the generator and allow iterating over it. This is unsafe to use on generators which have side effects, such as file readers; it’s up to the callers to safely manage these scenarios.
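The idea can be sketched with a small stand-in class (names here are illustrative, not the actual pytext implementation):

```python
# Wrap a generator function so the wrapper can be iterated many times,
# invoking the generator fresh on each pass.
class ReiterableGenerator:
    def __init__(self, generator, *args, **kwargs):
        self.generator = generator
        self.args = args
        self.kwargs = kwargs

    def __iter__(self):
        # Each iteration calls the generator again, yielding a new iterator.
        return self.generator(*self.args, **self.kwargs)


def count_up_to(n):
    yield from range(n)


numbers = ReiterableGenerator(count_up_to, 3)
first_pass = list(numbers)   # [0, 1, 2]
second_pass = list(numbers)  # [0, 1, 2]; a bare generator would be exhausted
```

Note the side-effect caveat from the docstring: if `count_up_to` read from a file instead, each pass would re-run those reads.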
-
class
pytext.data.sources.data_source.
GeneratorMethodProperty
(generator)[source]¶ Bases:
object
Identify a generator method as a property. This will allow instances to iterate over the property multiple times, and not consume the generator. It accomplishes this by wrapping the generator and creating multiple generator instances if iterated over multiple times.
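A hedged sketch of this wrapping using the descriptor protocol (the class and method names are made up; this is not the pytext source):

```python
class generator_property_sketch:
    # Accessing the property returns a fresh re-iterable wrapper, so the
    # underlying generator method is never consumed.
    def __init__(self, generator):
        self.generator = generator

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        generator, instance = self.generator, obj

        class _Reiterable:
            def __iter__(self):
                return generator(instance)

        return _Reiterable()


class Numbers:
    @generator_property_sketch
    def evens(self):
        yield from (0, 2, 4)


n = Numbers()
first = list(n.evens)
second = list(n.evens)  # both passes see all items
```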
-
class
pytext.data.sources.data_source.
RawExample
[source]¶ Bases:
dict
A wrapper class for a single example row with a dict interface. This is here for any logic we want row objects to have that dicts don’t do.
-
class
pytext.data.sources.data_source.
RootDataSource
(schema: Dict[str, Type[CT_co]], column_mapping: Dict[str, str] = ())[source]¶ Bases:
pytext.data.sources.data_source.DataSource
A data source which actually loads data from a location. This data source needs to be responsible for converting types based on a schema, because it should be the only part of the system that actually needs to understand details about the underlying storage system.
RootDataSource presents a simpler abstraction than DataSource where the rows are automatically converted to the right DataTypes.
A RootDataSource should implement raw_train_data_generator, raw_test_data_generator, and raw_eval_data_generator. These functions should yield dictionaries of raw objects which the loading system can convert using the schema loading functions.
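This contract can be mocked without importing pytext; the class, schema, and data below are illustrative stand-ins, not the real RootDataSource:

```python
# Subclasses yield raw dicts from raw_*_data_generator; the base class
# converts each value using the schema's loading functions.
class InMemoryDataSource:
    def __init__(self, schema, rows):
        self.schema = schema
        self.rows = rows

    def raw_train_data_generator(self):
        # Yield one dict per example; values are raw strings from "storage".
        yield from self.rows

    def train(self):
        # Convert raw values according to the schema, as RootDataSource would.
        for raw in self.raw_train_data_generator():
            yield {key: self.schema[key](value) for key, value in raw.items()}


source = InMemoryDataSource(
    {"text": str, "score": float},
    [{"text": "hello", "score": "0.5"}],
)
examples = list(source.train())
# examples -> [{"text": "hello", "score": 0.5}]
```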
-
DATA_SOURCE_TYPES
= {<class 'str'>: <function load_text>, typing.Any: <function load_text>, typing.List[pytext.utils.data.Slot]: <function load_slots>, typing.List[int]: <function load_json>, typing.List[str]: <function load_json>, typing.List[typing.Dict[str, typing.Dict[str, float]]]: <function load_json>, typing.List[float]: <function load_float_list>, ~JSONString: <function load_json_string>, <class 'float'>: <function load_float>, <class 'int'>: <function load_int>}¶
-
raw_eval_data_generator
()[source]¶ Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
-
raw_test_data_generator
()[source]¶ Returns a generator that yields the TEST data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
-
-
class
pytext.data.sources.data_source.
RowShardedDataSource
(data_source: pytext.data.sources.data_source.DataSource, rank=0, world_size=1)[source]¶ Bases:
pytext.data.sources.data_source.ShardedDataSource
Shards a given datasource by row.
-
class
pytext.data.sources.data_source.
SafeFileWrapper
(*args, **kwargs)[source]¶ Bases:
object
A simple wrapper class for files which allows file descriptors to be managed with normal Python ref counts. Without using this, if you create a file in a from_config you will see a warning along the lines of “ResourceWarning: self._file is acquired but not always released”. This is because we’re opening a file outside of a context manager (with statement). We want to do it this way because it lets us pass a file object to the DataSource, rather than a filename. This exposes a lot more flexibility and testability; passing filenames is one of the paths towards pain.
However, we don’t have a clear resource management system set up for configuration. from_config functions are the tool we have to allow objects to specify how they should be created from a configuration, which generally should only happen from the command line, whereas in e.g. a notebook you should build the objects with constructors directly. If building from constructors, you can just open a file and pass it, but from_config here needs to create a file object from a configured filename. Python files don’t close automatically, so you also need a system that will close them when the Python interpreter shuts down. If you don’t, it will print a resource warning at runtime, as the interpreter manually closes the file handles (although modern OSs are pretty okay with having open file handles, it’s hard for me to justify exactly why Python is so strict about this; I think one of the main reasons you might actually care is that a writable file handle might not have flushed properly when the C runtime exits, but Python doesn’t actually distinguish between writable and non-writable file handles).
This class is a wrapper that creates a system for (sort-of) safely closing the file handles before the runtime exits. It does this by closing the file when the object’s deleter is called. Although the Python standard doesn’t actually make any guarantees about when deleters are called, CPython is reference counted and so as an implementation detail will call a deleter whenever the last reference to it is removed, which generally will happen to all objects created during program execution as long as there aren’t reference cycles (I don’t actually know off-hand whether the cycle collection is run before shutdown, and anyway the cycles would have to include objects that the runtime itself maintains pointers to, which seems like you’d have to work hard to do and wouldn’t do accidentally). This isn’t true for other Python implementations like PyPy or Jython, which use generational garbage collection and so don’t actually always call destructors before the system shuts down, but again this is only really relevant for mutable files.
An alternative implementation would be to build a resource management system into PyText, something like a function that we use for opening system resources, which registers the resources and then makes sure they are all closed before system shutdown. That would probably technically be the right solution, but I didn’t think of it first, and it’s also a bit longer to implement.
If you are seeing resource warnings on your system, please file a github issue.
-
class
pytext.data.sources.data_source.
ShardedDataSource
(schema: Dict[str, Type[CT_co]])[source]¶ Bases:
pytext.data.sources.data_source.DataSource
Base class for sharded data sources.
-
pytext.data.sources.data_source.
generator_property
¶ alias of
pytext.data.sources.data_source.GeneratorMethodProperty
pytext.data.sources.dense_retrieval module¶
-
class
pytext.data.sources.dense_retrieval.
DenseRetrievalDataSource
(schema, train_filename=None, test_filename=None, eval_filename=None, num_negative_ctxs=1, use_title=True, use_cache=False)[source]¶ Bases:
pytext.data.sources.data_source.DataSource
Data source for DPR (https://github.com/facebookresearch/DPR).
Expects multiline JSON for lazy loading and improved memory usage. The original DPR files can be converted to multiline JSON using jq -c '.[]'.
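When jq is not available, the same conversion can be sketched in pure Python: stream the elements of a top-level JSON array out as one compact JSON object per line. In-memory buffers stand in for the real DPR files here.

```python
import io
import json

def to_multiline_json(src, dst):
    # Equivalent of `jq -c '.[]'`: one compact JSON object per output line.
    for item in json.load(src):
        dst.write(json.dumps(item, separators=(",", ":")) + "\n")

src = io.StringIO('[{"question": "q1"}, {"question": "q2"}]')
dst = io.StringIO()
to_multiline_json(src, dst)
lines = dst.getvalue().splitlines()
# lines -> ['{"question":"q1"}', '{"question":"q2"}']
```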
-
DEFAULT_SCHEMA
= {'negative_ctxs': typing.List[str], 'positive_ctx': <class 'str'>, 'question': <class 'str'>}¶
-
eval
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
classmethod
from_config
(config: pytext.data.sources.dense_retrieval.DenseRetrievalDataSource.Config, schema={'negative_ctxs': typing.List[str], 'positive_ctx': <class 'str'>, 'question': <class 'str'>})[source]¶
-
test
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
train
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
pytext.data.sources.pandas module¶
-
class
pytext.data.sources.pandas.
PandasDataSource
(train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
DataSource which loads data from a pandas DataFrame.
- Inputs:
train_df: DataFrame for training
eval_df: DataFrame for evaluation
test_df: DataFrame for testing
schema: same as the base DataSource; defines the list of output values with their types
column_mapping: maps the column names in the DataFrame to the names defined in the schema
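The column_mapping idea above can be sketched with a plain dict of columns instead of a real DataFrame, so the example runs without pandas installed; the column and field names are made up for illustration.

```python
train_table = {"utterance": ["hi", "bye"], "score": ["0.1", "0.9"]}
schema = {"text": str, "score": float}   # output fields and their types
column_mapping = {"utterance": "text"}   # source column -> schema field

def rows(table):
    # Iterate the column dict row by row, like DataFrame.iterrows would.
    columns = list(table)
    for values in zip(*table.values()):
        yield dict(zip(columns, values))

examples = [
    {column_mapping.get(col, col): schema[column_mapping.get(col, col)](val)
     for col, val in row.items()}
    for row in rows(train_table)
]
# examples -> [{"text": "hi", "score": 0.1}, {"text": "bye", "score": 0.9}]
```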
-
classmethod
from_config
(config: pytext.data.sources.pandas.PandasDataSource.Config, schema: Dict[str, Type[CT_co]])[source]¶
-
raw_eval_data_generator
()[source]¶ Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
-
class
pytext.data.sources.pandas.
SessionPandasDataSource
(schema: Dict[str, Type[CT_co]], id_col: str, train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, column_mapping: Dict[str, str] = ())[source]¶ Bases:
pytext.data.sources.pandas.PandasDataSource
, pytext.data.sources.session.SessionDataSource
pytext.data.sources.session module¶
-
class
pytext.data.sources.session.
SessionDataSource
(id_col, **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
Data source for session-based data. The input data is organized in sessions; each session may have multiple rows. The first column is always the session id. Raw input rows are consolidated by session id and returned as one session per example.
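A sketch of this consolidation: contiguous raw rows sharing a session id (the first column) are merged into one example per session. The field names and data below are illustrative.

```python
from itertools import groupby

raw_rows = [
    {"session_id": "s1", "text": "hi"},
    {"session_id": "s1", "text": "how are you"},
    {"session_id": "s2", "text": "bye"},
]

# groupby only merges adjacent rows, so the input must already be ordered
# by session id (as the docstring implies).
sessions = [
    {"session_id": session_id, "text": [row["text"] for row in rows]}
    for session_id, rows in groupby(raw_rows, key=lambda row: row["session_id"])
]
# sessions -> one example per session, with per-row fields collected in lists
```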
pytext.data.sources.squad module¶
-
class
pytext.data.sources.squad.
SquadDataSource
(train_filename=None, test_filename=None, eval_filename=None, ignore_impossible=True, max_character_length=1048576, min_overlap=0.1, delimiter='\t', quoted=False, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]¶ Bases:
pytext.data.sources.data_source.DataSource
Download data from https://rajpurkar.github.io/SQuAD-explorer/. Will return tuples of (doc, question, answer, answer_start, has_answer).
-
DEFAULT_SCHEMA
= {'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>}¶
-
eval
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
classmethod
from_config
(config: pytext.data.sources.squad.SquadDataSource.Config, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]¶
-
test
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
train
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
-
class
pytext.data.sources.squad.
SquadDataSourceForKD
(**kwargs)[source]¶ Bases:
pytext.data.sources.squad.SquadDataSource
Squad-like data along with soft labels (logits). Will return tuples of (doc, question, answer, answer_start, has_answer, start_logits, end_logits, has_answer_logits, pad_mask, segment_labels).
pytext.data.sources.tsv module¶
-
class
pytext.data.sources.tsv.
BlockShardedTSV
(file, field_names=None, delimiter='\t', quoted=False, block_id=0, num_blocks=1, drop_incomplete_rows=False)[source]¶ Bases:
object
Takes a TSV file, splits it into N pieces (by byte location), and returns an iterator over one of the pieces. The pieces are equal in byte size, not in number of rows, so care needs to be taken when using this for distributed training; otherwise the number of batches may differ across workers.
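A hedged sketch of this byte-based sharding (an illustration of the technique, not the actual pytext implementation): split the file into N equal byte ranges, assign each row to the block in which it starts, and skip to the next row boundary when a block starts mid-row.

```python
import io

def block_rows(f, block_id, num_blocks):
    # f is a binary file object; rows belong to the block where they start.
    f.seek(0, io.SEEK_END)
    size = f.tell()
    start = size * block_id // num_blocks
    end = size * (block_id + 1) // num_blocks
    if block_id == 0:
        f.seek(0)
    else:
        # Back up one byte and consume through the next newline, so a row
        # beginning exactly at `start` is kept by this block.
        f.seek(start - 1)
        f.readline()
    while f.tell() < end:
        line = f.readline()
        if not line:
            break
        yield line.decode().rstrip("\n")

data = io.BytesIO(b"a\t1\nbb\t2\nccc\t3\nd\t4\n")
shards = [list(block_rows(data, i, 2)) for i in range(2)]
all_rows = [row for shard in shards for row in shard]
# Every row appears in exactly one shard, but shard row counts can differ.
```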
-
class
pytext.data.sources.tsv.
BlockShardedTSVDataSource
(rank=0, world_size=1, **kwargs)[source]¶ Bases:
pytext.data.sources.tsv.TSVDataSource
, pytext.data.sources.data_source.ShardedDataSource
-
train_unsharded
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
-
class
pytext.data.sources.tsv.
MultilingualTSVDataSource
(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', data_source_languages={'eval': ['en'], 'test': ['en'], 'train': ['en']}, language_columns=['language'], **kwargs)[source]¶ Bases:
pytext.data.sources.tsv.TSVDataSource
Data source for multilingual data. The input data can have multiple text fields, and each field can be in the same language or in different languages. The data_source_languages dict contains the language information for each text field; this should match the number of language identifiers specified in language_columns.
-
eval
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
test
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
train
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
-
class
pytext.data.sources.tsv.
SessionTSVDataSource
(train_file=None, test_file=None, eval_file=None, field_names=None, **kwargs)[source]¶ Bases:
pytext.data.sources.tsv.TSVDataSource
, pytext.data.sources.session.SessionDataSource
-
class
pytext.data.sources.tsv.
TSV
(file, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False)[source]¶ Bases:
object
-
class
pytext.data.sources.tsv.
TSVDataSource
(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False, **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
DataSource which loads data from TSV sources. Uses Python’s csv library.
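A minimal illustration of reading tab-separated data with Python's csv module, which TSVDataSource builds on; the field names and rows below are made up.

```python
import csv
import io

tsv_text = "hello world\tgreeting\ngoodbye\tfarewell\n"
field_names = ["text", "label"]

# DictReader maps each row's columns onto the given field names.
reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=field_names, delimiter="\t")
rows = list(reader)
# rows -> [{"text": "hello world", "label": "greeting"},
#          {"text": "goodbye", "label": "farewell"}]
```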
-
classmethod
from_config
(config: pytext.data.sources.tsv.TSVDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]¶
-
raw_eval_data_generator
()[source]¶ Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
Module contents¶
-
class
pytext.data.sources.
DataSource
(schema: Dict[str, Type[CT_co]])[source]¶ Bases:
pytext.config.component.Component
Data sources are simple components that stream data from somewhere using Python’s iteration interface. Each data source should expose three iterators, “train”, “test”, and “eval”. Each of these should be iterable any number of times, and iterating over it should yield dictionaries whose values are deserialized Python types.
Simply, these data sources exist as an interface to read through datasets in a pythonic way, with pythonic types, and abstract away the form that they are stored in.
-
class
pytext.data.sources.
RawExample
[source]¶ Bases:
dict
A wrapper class for a single example row with a dict interface. This is here for any logic we want row objects to have that dicts don’t do.
-
class
pytext.data.sources.
SquadDataSource
(train_filename=None, test_filename=None, eval_filename=None, ignore_impossible=True, max_character_length=1048576, min_overlap=0.1, delimiter='\t', quoted=False, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]¶ Bases:
pytext.data.sources.data_source.DataSource
Download data from https://rajpurkar.github.io/SQuAD-explorer/. Will return tuples of (doc, question, answer, answer_start, has_answer).
-
DEFAULT_SCHEMA
= {'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>}¶
-
eval
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
classmethod
from_config
(config: pytext.data.sources.squad.SquadDataSource.Config, schema={'answer_ends': typing.List[int], 'answer_starts': typing.List[int], 'answers': typing.List[str], 'doc': <class 'str'>, 'has_answer': <class 'str'>, 'id': <class 'int'>, 'question': <class 'str'>})[source]¶
-
test
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
train
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
-
class
pytext.data.sources.
TSVDataSource
(train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', quoted=False, drop_incomplete_rows=False, **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
DataSource which loads data from TSV sources. Uses Python’s csv library.
-
classmethod
from_config
(config: pytext.data.sources.tsv.TSVDataSource.Config, schema: Dict[str, Type[CT_co]], **kwargs)[source]¶
-
raw_eval_data_generator
()[source]¶ Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
-
class
pytext.data.sources.
PandasDataSource
(train_df: Optional[pandas.core.frame.DataFrame] = None, eval_df: Optional[pandas.core.frame.DataFrame] = None, test_df: Optional[pandas.core.frame.DataFrame] = None, **kwargs)[source]¶ Bases:
pytext.data.sources.data_source.RootDataSource
DataSource which loads data from a pandas DataFrame.
- Inputs:
train_df: DataFrame for training
eval_df: DataFrame for evaluation
test_df: DataFrame for testing
schema: same as the base DataSource; defines the list of output values with their types
column_mapping: maps the column names in the DataFrame to the names defined in the schema
-
classmethod
from_config
(config: pytext.data.sources.pandas.PandasDataSource.Config, schema: Dict[str, Type[CT_co]])[source]¶
-
raw_eval_data_generator
()[source]¶ Returns a generator that yields the EVAL data one item at a time in a dictionary where each key is a field and the value is of the raw type from the source. DataSources need to implement this.
-
class
pytext.data.sources.
CoNLLUNERDataSource
(language=None, train_file=None, test_file=None, eval_file=None, field_names=None, delimiter='\t', **kwargs)[source]¶ Bases:
pytext.data.sources.conllu.CoNLLUPOSDataSource
Reads empty-line-separated data (word label). This data source supports datasets for NER tasks.
-
class
pytext.data.sources.
DenseRetrievalDataSource
(schema, train_filename=None, test_filename=None, eval_filename=None, num_negative_ctxs=1, use_title=True, use_cache=False)[source]¶ Bases:
pytext.data.sources.data_source.DataSource
Data source for DPR (https://github.com/facebookresearch/DPR).
Expects multiline JSON for lazy loading and improved memory usage. The original DPR files can be converted to multiline JSON using jq -c '.[]'.
-
DEFAULT_SCHEMA
= {'negative_ctxs': typing.List[str], 'positive_ctx': <class 'str'>, 'question': <class 'str'>}¶
-
eval
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
classmethod
from_config
(config: pytext.data.sources.dense_retrieval.DenseRetrievalDataSource.Config, schema={'negative_ctxs': typing.List[str], 'positive_ctx': <class 'str'>, 'question': <class 'str'>})[source]¶
-
test
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-
train
= <pytext.data.sources.data_source.GeneratorIterator object>¶
-