Config Files Explained

PyText models and training tasks contain many components, and each component expects many parameters that define its behavior. PyText uses a config to specify those parameters. The config can be loaded from a JSON file, which is what we describe here.

Structure of a Config File

A typical config file only contains the parameters specific to your project. Here’s a fully working JSON file, and it does not need to be more complicated than this:

{
    "task": {
        "DocumentClassificationTask": {
            "data": {
              "source": {
                "TSVDataSource": {
                  "field_names": ["label", "text"],
                  "train_filename": "my/data/train.tsv",
                  "eval_filename": "my/data/eval.tsv",
                  "test_filename": "my/data/test.tsv"
                }
              }
            },
            "model": {
                "embedding": {
                    "embed_dim": 200
                }
            }
        }
    },
    "version": 15
}

At the top level, the most important settings are the “task” and the “version”. “task” defines the Task component to be used, which specifies where to get the “data”, which “model” to train, which “trainer” to use, and which “metric_reporter” will present the results.
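To make the top-level layout concrete, here is a minimal sketch that parses a config with Python's standard json module and reads the task class name, the task's parameter names, and the version. It does not use PyText itself; the helper name describe_config is hypothetical, and the config text is a shortened copy of the example above.

```python
import json

CONFIG_TEXT = """
{
    "task": {
        "DocumentClassificationTask": {
            "data": {
                "source": {
                    "TSVDataSource": {
                        "field_names": ["label", "text"],
                        "train_filename": "my/data/train.tsv"
                    }
                }
            },
            "model": {"embedding": {"embed_dim": 200}}
        }
    },
    "version": 15
}
"""


def describe_config(config):
    """Return (task class name, sorted task parameter names, version).

    Hypothetical helper for illustration only; not part of PyText.
    """
    # "task" holds exactly one key: the class name of the Task component.
    ((task_name, task_params),) = config["task"].items()
    return task_name, sorted(task_params), config.get("version")


config = json.loads(CONFIG_TEXT)
task_name, param_names, version = describe_config(config)
print(task_name, param_names, version)
```

Here “trainer” and “metric_reporter” do not appear among the parameter names because the config omits them and relies on their defaults.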

Each of those parameters can be a Component that is specified by its class name, or omitted to use the default class with its default parameters. In the example above, we specify TSVDataSource to use this class, but we skip the model class name because we want to use the default DocModel.

The “version” number helps PyText maintain backwards compatibility. PyText uses config adapters internally to try to update older configs to match the latest component parameters, so you don’t have to change your configs at each PyText update. To manually update your config to the latest version, you can use the update-config command.

Parameters in Config File

Parameters are either a component or a value. In the config above, we see that “field_names” expects a list of strings, “train_filename” expects a string, and “embed_dim” expects an integer.

“source” and “model”, however, expect a component. As we saw in the previous section, we can optionally specify the class name of a component if we want one that is not the default. We can tell a class name from a parameter name by looking at the first letter: class names start with an upper-case letter. For “source” we specified TSVDataSource, but for “model” we did not, letting DocumentClassificationTask use its default, DocModel. We could have specified the class name explicitly like this, which would be equivalent:

"model": {
    "DocModel": {
        "embedding": {
            "embed_dim": 200
        }
    }
}
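The first-letter rule can be sketched in a few lines of Python. This is only an illustration of the naming convention described above, not how PyText resolves components internally; split_component is a hypothetical helper.

```python
def split_component(value):
    """Split a component value into (class_name, params).

    Hypothetical helper, not PyText code. class_name is None when the
    config omits the class name and relies on the default class.
    """
    if isinstance(value, dict) and len(value) == 1:
        ((key, inner),) = value.items()
        # Class names start with an upper-case letter; parameter names do not.
        if key[:1].isupper() and isinstance(inner, dict):
            return key, inner
    return None, value


explicit = {"DocModel": {"embedding": {"embed_dim": 200}}}
implicit = {"embedding": {"embed_dim": 200}}

explicit_class, explicit_params = split_component(explicit)
implicit_class, implicit_params = split_component(implicit)

print(explicit_class, explicit_params)
print(implicit_class, implicit_params)
```

Both forms yield the same parameters; only the explicit form carries a class name, which is why the two configs are equivalent when the default class is DocModel.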

The default representation for DocModel is BiLSTMDocAttention. We did not specify “representation” before because we were happy with this default. But if we decide to use DocNNRepresentation instead, we would modify the config like this:

"model": {
    "embedding": {
        "embed_dim": 200
    },
    "representation": {
        "DocNNRepresentation": {
        }
    }
}

In this example we only want to change the class of “representation” and use its default parameters, so we don’t need to specify any of them and can leave its parameter set empty: {}.

To explore more component parameters and their possible values, you can use the help-config command or browse the class documentation.

Changing a Config File

Users typically start with an existing config file, or create one using the gen-default-config command, and then edit it to tweak the parameters.

The file generated by gen-default-config is very large, because it contains the default value of every parameter for every component. Any of those parameters can be omitted from the config file, because PyText can recover their default values.

In general, remove from your config file every parameter you don’t want to override, and keep those you want to override now or might want to tweak later.
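The trimming step can be sketched as a recursive comparison against the defaults. This is a rough illustration, not a PyText feature: prune_defaults is a hypothetical helper, and the null default for “field_names” is made up for the example (the “\t” and false defaults are the ones mentioned on this page).

```python
def prune_defaults(config, defaults):
    """Return only the entries of `config` that differ from `defaults`.

    Hypothetical helper for illustration; not part of PyText.
    """
    pruned = {}
    for key, value in config.items():
        if key not in defaults:
            # Unknown to the defaults (e.g. a non-default class name): keep it.
            pruned[key] = value
        elif isinstance(value, dict) and isinstance(defaults[key], dict):
            nested = prune_defaults(value, defaults[key])
            if nested:
                pruned[key] = nested
        elif value != defaults[key]:
            pruned[key] = value
    return pruned


# The field_names default below is an assumption made for this example.
defaults = {"delimiter": "\t", "quoted": False, "field_names": None}
full = {"delimiter": "\t", "quoted": False, "field_names": ["label", "text"]}

pruned = prune_defaults(full, defaults)
print(pruned)  # {'field_names': ['label', 'text']}
```

In practice you would also keep any parameter you expect to tweak later, even if it currently matches its default.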

For example, TSVDataSource can use a different “delimiter”, but in most cases we want the default “\t” for tab-separated values (TSV) files, so the config above does not specify “delimiter”: “\t”. If we wanted to load a CSV file, we could override this default by adding our own “delimiter” to the config. And since CSV fields can be quoted, whereas the “quoted” option defaults to false for TSV, we would also override it with true:

"TSVDataSource": {
    "delimiter": ",",
    "quoted": true,
    "field_names": ["label", "text"],
    "train_filename": "my/data/train.csv",
    "eval_filename": "my/data/eval.csv",
    "test_filename": "my/data/test.csv"
}
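To see why quoting matters for comma-separated data, the following sketch reads the same two lines with and without quote handling, using Python's standard csv module. It only illustrates the general idea; how TSVDataSource maps “delimiter” and “quoted” onto its reader is not shown here, and the sample rows are invented.

```python
import csv
import io

# Invented sample data: the first text field contains a comma inside quotes.
CSV_TEXT = 'positive,"Great, would buy again"\nnegative,Too slow\n'
field_names = ["label", "text"]

# With quote handling, the quoted comma stays inside the text field.
quoted_reader = csv.reader(
    io.StringIO(CSV_TEXT), delimiter=",", quoting=csv.QUOTE_MINIMAL
)
rows = [dict(zip(field_names, row)) for row in quoted_reader]

# Without quote handling, the same line splits into three fields.
unquoted_reader = csv.reader(
    io.StringIO(CSV_TEXT), delimiter=",", quoting=csv.QUOTE_NONE
)
naive_first_row = next(unquoted_reader)

print(rows[0])
print(naive_first_row)
```

With quoting disabled, the first row no longer lines up with the two names in “field_names”, which is the kind of mismatch that enabling “quoted” is meant to avoid.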

The config at the top of this page is a fully working example. It could be simplified even further by removing the “model” section if you don’t want to change any of the model parameters; here it is kept in order to override “embed_dim”.

JSON Format Primer

A few notes about the JSON syntax and how it differs from Python:

  • field names and string values must all be quoted with “double-quotes”
  • booleans are lower case: true, false (True and False in Python)
  • no trailing comma (after the last value of a block)
  • the empty value is null (None in Python)
  • indentation is optional but recommended for readability
  • in a config file, the first character must be { and the last one must be }
  • all brackets must be balanced: {}, []
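The differences listed above can be checked quickly with Python's standard json module. This is a small sketch that is independent of PyText; the sample strings are made up for the demonstration.

```python
import json

# JSON true/null become Python True/None when parsed.
parsed = json.loads(
    '{"quoted": true, "delimiter": null, "field_names": ["label", "text"]}'
)
print(parsed)

# A trailing comma is fine in a Python dict literal but invalid in JSON.
try:
    json.loads('{"embed_dim": 200,}')
    trailing_comma_ok = True
except json.JSONDecodeError:
    trailing_comma_ok = False

# Single-quoted names are fine in Python but invalid in JSON.
try:
    json.loads("{'embed_dim': 200}")
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False

print(trailing_comma_ok, single_quotes_ok)
```

Running a config file through json.loads like this is a cheap way to catch syntax mistakes before handing the file to PyText.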