pytext.optimizer package¶
Subpackages¶
Submodules¶
pytext.optimizer.activations module¶
pytext.optimizer.adabelief module¶
class pytext.optimizer.adabelief.AdaBelief(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, weight_decouple=False, fixed_decay=False, rectify=False)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer
AdaBelief optimizer, which adapts step sizes by the "belief" in observed gradients. Paper: https://arxiv.org/abs/2010.07468. The implementation is copied from the original author's repository (https://github.com/juntang-zhuang/Adabelief-Optimizer).
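A minimal usage sketch based on the signature above. It assumes the optimizer is constructed directly from a model's parameters; within PyText it would normally be built from its Config via from_config, and the hyperparameter values here are examples only.

import torch
from pytext.optimizer.adabelief import AdaBelief

model = torch.nn.Linear(16, 2)
optimizer = AdaBelief(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0)

optimizer.zero_grad()
loss = model(torch.randn(4, 16)).sum()  # toy forward pass
loss.backward()
optimizer.step()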
pytext.optimizer.fairseq_fp16_utils module¶
class pytext.optimizer.fairseq_fp16_utils.Fairseq_FP16OptimizerMixin(*args, **kwargs)[source]¶
Bases: object
backward(loss)[source]¶
Computes the sum of gradients of the given tensor w.r.t. graph leaves.
Compared to fairseq.optim.FairseqOptimizer.backward(), this function additionally dynamically scales the loss to avoid gradient underflow.
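A conceptual sketch of the loss-scaling idea described above (not the mixin's internals, which also grow and shrink the scale dynamically based on overflow checks): the loss is multiplied by a large factor before backward so that small FP16 gradients do not underflow, and the gradients are unscaled before the optimizer step.

import torch

scale = 2.0 ** 15
param = torch.nn.Parameter(torch.randn(4, 4))
loss = (param ** 2).sum()
(loss * scale).backward()   # backward on the scaled loss
param.grad.div_(scale)      # unscale the gradient before stepping the optimizer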
load_state_dict(state_dict, optimizer_overrides=None)[source]¶
Load an optimizer state dict.
In general we should prefer the configuration of the existing optimizer instance (e.g., learning rate) over that found in the state_dict. This allows us to resume training from a checkpoint using a new set of optimizer args.
class pytext.optimizer.fairseq_fp16_utils.Fairseq_MemoryEfficientFP16OptimizerMixin(*args, **kwargs)[source]¶
Bases: object
backward(loss)[source]¶
Computes the sum of gradients of the given tensor w.r.t. graph leaves.
Compared to fairseq.optim.FairseqOptimizer.backward(), this function additionally dynamically scales the loss to avoid gradient underflow.
load_state_dict(state_dict, optimizer_overrides=None)[source]¶
Load an optimizer state dict.
In general we should prefer the configuration of the existing optimizer instance (e.g., learning rate) over that found in the state_dict. This allows us to resume training from a checkpoint using a new set of optimizer args.
pytext.optimizer.fp16_optimizer module¶
class pytext.optimizer.fp16_optimizer.DynamicLossScaler(init_scale, scale_factor, scale_window)[source]¶
Bases: object
class pytext.optimizer.fp16_optimizer.FP16Optimizer(fp32_optimizer)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer
param_groups¶
class pytext.optimizer.fp16_optimizer.FP16OptimizerApex(fp32_optimizer: pytext.optimizer.optimizers.Optimizer, model: torch.nn.modules.module.Module, opt_level: str, init_loss_scale: Optional[int], min_loss_scale: Optional[float])[source]¶
class pytext.optimizer.fp16_optimizer.FP16OptimizerDeprecated(init_optimizer, init_scale, scale_factor, scale_window)[source]¶
Bases: object
step()[source]¶
Perform the weight update.
Update the grads from the model (FP16) to the master (FP32) copy. While iterating over the parameters, check for overflow after floating and copying each grad, then unscale.
If no overflow occurred, call the inner optimizer's step() and copy the updated weights from the inner optimizer back to the model.
Finally, update the loss scale according to the result of the overflow check.
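A conceptual sketch of the master-weights update described above, under the assumption that FP32 master copies of the FP16 model parameters are kept separately. This is illustrative only, not the class's actual code, and the loss-scale bookkeeping is reduced to a single overflow check.

import torch

def fp16_step(model_params, master_params, inner_optimizer, scale):
    # Copy grads from the FP16 model params to the FP32 master params,
    # checking for overflow and unscaling along the way.
    overflow = False
    for mp, fp in zip(model_params, master_params):
        grad = mp.grad.float()
        if not torch.isfinite(grad).all():
            overflow = True
            break
        fp.grad = grad.div_(scale)
    if not overflow:
        inner_optimizer.step()                 # update the FP32 master weights
        for mp, fp in zip(model_params, master_params):
            mp.data.copy_(fp.data)             # copy updated weights back to the FP16 model
    return overflow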
class pytext.optimizer.fp16_optimizer.FP16OptimizerFairseq(fp16_params, fp32_optimizer, init_loss_scale, scale_window, scale_tolerance, threshold_loss_scale, min_loss_scale, num_accumulated_batches)[source]¶
Bases: fairseq.optim.fp16_optimizer._FP16OptimizerMixin, pytext.optimizer.fp16_optimizer.FP16Optimizer
Wrap an optimizer to support FP16 (mixed precision) training.
clip_grad_norm(max_norm, unused_model)[source]¶
Clips gradient norm and updates the dynamic loss scaler.
class pytext.optimizer.fp16_optimizer.GeneratorFP16Optimizer(init_optimizer, init_scale=65536.0, scale_factor=2, scale_window=2000)[source]¶
Bases: pytext.optimizer.fp16_optimizer.PureFP16Optimizer
load_state_dict(state_dict)[source]¶
Load an optimizer state dict.
We prefer the configuration of the existing optimizer instance. After loading the state dict into the inner_optimizer, we recreate the copies of parameter references, as in __init__().
step()[source]¶
Updates weights.
Effects: check for overflow; if there is none and the inner_optimizer supports a memory-efficient step, unscale all grads and call the memory-efficient step.
If it is not supported, replace each parameter list in the inner_optimizer's param_groups with a generator over the tensors and then call the normal step(); the data-type conversion is handled automatically inside that function.
Whether or not an overflow occurred, the loss scale is updated at the end of the step.
class pytext.optimizer.fp16_optimizer.MemoryEfficientFP16OptimizerFairseq(fp16_params, optimizer, init_loss_scale, scale_window, scale_tolerance, threshold_loss_scale, min_loss_scale, num_accumulated_batches)[source]¶
Bases: fairseq.optim.fp16_optimizer._MemoryEfficientFP16OptimizerMixin, pytext.optimizer.fp16_optimizer.FP16Optimizer
Wrap the memory-efficient optimizer to support FP16 (mixed precision) training.
clip_grad_norm(max_norm, unused_model)[source]¶
Clips gradient norm and updates the dynamic loss scaler.
class pytext.optimizer.fp16_optimizer.PureFP16Optimizer(init_optimizer, init_scale=65536.0, scale_factor=2, scale_window=2000)[source]¶
Bases: pytext.optimizer.fp16_optimizer.FP16OptimizerDeprecated
load_state_dict(state_dict)[source]¶
Load an optimizer state dict.
We prefer the configuration of the existing optimizer instance. This applies the same logic as __init__(): point the param_groups of the outer optimizer to those of the inner_optimizer.
step()[source]¶
Updates the weights in the inner optimizer.
If the inner optimizer supports a memory-efficient step, check for overflow, unscale, and call that advanced step.
Otherwise, float the weights and grads, checking for grad overflow during the iteration. If there is no overflow, unscale the grads and call the inner optimizer's step(); if an overflow occurred, do nothing and wait until the end to convert the weights and grads back to half precision (the grads will be eliminated in zero_grad).
pytext.optimizer.fp16_optimizer.convert_generator(params, scale)[source]¶
Create a generator over the parameter tensors.
Each parameter is floated and unscaled. Once the caller calls next(), the tensor is converted back to half precision and processing of the next parameter begins.
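A minimal sketch of the "float and unscale lazily" idea described above, using a hypothetical helper name. It only covers the floating and unscaling half; the conversion back to half precision that the real function performs between next() calls is omitted.

def float_and_unscale(params, scale):
    # Yield a floated, unscaled copy of each parameter tensor, one at a time.
    for p in params:
        yield p.float() / scale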
pytext.optimizer.lamb module¶
class pytext.optimizer.lamb.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, min_trust=None)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer
Implements the LAMB algorithm. This was copied directly from pytorch/contrib: https://github.com/cybertronai/pytorch-lamb. LAMB was proposed in "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes" (https://arxiv.org/abs/1904.00962).
Has the option of minimum-trust LAMB as described in "Single Headed Attention RNN: Stop Thinking With Your Head", section 6.3 (https://arxiv.org/abs/1911.11423).
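A minimal construction sketch based on the signature above; the hyperparameter values, including the min_trust setting for the minimum-trust variant, are illustrative only, and within PyText the optimizer would normally be built via from_config.

import torch
from pytext.optimizer.lamb import Lamb

model = torch.nn.Linear(8, 8)
optimizer = Lamb(model.parameters(), lr=1e-3, weight_decay=0.01, min_trust=0.25)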
pytext.optimizer.madgrad module¶
class pytext.optimizer.madgrad.MADGRAD(params, lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0, eps: float = 1e-06, k: int = 0)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer
MADGRAD optimizer: a Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization. Paper: https://arxiv.org/abs/2101.11075
The implementation is copied from the original author (https://github.com/facebookresearch/madgrad/blob/master/madgrad/madgrad.py).
add_param_group(param_group)[source]¶
Add a param group to the Optimizer's param_groups.
This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
Parameters: param_group (dict) – Specifies which Tensors should be optimized, along with group-specific optimization options.
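A hedged sketch of the fine-tuning pattern described above: a layer that starts out frozen is later unfrozen and handed to the optimizer as its own parameter group. The model, layer names, and learning rates are illustrative assumptions, not part of the library.

import torch
from pytext.optimizer.madgrad import MADGRAD

backbone = torch.nn.Linear(16, 16)
head = torch.nn.Linear(16, 2)
for p in backbone.parameters():
    p.requires_grad = False                 # backbone starts out frozen

optimizer = MADGRAD(head.parameters(), lr=0.01)

# Later in training: unfreeze the backbone and add it as a new param group.
for p in backbone.parameters():
    p.requires_grad = True
optimizer.add_param_group({"params": list(backbone.parameters()), "lr": 0.001})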
classmethod from_config(config: pytext.optimizer.madgrad.MADGRAD.Config, model: torch.nn.modules.module.Module)[source]¶
step(closure=None, **kwargs) → Optional[float][source]¶
Performs a single optimization step.
Parameters: closure – A closure that reevaluates the model and returns the loss.
supports_flat_params¶
supports_memory_efficient_fp16¶
pytext.optimizer.optimizers module¶
class pytext.optimizer.optimizers.Adagrad(parameters, lr, weight_decay)[source]¶
Bases: torch.optim.adagrad.Adagrad, pytext.optimizer.optimizers.Optimizer
class pytext.optimizer.optimizers.Adam(parameters, lr, weight_decay, eps)[source]¶
Bases: torch.optim.adam.Adam, pytext.optimizer.optimizers.Optimizer
class pytext.optimizer.optimizers.AdamW(parameters, lr, weight_decay, eps)[source]¶
Bases: torch.optim.adamw.AdamW, pytext.optimizer.optimizers.Optimizer
Adds PyText support for Decoupled Weight Decay Regularization for Adam, as described in the paper https://arxiv.org/abs/1711.05101. For more information, read the fast.ai blog post on this optimization method: https://www.fast.ai/2018/07/02/adam-weight-decay/
class pytext.optimizer.optimizers.Optimizer(config=None, *args, **kwargs)[source]¶
Bases: pytext.config.component.Component
params¶
Return an iterable of the parameters held by the optimizer.
class pytext.optimizer.optimizers.SGD(parameters, lr, momentum)[source]¶
Bases: torch.optim.sgd.SGD, pytext.optimizer.optimizers.Optimizer
pytext.optimizer.privacy_engine module¶
class pytext.optimizer.privacy_engine.PrivacyEngine(model, optimizer, noise_multiplier, max_grad_norm, batch_size, dataset_size, target_delta, alphas)[source]¶
Bases: pytext.config.component.Component
A wrapper around the PrivacyEngine of Opacus.
pytext.optimizer.radam module¶
class pytext.optimizer.radam.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer
Implements rectified Adam as derived in the paper "On the Variance of the Adaptive Learning Rate and Beyond" (https://arxiv.org/abs/1908.03265).
This code is mostly a direct copy-paste of the code provided by the authors: https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam.py
classmethod from_config(config: pytext.optimizer.radam.RAdam.Config, model: torch.nn.modules.module.Module)[source]¶
step(closure=None, **kwargs)[source]¶
Performs a single optimization step (parameter update).
Parameters: closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
Note: Unless otherwise specified, this function should not modify the .grad field of the parameters.
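A hedged usage sketch of the optional closure argument: the closure re-evaluates the model and returns the loss, as described above. The model, data, and direct construction of RAdam are illustrative assumptions; inside PyText the optimizer would normally come from from_config.

import torch
from pytext.optimizer.radam import RAdam

model = torch.nn.Linear(4, 1)
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
optimizer = RAdam(model.parameters(), lr=1e-3)

def closure():
    # Re-evaluate the model and return the loss.
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    return loss

optimizer.step(closure)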
pytext.optimizer.scheduler module¶
class pytext.optimizer.scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1, verbose=False)[source]¶
Bases: torch.optim.lr_scheduler.CosineAnnealingLR, pytext.optimizer.scheduler.BatchScheduler
Wrapper around torch.optim.lr_scheduler.CosineAnnealingLR. See the original documentation for more details.
class pytext.optimizer.scheduler.CyclicLR(optimizer, base_lr, max_lr, step_size_up=2000, step_size_down=None, mode='triangular', gamma=1.0, scale_fn=None, scale_mode='cycle', cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1, verbose=False)[source]¶
Bases: torch.optim.lr_scheduler.CyclicLR, pytext.optimizer.scheduler.BatchScheduler
Wrapper around torch.optim.lr_scheduler.CyclicLR. See the original documentation for more details.
class pytext.optimizer.scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1, verbose=False)[source]¶
Bases: torch.optim.lr_scheduler.ExponentialLR, pytext.optimizer.scheduler.Scheduler
Wrapper around torch.optim.lr_scheduler.ExponentialLR. See the original documentation for more details.
class pytext.optimizer.scheduler.LmFineTuning(optimizer, cut_frac=0.1, ratio=32, non_pretrained_param_groups=2, lm_lr_multiplier=1.0, lm_use_per_layer_lr=False, lm_gradual_unfreezing=True, last_epoch=-1)[source]¶
Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler
Fine-tuning methods from the paper "Universal Language Model Fine-tuning for Text Classification" (arXiv:1801.06146).
Specifically, modifies the training schedule using slanted triangular learning rates, discriminative fine-tuning (per-layer learning rates), and gradual unfreezing.
class pytext.optimizer.scheduler.PolynomialDecayScheduler(optimizer, warmup_steps, total_steps, end_learning_rate, power)[source]¶
Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler
Applies a polynomial decay with learning-rate warmup to the learning rate.
It is commonly observed that a monotonically decreasing learning rate, whose degree of change is carefully chosen, results in a better-performing model.
This scheduler linearly increases the learning rate from 0 to its final value over warmup_steps at the beginning of training. It then applies a polynomial decay function to the optimizer step, starting from the provided base_lrs, to reach end_learning_rate after total_steps.
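A conceptual sketch of one common formulation of this schedule, written as a plain function; the class's exact formula (for example, how the warmup endpoint and the decay interact) may differ in detail.

def poly_decay_lr(step, base_lr, end_lr, warmup_steps, total_steps, power):
    # Linear warmup from 0 to base_lr over warmup_steps.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Polynomial decay from base_lr towards end_lr until total_steps, then hold.
    step = min(step, total_steps)
    remaining = 1.0 - (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return (base_lr - end_lr) * remaining ** power + end_lr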
class pytext.optimizer.scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08, verbose=False)[source]¶
Bases: torch.optim.lr_scheduler.ReduceLROnPlateau, pytext.optimizer.scheduler.Scheduler
Wrapper around torch.optim.lr_scheduler.ReduceLROnPlateau. See the original documentation for more details.
class pytext.optimizer.scheduler.Scheduler(config=None, *args, **kwargs)[source]¶
Bases: pytext.config.component.Component
Schedulers help in adjusting the learning rate during training. Scheduler is a wrapper class over schedulers that are available in the torch library or that are custom implementations. Two kinds of LR scheduling are supported: per-epoch scheduling and per-batch scheduling. In per-epoch scheduling, the learning rate is adjusted at the end of each epoch; in per-batch scheduling, it is adjusted after the forward and backward pass through one batch during training.
There are two main methods that need to be implemented by a Scheduler: step_epoch(), which is called at the end of each epoch, and step_batch(), which is called at the end of each batch in the training data.
The prepare() method can be used by BatchSchedulers to initialize any attributes they may need.
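A toy sketch of the two hooks described above, showing where a training loop would call them; the ToyScheduler class is purely illustrative and is not a real PyText component.

import torch

class ToyScheduler:
    def __init__(self, optimizer, gamma=0.5):
        self.optimizer, self.gamma = optimizer, gamma

    def step_batch(self):
        pass                                    # per-batch scheduling: no-op in this toy

    def step_epoch(self):
        for group in self.optimizer.param_groups:
            group["lr"] *= self.gamma           # per-epoch scheduling: halve the LR

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = ToyScheduler(optimizer)

for epoch in range(2):
    for _ in range(3):
        optimizer.zero_grad()
        model(torch.randn(8, 4)).sum().backward()
        optimizer.step()
        scheduler.step_batch()                  # called after each batch
    scheduler.step_epoch()                      # called at the end of each epoch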
class pytext.optimizer.scheduler.SchedulerWithWarmup(optimizer, warmup_scheduler, scheduler, switch_steps)[source]¶
Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler
Wraps another scheduler with a warmup phase. After the warmup_steps defined in warmup_scheduler.warmup_steps, it switches to the scheduler specified in scheduler.
warmup_scheduler: the configuration for the WarmupScheduler, which warms up the learning rate linearly over warmup_steps.
scheduler: the main scheduler that is applied after the warmup phase (once warmup_steps have passed).
class pytext.optimizer.scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1, verbose=False)[source]¶
Bases: torch.optim.lr_scheduler.StepLR, pytext.optimizer.scheduler.Scheduler
Wrapper around torch.optim.lr_scheduler.StepLR. See the original documentation for more details.
class pytext.optimizer.scheduler.WarmupScheduler(optimizer, warmup_steps, inverse_sqrt_decay)[source]¶
Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler
Scheduler to linearly increase the learning rate from 0 to its final value over a number of steps:
lr = base_lr * current_step / warmup_steps
After the warm-up phase, the scheduler has the option of decaying the learning rate as the inverse square root of the number of training steps taken:
lr = base_lr * sqrt(warmup_steps) / sqrt(current_step)
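A small sketch that evaluates the two formulas above as a plain function, under the assumption that the step count starts at 0; it mirrors the documented behaviour but is not the scheduler's own code.

import math

def warmup_lr(base_lr, current_step, warmup_steps, inverse_sqrt_decay=True):
    if current_step < warmup_steps:
        return base_lr * current_step / warmup_steps                         # linear warmup
    if inverse_sqrt_decay:
        return base_lr * math.sqrt(warmup_steps) / math.sqrt(current_step)   # inverse-sqrt decay
    return base_lr                                                           # hold after warmup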
pytext.optimizer.swa module¶
class pytext.optimizer.swa.StochasticWeightAveraging(optimizer, swa_start=None, swa_freq=None, swa_lr=None)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer
add_param_group(param_group)[source]¶
Add a param group to the Optimizer's param_groups.
This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
Parameters: param_group (dict) – Specifies which Tensors should be optimized, along with group-specific optimization options.
static bn_update(loader, model, device=None)[source]¶
Updates the BatchNorm running_mean and running_var buffers in the model.
It performs one pass over the data in loader to estimate the activation statistics for the BatchNorm layers in the model.
Parameters:
- loader (torch.utils.data.DataLoader) – dataset loader to compute the activation statistics on. Each data batch should be either a tensor, or a list/tuple whose first element is a tensor containing data.
- model (torch.nn.Module) – model for which we seek to update BatchNorm statistics.
- device (torch.device, optional) – If set, data will be transferred to device before being passed into model.
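A hedged usage sketch of bn_update, following the parameter description above; the tiny model and synthetic DataLoader are illustrative assumptions.

import torch
from torch.utils.data import DataLoader, TensorDataset
from pytext.optimizer.swa import StochasticWeightAveraging

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
loader = DataLoader(TensorDataset(torch.randn(32, 4)), batch_size=8)

# One pass over the loader to re-estimate the BatchNorm running statistics.
StochasticWeightAveraging.bn_update(loader, model)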
finalize()[source]¶
Swaps the values of the optimized variables and the SWA buffers.
It is meant to be called at the end of training to use the collected SWA running averages. It can also be used to evaluate the running averages during training; to continue training, swap_swa_sgd should be called again.
classmethod from_config(config: pytext.optimizer.swa.StochasticWeightAveraging.Config, model: torch.nn.modules.module.Module)[source]¶
load_state_dict(state_dict)[source]¶
Loads the optimizer state.
Parameters: state_dict (dict) – SWA optimizer state. Should be an object returned from a call to state_dict.
state_dict()[source]¶
Returns the state of SWA as a dict.
It contains three entries:
- opt_state – a dict holding the current optimization state of the base optimizer. Its content differs between optimizer classes.
- swa_state – a dict containing the current state of SWA. For each optimized variable it contains a swa_buffer keeping the running average of the variable.
- param_groups – a dict containing all parameter groups.
step(closure=None, **kwargs)[source]¶
Performs a single optimization step.
In automatic mode, also updates the SWA running averages.
update_swa_group(group)[source]¶
Updates the SWA running averages for the given parameter group.
Parameters: param_group (dict) – Specifies for which parameter group the SWA running averages should be updated.
Examples
>>> # automatic mode
>>> base_opt = torch.optim.SGD([{'params': [x]},
>>>                             {'params': [y], 'lr': 1e-3}],
>>>                            lr=1e-2, momentum=0.9)
>>> opt = torchcontrib.optim.SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         # Update SWA for the second parameter group
>>>         opt.update_swa_group(opt.param_groups[1])
>>> opt.swap_swa_sgd()