pytext.optimizer package¶
Subpackages¶
Submodules¶
pytext.optimizer.activations module¶
pytext.optimizer.adabelief module¶
class pytext.optimizer.adabelief.AdaBelief(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, weight_decouple=False, fixed_decay=False, rectify=False)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer
AdaBelief optimizer, which adapts step sizes by the "belief" in observed gradients. Paper: https://arxiv.org/abs/2010.07468. The implementation is copied from the original author's repository (https://github.com/juntang-zhuang/Adabelief-Optimizer).
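A minimal usage sketch based on the signature above. It assumes the optimizer is constructed directly from a model's parameters; within PyText it would normally be built from its Config via from_config, and the hyperparameter values here are examples only.

import torch
from pytext.optimizer.adabelief import AdaBelief

model = torch.nn.Linear(16, 2)
optimizer = AdaBelief(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0)

optimizer.zero_grad()
loss = model(torch.randn(4, 16)).sum()  # toy forward pass
loss.backward()
optimizer.step()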
pytext.optimizer.fairseq_fp16_utils module¶
class pytext.optimizer.fairseq_fp16_utils.Fairseq_FP16OptimizerMixin(*args, **kwargs)[source]¶
Bases: object
backward(loss)[source]¶
Computes the sum of gradients of the given tensor w.r.t. graph leaves.
Compared to fairseq.optim.FairseqOptimizer.backward(), this function additionally dynamically scales the loss to avoid gradient underflow.
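A conceptual sketch of the loss-scaling idea described above (not the mixin's internals, which also grow and shrink the scale dynamically based on overflow checks): the loss is multiplied by a large factor before backward so that small FP16 gradients do not underflow, and the gradients are unscaled before the optimizer step.

import torch

scale = 2.0 ** 15
param = torch.nn.Parameter(torch.randn(4, 4))
loss = (param ** 2).sum()
(loss * scale).backward()   # backward on the scaled loss
param.grad.div_(scale)      # unscale the gradient before stepping the optimizer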
load_state_dict(state_dict, optimizer_overrides=None)[source]¶
Load an optimizer state dict.
In general we should prefer the configuration of the existing optimizer instance (e.g., learning rate) over that found in the state_dict. This allows us to resume training from a checkpoint using a new set of optimizer args.
class pytext.optimizer.fairseq_fp16_utils.Fairseq_MemoryEfficientFP16OptimizerMixin(*args, **kwargs)[source]¶
Bases: object
backward(loss)[source]¶
Computes the sum of gradients of the given tensor w.r.t. graph leaves.
Compared to fairseq.optim.FairseqOptimizer.backward(), this function additionally dynamically scales the loss to avoid gradient underflow.
load_state_dict(state_dict, optimizer_overrides=None)[source]¶
Load an optimizer state dict.
In general we should prefer the configuration of the existing optimizer instance (e.g., learning rate) over that found in the state_dict. This allows us to resume training from a checkpoint using a new set of optimizer args.
pytext.optimizer.fp16_optimizer module¶
class pytext.optimizer.fp16_optimizer.DynamicLossScaler(init_scale, scale_factor, scale_window)[source]¶
Bases: object
class pytext.optimizer.fp16_optimizer.FP16Optimizer(fp32_optimizer)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer
param_groups¶
class pytext.optimizer.fp16_optimizer.FP16OptimizerApex(fp32_optimizer: pytext.optimizer.optimizers.Optimizer, model: torch.nn.modules.module.Module, opt_level: str, init_loss_scale: Optional[int], min_loss_scale: Optional[float])[source]¶
class pytext.optimizer.fp16_optimizer.FP16OptimizerDeprecated(init_optimizer, init_scale, scale_factor, scale_window)[source]¶
Bases: object
step()[source]¶
Perform the weight update.
Update the grads from the model (FP16) to the master (FP32) copy. While iterating over the parameters, check for overflow after floating and copying each grad, then unscale.
If no overflow occurred, call the inner optimizer's step() and copy the updated weights from the inner optimizer back to the model.
Finally, update the loss scale according to the result of the overflow check.
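A conceptual sketch of the master-weights update described above, under the assumption that FP32 master copies of the FP16 model parameters are kept separately. This is illustrative only, not the class's actual code, and the loss-scale bookkeeping is reduced to a single overflow check.

import torch

def fp16_step(model_params, master_params, inner_optimizer, scale):
    # Copy grads from the FP16 model params to the FP32 master params,
    # checking for overflow and unscaling along the way.
    overflow = False
    for mp, fp in zip(model_params, master_params):
        grad = mp.grad.float()
        if not torch.isfinite(grad).all():
            overflow = True
            break
        fp.grad = grad.div_(scale)
    if not overflow:
        inner_optimizer.step()                 # update the FP32 master weights
        for mp, fp in zip(model_params, master_params):
            mp.data.copy_(fp.data)             # copy updated weights back to the FP16 model
    return overflow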
class pytext.optimizer.fp16_optimizer.FP16OptimizerFairseq(fp16_params, fp32_optimizer, init_loss_scale, scale_window, scale_tolerance, threshold_loss_scale, min_loss_scale, num_accumulated_batches)[source]¶
Bases: fairseq.optim.fp16_optimizer._FP16OptimizerMixin, pytext.optimizer.fp16_optimizer.FP16Optimizer
Wrap an optimizer to support FP16 (mixed precision) training.
clip_grad_norm(max_norm, unused_model)[source]¶
Clips gradient norm and updates the dynamic loss scaler.
class pytext.optimizer.fp16_optimizer.GeneratorFP16Optimizer(init_optimizer, init_scale=65536.0, scale_factor=2, scale_window=2000)[source]¶
Bases: pytext.optimizer.fp16_optimizer.PureFP16Optimizer
load_state_dict(state_dict)[source]¶
Load an optimizer state dict.
We prefer the configuration of the existing optimizer instance. After loading the state dict into the inner_optimizer, we recreate the copies of parameter references, as in __init__().
step()[source]¶
Updates weights.
Effects: check for overflow; if there is none and the inner_optimizer supports a memory-efficient step, unscale all grads and call the memory-efficient step.
If it is not supported, replace each parameter list in the inner_optimizer's param_groups with a generator over the tensors and then call the normal step(); the data-type conversion is handled automatically inside that function.
Whether or not an overflow occurred, the loss scale is updated at the end of the step.
class pytext.optimizer.fp16_optimizer.MemoryEfficientFP16OptimizerFairseq(fp16_params, optimizer, init_loss_scale, scale_window, scale_tolerance, threshold_loss_scale, min_loss_scale, num_accumulated_batches)[source]¶
Bases: fairseq.optim.fp16_optimizer._MemoryEfficientFP16OptimizerMixin, pytext.optimizer.fp16_optimizer.FP16Optimizer
Wrap the memory-efficient optimizer to support FP16 (mixed precision) training.
clip_grad_norm(max_norm, unused_model)[source]¶
Clips gradient norm and updates the dynamic loss scaler.
class pytext.optimizer.fp16_optimizer.PureFP16Optimizer(init_optimizer, init_scale=65536.0, scale_factor=2, scale_window=2000)[source]¶
Bases: pytext.optimizer.fp16_optimizer.FP16OptimizerDeprecated
load_state_dict(state_dict)[source]¶
Load an optimizer state dict.
We prefer the configuration of the existing optimizer instance. This applies the same logic as __init__(): point the param_groups of the outer optimizer to those of the inner_optimizer.
step()[source]¶
Updates the weights in the inner optimizer.
If the inner optimizer supports a memory-efficient step, check for overflow, unscale, and call that advanced step.
Otherwise, float the weights and grads, checking for grad overflow during the iteration. If there is no overflow, unscale the grads and call the inner optimizer's step(); if an overflow occurred, do nothing and wait until the end to convert the weights and grads back to half precision (the grads will be eliminated in zero_grad).
pytext.optimizer.fp16_optimizer.convert_generator(params, scale)[source]¶
Create a generator over the parameter tensors.
Each parameter is floated and unscaled. Once the caller calls next(), the tensor is converted back to half precision and processing of the next parameter begins.
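A minimal sketch of the "float and unscale lazily" idea described above, using a hypothetical helper name. It only covers the floating and unscaling half; the conversion back to half precision that the real function performs between next() calls is omitted.

def float_and_unscale(params, scale):
    # Yield a floated, unscaled copy of each parameter tensor, one at a time.
    for p in params:
        yield p.float() / scale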
pytext.optimizer.lamb module¶
class pytext.optimizer.lamb.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, min_trust=None)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer
Implements the LAMB algorithm. This was copied directly from pytorch/contrib: https://github.com/cybertronai/pytorch-lamb. LAMB was proposed in "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes" (https://arxiv.org/abs/1904.00962).
Has the option of minimum-trust LAMB as described in "Single Headed Attention RNN: Stop Thinking With Your Head", section 6.3 (https://arxiv.org/abs/1911.11423).
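A minimal construction sketch based on the signature above; the hyperparameter values, including the min_trust setting for the minimum-trust variant, are illustrative only, and within PyText the optimizer would normally be built via from_config.

import torch
from pytext.optimizer.lamb import Lamb

model = torch.nn.Linear(8, 8)
optimizer = Lamb(model.parameters(), lr=1e-3, weight_decay=0.01, min_trust=0.25)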
pytext.optimizer.madgrad module¶
class pytext.optimizer.madgrad.MADGRAD(params, lr: float = 0.01, momentum: float = 0.9, weight_decay: float = 0, eps: float = 1e-06, k: int = 0)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer
MADGRAD optimizer: a Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization. Paper: https://arxiv.org/abs/2101.11075
The implementation is copied from the original author (https://github.com/facebookresearch/madgrad/blob/master/madgrad/madgrad.py).
add_param_group(param_group)[source]¶
Add a param group to the Optimizer's param_groups.
This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
Parameters: param_group (dict) – Specifies which Tensors should be optimized, along with group-specific optimization options.
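A hedged sketch of the fine-tuning pattern described above: a layer that starts out frozen is later unfrozen and handed to the optimizer as its own parameter group. The model, layer names, and learning rates are illustrative assumptions, not part of the library.

import torch
from pytext.optimizer.madgrad import MADGRAD

backbone = torch.nn.Linear(16, 16)
head = torch.nn.Linear(16, 2)
for p in backbone.parameters():
    p.requires_grad = False                 # backbone starts out frozen

optimizer = MADGRAD(head.parameters(), lr=0.01)

# Later in training: unfreeze the backbone and add it as a new param group.
for p in backbone.parameters():
    p.requires_grad = True
optimizer.add_param_group({"params": list(backbone.parameters()), "lr": 0.001})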
classmethod from_config(config: pytext.optimizer.madgrad.MADGRAD.Config, model: torch.nn.modules.module.Module)[source]¶
step(closure=None, **kwargs) → Optional[float][source]¶
Performs a single optimization step.
Parameters: closure – A closure that reevaluates the model and returns the loss.
supports_flat_params¶
supports_memory_efficient_fp16¶
pytext.optimizer.optimizers module¶
class pytext.optimizer.optimizers.Adagrad(parameters, lr, weight_decay)[source]¶
Bases: torch.optim.adagrad.Adagrad, pytext.optimizer.optimizers.Optimizer
class pytext.optimizer.optimizers.Adam(parameters, lr, weight_decay, eps)[source]¶
Bases: torch.optim.adam.Adam, pytext.optimizer.optimizers.Optimizer
class pytext.optimizer.optimizers.AdamW(parameters, lr, weight_decay, eps)[source]¶
Bases: torch.optim.adamw.AdamW, pytext.optimizer.optimizers.Optimizer
Adds PyText support for Decoupled Weight Decay Regularization for Adam, as described in the paper https://arxiv.org/abs/1711.05101. For more information, read the fast.ai blog post on this optimization method: https://www.fast.ai/2018/07/02/adam-weight-decay/
class pytext.optimizer.optimizers.Optimizer(config=None, *args, **kwargs)[source]¶
Bases: pytext.config.component.Component
params¶
Return an iterable of the parameters held by the optimizer.
class pytext.optimizer.optimizers.SGD(parameters, lr, momentum)[source]¶
Bases: torch.optim.sgd.SGD, pytext.optimizer.optimizers.Optimizer
pytext.optimizer.privacy_engine module¶
class pytext.optimizer.privacy_engine.PrivacyEngine(model, optimizer, noise_multiplier, max_grad_norm, batch_size, dataset_size, target_delta, alphas)[source]¶
Bases: pytext.config.component.Component
A wrapper around the PrivacyEngine of Opacus.
pytext.optimizer.radam module¶
class pytext.optimizer.radam.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer
Implements rectified Adam as derived in the paper "On the Variance of the Adaptive Learning Rate and Beyond" (https://arxiv.org/abs/1908.03265).
This code is mostly a direct copy-paste of the code provided by the authors: https://github.com/LiyuanLucasLiu/RAdam/blob/master/radam.py
classmethod from_config(config: pytext.optimizer.radam.RAdam.Config, model: torch.nn.modules.module.Module)[source]¶
step(closure=None, **kwargs)[source]¶
Performs a single optimization step (parameter update).
Parameters: closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
Note: Unless otherwise specified, this function should not modify the .grad field of the parameters.
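A hedged usage sketch of the optional closure argument: the closure re-evaluates the model and returns the loss, as described above. The model, data, and direct construction of RAdam are illustrative assumptions; inside PyText the optimizer would normally come from from_config.

import torch
from pytext.optimizer.radam import RAdam

model = torch.nn.Linear(4, 1)
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)
optimizer = RAdam(model.parameters(), lr=1e-3)

def closure():
    # Re-evaluate the model and return the loss.
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    return loss

optimizer.step(closure)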
pytext.optimizer.scheduler module¶
class pytext.optimizer.scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1, verbose=False)[source]¶
Bases: torch.optim.lr_scheduler.CosineAnnealingLR, pytext.optimizer.scheduler.BatchScheduler
Wrapper around torch.optim.lr_scheduler.CosineAnnealingLR. See the original documentation for more details.
class pytext.optimizer.scheduler.CyclicLR(optimizer, base_lr, max_lr, step_size_up=2000, step_size_down=None, mode='triangular', gamma=1.0, scale_fn=None, scale_mode='cycle', cycle_momentum=True, base_momentum=0.8, max_momentum=0.9, last_epoch=-1, verbose=False)[source]¶
Bases: torch.optim.lr_scheduler.CyclicLR, pytext.optimizer.scheduler.BatchScheduler
Wrapper around torch.optim.lr_scheduler.CyclicLR. See the original documentation for more details.
class pytext.optimizer.scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1, verbose=False)[source]¶
Bases: torch.optim.lr_scheduler.ExponentialLR, pytext.optimizer.scheduler.Scheduler
Wrapper around torch.optim.lr_scheduler.ExponentialLR. See the original documentation for more details.
class pytext.optimizer.scheduler.LmFineTuning(optimizer, cut_frac=0.1, ratio=32, non_pretrained_param_groups=2, lm_lr_multiplier=1.0, lm_use_per_layer_lr=False, lm_gradual_unfreezing=True, last_epoch=-1)[source]¶
Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler
Fine-tuning methods from the paper "Universal Language Model Fine-tuning for Text Classification" (arXiv:1801.06146).
Specifically, modifies the training schedule using slanted triangular learning rates, discriminative fine-tuning (per-layer learning rates), and gradual unfreezing.
class pytext.optimizer.scheduler.PolynomialDecayScheduler(optimizer, warmup_steps, total_steps, end_learning_rate, power)[source]¶
Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler
Applies a polynomial decay with learning-rate warmup to the learning rate.
It is commonly observed that a monotonically decreasing learning rate, whose degree of change is carefully chosen, results in a better-performing model.
This scheduler linearly increases the learning rate from 0 to its final value over warmup_steps at the beginning of training. It then applies a polynomial decay function to the optimizer step, starting from the provided base_lrs, to reach end_learning_rate after total_steps.
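A conceptual sketch of one common formulation of this schedule, written as a plain function; the class's exact formula (for example, how the warmup endpoint and the decay interact) may differ in detail.

def poly_decay_lr(step, base_lr, end_lr, warmup_steps, total_steps, power):
    # Linear warmup from 0 to base_lr over warmup_steps.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Polynomial decay from base_lr towards end_lr until total_steps, then hold.
    step = min(step, total_steps)
    remaining = 1.0 - (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return (base_lr - end_lr) * remaining ** power + end_lr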
class pytext.optimizer.scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08, verbose=False)[source]¶
Bases: torch.optim.lr_scheduler.ReduceLROnPlateau, pytext.optimizer.scheduler.Scheduler
Wrapper around torch.optim.lr_scheduler.ReduceLROnPlateau. See the original documentation for more details.
class pytext.optimizer.scheduler.Scheduler(config=None, *args, **kwargs)[source]¶
Bases: pytext.config.component.Component
Schedulers help in adjusting the learning rate during training. Scheduler is a wrapper class over schedulers that are available in the torch library or that are custom implementations. Two kinds of LR scheduling are supported: per-epoch scheduling and per-batch scheduling. In per-epoch scheduling, the learning rate is adjusted at the end of each epoch; in per-batch scheduling, it is adjusted after the forward and backward pass through one batch during training.
There are two main methods that need to be implemented by a Scheduler: step_epoch(), which is called at the end of each epoch, and step_batch(), which is called at the end of each batch in the training data.
The prepare() method can be used by BatchSchedulers to initialize any attributes they may need.
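A toy sketch of the two hooks described above, showing where a training loop would call them; the ToyScheduler class is purely illustrative and is not a real PyText component.

import torch

class ToyScheduler:
    def __init__(self, optimizer, gamma=0.5):
        self.optimizer, self.gamma = optimizer, gamma

    def step_batch(self):
        pass                                    # per-batch scheduling: no-op in this toy

    def step_epoch(self):
        for group in self.optimizer.param_groups:
            group["lr"] *= self.gamma           # per-epoch scheduling: halve the LR

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = ToyScheduler(optimizer)

for epoch in range(2):
    for _ in range(3):
        optimizer.zero_grad()
        model(torch.randn(8, 4)).sum().backward()
        optimizer.step()
        scheduler.step_batch()                  # called after each batch
    scheduler.step_epoch()                      # called at the end of each epoch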
class pytext.optimizer.scheduler.SchedulerWithWarmup(optimizer, warmup_scheduler, scheduler, switch_steps)[source]¶
Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler
Wraps another scheduler with a warmup phase. After the warmup_steps defined in warmup_scheduler.warmup_steps, it switches to the scheduler specified in scheduler.
warmup_scheduler: the configuration for the WarmupScheduler, which warms up the learning rate linearly over warmup_steps.
scheduler: the main scheduler that is applied after the warmup phase (once warmup_steps have passed).
class pytext.optimizer.scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1, verbose=False)[source]¶
Bases: torch.optim.lr_scheduler.StepLR, pytext.optimizer.scheduler.Scheduler
Wrapper around torch.optim.lr_scheduler.StepLR. See the original documentation for more details.
class pytext.optimizer.scheduler.WarmupScheduler(optimizer, warmup_steps, inverse_sqrt_decay)[source]¶
Bases: torch.optim.lr_scheduler._LRScheduler, pytext.optimizer.scheduler.BatchScheduler
Scheduler to linearly increase the learning rate from 0 to its final value over a number of steps:
lr = base_lr * current_step / warmup_steps
After the warm-up phase, the scheduler has the option of decaying the learning rate as the inverse square root of the number of training steps taken:
lr = base_lr * sqrt(warmup_steps) / sqrt(current_step)
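A small sketch that evaluates the two formulas above as a plain function, under the assumption that the step count starts at 0; it mirrors the documented behaviour but is not the scheduler's own code.

import math

def warmup_lr(base_lr, current_step, warmup_steps, inverse_sqrt_decay=True):
    if current_step < warmup_steps:
        return base_lr * current_step / warmup_steps                         # linear warmup
    if inverse_sqrt_decay:
        return base_lr * math.sqrt(warmup_steps) / math.sqrt(current_step)   # inverse-sqrt decay
    return base_lr                                                           # hold after warmup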
pytext.optimizer.swa module¶
class pytext.optimizer.swa.StochasticWeightAveraging(optimizer, swa_start=None, swa_freq=None, swa_lr=None)[source]¶
Bases: pytext.optimizer.optimizers.Optimizer, torch.optim.optimizer.Optimizer
add_param_group(param_group)[source]¶
Add a param group to the Optimizer's param_groups.
This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
Parameters: param_group (dict) – Specifies which Tensors should be optimized, along with group-specific optimization options.
static bn_update(loader, model, device=None)[source]¶
Updates the BatchNorm running_mean and running_var buffers in the model.
It performs one pass over the data in loader to estimate the activation statistics for the BatchNorm layers in the model.
Parameters:
- loader (torch.utils.data.DataLoader) – dataset loader to compute the activation statistics on. Each data batch should be either a tensor, or a list/tuple whose first element is a tensor containing data.
- model (torch.nn.Module) – model for which we seek to update BatchNorm statistics.
- device (torch.device, optional) – If set, data will be transferred to device before being passed into model.
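A hedged usage sketch of bn_update, following the parameter description above; the tiny model and synthetic DataLoader are illustrative assumptions.

import torch
from torch.utils.data import DataLoader, TensorDataset
from pytext.optimizer.swa import StochasticWeightAveraging

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
loader = DataLoader(TensorDataset(torch.randn(32, 4)), batch_size=8)

# One pass over the loader to re-estimate the BatchNorm running statistics.
StochasticWeightAveraging.bn_update(loader, model)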
finalize()[source]¶
Swaps the values of the optimized variables and the SWA buffers.
It is meant to be called at the end of training to use the collected SWA running averages. It can also be used to evaluate the running averages during training; to continue training, swap_swa_sgd should be called again.
classmethod from_config(config: pytext.optimizer.swa.StochasticWeightAveraging.Config, model: torch.nn.modules.module.Module)[source]¶
load_state_dict(state_dict)[source]¶
Loads the optimizer state.
Parameters: state_dict (dict) – SWA optimizer state. Should be an object returned from a call to state_dict.
state_dict()[source]¶
Returns the state of SWA as a dict.
It contains three entries:
- opt_state – a dict holding the current optimization state of the base optimizer. Its content differs between optimizer classes.
- swa_state – a dict containing the current state of SWA. For each optimized variable it contains a swa_buffer keeping the running average of the variable.
- param_groups – a dict containing all parameter groups.
step(closure=None, **kwargs)[source]¶
Performs a single optimization step.
In automatic mode, also updates the SWA running averages.
update_swa_group(group)[source]¶
Updates the SWA running averages for the given parameter group.
Parameters: param_group (dict) – Specifies for which parameter group the SWA running averages should be updated.
Examples
>>> # automatic mode
>>> base_opt = torch.optim.SGD([{'params': [x]},
>>>                             {'params': [y], 'lr': 1e-3}],
>>>                            lr=1e-2, momentum=0.9)
>>> opt = torchcontrib.optim.SWA(base_opt)
>>> for i in range(100):
>>>     opt.zero_grad()
>>>     loss_fn(model(input), target).backward()
>>>     opt.step()
>>>     if i > 10 and i % 5 == 0:
>>>         # Update SWA for the second parameter group
>>>         opt.update_swa_group(opt.param_groups[1])
>>> opt.swap_swa_sgd()