Robust Training of Neural Networks using Scale Invariant Architectures

by Zhiyuan Li et al.

In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, this adaptivity not only comes at the cost of extra memory but also raises a fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we answer this question affirmatively by proposing a general recipe for robust and memory-efficient training: (1) modify the architecture to make it scale invariant, i.e. the scale of the parameters does not affect the output of the network; (2) train with SGD and weight decay; and optionally (3) clip the global gradient norm proportionally to the weight norm multiplied by √(2λη), where η is the learning rate and λ is the weight decay coefficient. We show that this general approach is robust to rescaling of the parameters and the loss by proving that its convergence depends only logarithmically on the scale of initialization and loss, whereas standard SGD may not even converge for many initializations. Following our recipe, we design a scale invariant version of BERT, called SIBERT, which, when trained simply with vanilla SGD, achieves performance on downstream tasks comparable to BERT trained with adaptive methods like Adam.
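Step (3) of the recipe — clipping the global gradient norm relative to the weight norm — can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the function name `clipped_sgd_step` and the use of decoupled weight decay are assumptions for the sake of a self-contained example; only the clipping threshold ‖w‖·√(2λη) comes from the abstract.

```python
import numpy as np

def clipped_sgd_step(w, grad, lr, wd):
    """One SGD step with weight decay and relative gradient clipping.

    The global gradient norm is clipped to ||w|| * sqrt(2 * wd * lr),
    the threshold stated in step (3) of the recipe. Weight decay is
    applied here in decoupled form (an assumption of this sketch).
    """
    # clipping threshold proportional to the current weight norm
    max_norm = np.linalg.norm(w) * np.sqrt(2.0 * wd * lr)
    gnorm = np.linalg.norm(grad)
    if gnorm > max_norm:
        grad = grad * (max_norm / gnorm)  # rescale, preserving direction
    # gradient step plus weight decay
    return w - lr * (grad + wd * w)
```

A large gradient therefore moves the weights by at most roughly η·‖w‖·√(2λη) per step, which is what makes the update insensitive to an overall rescaling of the loss.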




Related Papers

- Stable Weight Decay Regularization
- The Implicit Biases of Stochastic Gradient Descent on Deep Neural Networks with Batch Normalization
- Decaying momentum helps neural network training
- Adaptive Weight Decay: On The Fly Weight Decay Tuning for Improving Robustness
- Strong Lottery Ticket Hypothesis with ε-perturbation
- Rotational Optimizers: Simple Robust DNN Training
- How to Fine-Tune Vision Models with SGD
