Dual Averaging is Surprisingly Effective for Deep Learning Optimization

10/20/2020
by Samy Jelassi, et al.

First-order stochastic optimization methods are currently the most widely used class of methods for training deep neural networks. However, the choice of optimizer has become an ad-hoc rule that can significantly affect performance. For instance, SGD with momentum (SGD+M) is typically used in computer vision (CV), while Adam is used for training transformer models in natural language processing (NLP). Using the wrong method can lead to significant performance degradation. Inspired by the dual averaging algorithm, we propose Modernized Dual Averaging (MDA), an optimizer that performs as well as SGD+M in CV and as Adam in NLP. Our method is not adaptive and is significantly simpler than Adam. We show that MDA induces a decaying uncentered L_2-regularization compared to vanilla SGD+M and hypothesize that this may explain why it works on NLP problems where SGD+M fails.
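
To make the family of methods concrete, below is a minimal Python sketch of classical (Nesterov-style) dual averaging, the scheme that MDA builds on; it is not the exact MDA update from the paper. The function name dual_averaging, the prox-weight schedule beta_k = gamma * sqrt(k + 1), the quadratic prox anchored at the initialization x0, and the toy quadratic objective in the usage example are illustrative assumptions. The anchored term (beta_k / 2) * ||x - x0||^2 is an uncentered quadratic penalty whose relative influence shrinks as gradients accumulate, which loosely mirrors the decaying uncentered L_2-regularization mentioned in the abstract.

```python
import numpy as np

def dual_averaging(grad_fn, x0, steps=1000, gamma=1.0):
    """Sketch of Nesterov-style (stochastic) dual averaging.

    Accumulate all past gradients in z and re-solve a quadratic model
    anchored at the initial point x0:
        x_{k+1} = argmin_x  <z_k, x> + (beta_k / 2) * ||x - x0||^2
                = x0 - z_k / beta_k
    The schedule beta_k = gamma * sqrt(k + 1) and the absence of momentum
    are illustrative choices, not the MDA update from the paper.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    z = np.zeros_like(x0)                 # running sum of gradients
    for k in range(steps):
        z += grad_fn(x)                   # (stochastic) gradient at x_k
        beta = gamma * np.sqrt(k + 1)     # increasing prox weight
        x = x0 - z / beta                 # closed-form prox step
    return x

# Toy usage: minimize f(x) = 0.5 * ||x - 1||^2, whose gradient is x - 1.
if __name__ == "__main__":
    x_star = dual_averaging(lambda x: x - 1.0, x0=np.zeros(3), steps=5000)
    print(x_star)                         # approaches the vector of ones
```

The closed-form prox step is what keeps plain dual averaging as cheap per iteration as SGD: only the running gradient sum needs to be stored.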

Related research

07/11/2018
Modified Regularized Dual Averaging Method for Training Sparse Convolutional Neural Networks
We proposed a modified regularized dual averaging method for training sp...

01/26/2021
Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization
We introduce MADGRAD, a novel optimization method in the family of AdaGr...

04/30/2020
Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Deep learning at scale is dominated by communication time. Distributing ...

11/16/2020
Mixing ADAM and SGD: a Combined Optimization Method
Optimization methods (optimizers) get special attention for the efficien...

03/02/2020
Iterate Averaging Helps: An Alternative Perspective in Deep Learning
Iterate averaging has a rich history in optimisation, but has only very ...

08/15/2020
Obtaining Adjustable Regularization for Free via Iterate Averaging
Regularization for optimization is a crucial technique to avoid overfitt...

05/14/2019
Robust Neural Network Training using Periodic Sampling over Model Weights
Deep neural networks provide best-in-class performance for a number of c...
