Promoting Exploration in Memory-Augmented Adam using Critical Momenta

07/18/2023
by Pranshu Malviya, et al.

Adaptive gradient-based optimizers, particularly Adam, have left their mark in training large-scale deep learning models. The strength of such optimizers is that they exhibit fast convergence while being more robust to hyperparameter choice. However, they often generalize worse than non-adaptive methods. Recent studies have tied this performance gap to flat minima selection: adaptive methods tend to find solutions in sharper basins of the loss landscape, which in turn hurts generalization. To overcome this issue, we propose a new memory-augmented version of Adam that promotes exploration towards flatter minima by using a buffer of critical momentum terms during training. Intuitively, the use of the buffer makes the optimizer overshoot outside the basin of attraction if it is not wide enough. We empirically show that our method improves the performance of several variants of Adam on standard supervised language modelling and image classification tasks.
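The abstract describes the idea only at a high level, so the following is a minimal sketch of how a momentum buffer could be folded into Adam's update. It is not the authors' exact algorithm: the class name MemoryAdamSketch, the FIFO buffer of recent first-moment estimates, and the simple averaging rule used to aggregate them are illustrative assumptions standing in for the paper's actual criterion for selecting "critical" momenta and its aggregation scheme.

```python
import numpy as np

class MemoryAdamSketch:
    """Illustrative memory-augmented Adam (a sketch, not the paper's method).

    A small buffer of past first-moment estimates ("momenta") is averaged
    with the current one when forming the update, so the step keeps a pull
    from earlier descent directions and can overshoot narrow basins.
    """

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, buffer_size=5):
        self.lr, self.eps = lr, eps
        self.beta1, self.beta2 = betas
        self.buffer_size = buffer_size      # assumed: fixed-size FIFO buffer
        self.m = self.v = None              # Adam first/second moment estimates
        self.buffer = []                    # stored momenta ("critical" selection is assumed FIFO here)
        self.t = 0

    def step(self, params, grads):
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        self.t += 1
        # Standard Adam moment updates with bias correction.
        self.m = self.beta1 * self.m + (1 - self.beta1) * grads
        self.v = self.beta2 * self.v + (1 - self.beta2) * grads ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Memory augmentation: blend the current momentum with the buffered ones.
        if self.buffer:
            m_agg = (m_hat + sum(self.buffer)) / (1 + len(self.buffer))
        else:
            m_agg = m_hat
        self.buffer.append(m_hat.copy())
        if len(self.buffer) > self.buffer_size:
            self.buffer.pop(0)
        return params - self.lr * m_agg / (np.sqrt(v_hat) + self.eps)


# Toy usage: minimise f(x) = x^2 from x = 5.
opt = MemoryAdamSketch(lr=0.1)
x = np.array([5.0])
for _ in range(200):
    x = opt.step(x, 2 * x)   # gradient of x^2
print(x)  # approaches 0
```

In this toy form the averaged momentum simply smooths the update direction; the paper's point is that retaining such a memory lets the optimizer carry enough "inertia" to escape basins that are too narrow, whereas in wide, flat basins the buffered momenta agree with the current one and the trajectory settles.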


