Lookbehind Optimizer: k steps back, 1 step forward

07/31/2023
by Gonçalo Mordido, et al.

The Lookahead optimizer improves the training stability of deep neural networks by maintaining a set of fast weights that "look ahead" to guide the descent direction. Here, we combine this idea with sharpness-aware minimization (SAM) to stabilize its multi-step variant and improve the loss-sharpness trade-off. We propose Lookbehind, which computes k gradient ascent steps ("looking behind") at each iteration and combines the resulting gradients to bias the descent step toward flatter minima. We apply Lookbehind on top of two popular sharpness-aware training methods – SAM and adaptive SAM (ASAM) – and show that our approach yields benefits across a variety of tasks and training regimes. In particular, we show increased generalization performance, greater robustness against noisy weights, and higher tolerance to catastrophic forgetting in lifelong learning settings.
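To make the procedure concrete, below is a minimal PyTorch-style sketch of a Lookbehind-style update on top of SAM, reconstructed from the abstract alone. The function name, the hyperparameters (k, rho for the SAM ascent radius, alpha for the slow-weight interpolation), the simple averaging of the k ascent-step gradients, and the reuse of a single mini-batch for all k ascent steps are illustrative assumptions, not the authors' reference implementation.

```python
# Sketch of a Lookbehind-style step on top of SAM, based only on the abstract
# above. Hyperparameters, gradient averaging, and batch handling are assumed.
import torch


def lookbehind_sam_step(model, loss_fn, data, target, base_opt,
                        k=5, rho=0.05, alpha=0.5):
    """k SAM-style ascent steps ("looking behind"), combine the gradients,
    then one descent step with Lookahead-style slow-weight interpolation."""
    slow_weights = [p.detach().clone() for p in model.parameters()]
    combined_grads = [torch.zeros_like(p) for p in model.parameters()]

    for _ in range(k):
        # Gradient at the current (perturbed) fast weights.
        base_opt.zero_grad()
        loss_fn(model(data), target).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]

        # Accumulate the gradients seen along the ascent trajectory
        # (combined here by a simple average; an assumption).
        for acc, g in zip(combined_grads, grads):
            acc.add_(g)

        # SAM ascent: move the fast weights toward higher loss within a
        # radius controlled by rho (normalized gradient step).
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        scale = rho / (grad_norm.item() + 1e-12)
        with torch.no_grad():
            for p, g in zip(model.parameters(), grads):
                p.add_(g, alpha=scale)

    with torch.no_grad():
        # Reset the fast weights and install the combined gradient, so the
        # single descent step is biased toward flatter minima.
        for p, w_slow, acc in zip(model.parameters(), slow_weights,
                                  combined_grads):
            p.copy_(w_slow)
            p.grad = acc / k
    base_opt.step()

    with torch.no_grad():
        # Lookahead-style interpolation: slow weights take a partial step
        # toward the updated fast weights for extra stability.
        for p, w_slow in zip(model.parameters(), slow_weights):
            p.copy_(w_slow + alpha * (p - w_slow))
```

In practice the ascent steps could draw fresh mini-batches, and the ASAM variant would replace the fixed-radius ascent with an adaptive, scale-invariant one; both are left out to keep the sketch short.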

Related research

07/19/2019
Lookahead Optimizer: k steps forward, 1 step back
The vast majority of successful deep neural networks are trained using v...

10/15/2020
AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients
Most popular optimizers for deep learning can be broadly categorized as ...

03/27/2018
Bayesian Gradient Descent: Online Variational Bayes Learning with Increased Robustness to Catastrophic Forgetting and Weight Pruning
We suggest a novel approach for the estimation of the posterior distribu...

04/27/2018
An improvement of the convergence proof of the ADAM-Optimizer
A common way to train neural networks is the Backpropagation. This algor...

06/13/2023
Lookaround Optimizer: k steps around, 1 step average
Weight Average (WA) is an active research topic due to its simplicity in...

10/06/2022
Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias
Gradient regularization (GR) is a method that penalizes the gradient nor...

02/12/2020
LaProp: a Better Way to Combine Momentum with Adaptive Gradient
Identifying a divergence problem in Adam, we propose a new optimizer, La...
