On the Convergence of Adam and Beyond

04/19/2019
by Sashank J. Reddi, et al.

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSProp, Adam, Adadelta, and Nadam, are based on gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause of such failures is the exponential moving average used in these algorithms. We provide an explicit example of a simple convex optimization setting in which Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of the Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with "long-term memory" of past gradients, and we propose new variants of the Adam algorithm that not only fix the convergence issues but often also lead to improved empirical performance.
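
The best-known of the proposed variants, AMSGrad, keeps a running elementwise maximum of the exponential moving average of squared gradients and divides by that maximum instead, so the effective per-coordinate step size can never grow after a rare, informative gradient. Below is a minimal NumPy sketch of that modification (not the paper's reference implementation; the hyperparameters and toy problem are illustrative assumptions, and bias correction is omitted for brevity):

    import numpy as np

    def amsgrad_step(param, grad, m, v, v_hat, lr=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
        """One AMSGrad-style update (bias correction omitted for brevity).

        Plain Adam divides by sqrt(v); keeping the elementwise maximum
        v_hat of all past v is the "long-term memory" that prevents the
        effective step size from increasing over time.
        """
        m = beta1 * m + (1 - beta1) * grad          # EMA of gradients (first moment)
        v = beta2 * v + (1 - beta2) * grad ** 2     # EMA of squared gradients (second moment)
        v_hat = np.maximum(v_hat, v)                # long-term memory: non-decreasing second moment
        param = param - lr * m / (np.sqrt(v_hat) + eps)
        return param, m, v, v_hat

    # Toy usage: minimize f(x) = x^2 with noisy gradients.
    rng = np.random.default_rng(0)
    x = np.array([5.0])
    m = v = v_hat = np.zeros_like(x)
    for _ in range(2000):
        grad = 2 * x + rng.normal(scale=0.1, size=x.shape)
        x, m, v, v_hat = amsgrad_step(x, grad, m, v, v_hat, lr=1e-2)
    print(x)  # should be close to 0
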

Related research

- Adam revisited: a weighted past gradients perspective (01/01/2021)
- BAMSProd: A Step towards Generalizing the Adaptive Optimization Methods to Deep Binary Model (09/29/2020)
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (04/11/2018)
- Divergence Results and Convergence of a Variance Reduced Version of ADAM (10/11/2022)
- ADAMT: A Stochastic Optimization with Trend Correction Scheme (01/17/2020)
- Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties (10/03/2020)
- The Role of Memory in Stochastic Optimization (07/02/2019)