Adam Can Converge Without Any Modification on Update Rules

by   Yushun Zhang, et al.

Ever since Reddi et al. (2018) pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and works well in practice. Why is there a gap between theory and practice? We point out that there is a mismatch between the settings of theory and practice: Reddi et al. (2018) pick the problem after picking the hyperparameters of Adam, i.e., (β_1, β_2), while practical applications often fix the problem first and then tune (β_1, β_2). Based on this observation, we conjecture that the empirical convergence can be theoretically justified only if we change the order of picking the problem and the hyperparameters. In this work, we confirm this conjecture. We prove that, when β_2 is large and β_1 < √(β_2) < 1, Adam converges to a neighborhood of critical points. The size of the neighborhood is proportional to the variance of the stochastic gradients. Under an extra condition (the strong growth condition), Adam converges to critical points. As β_2 increases, our convergence result can cover any β_1 ∈ [0,1), including β_1 = 0.9, the default setting in deep learning libraries. To our knowledge, this is the first result showing that Adam can converge under a wide range of hyperparameters without any modification to its update rules. Further, our analysis does not require assumptions of bounded gradients or bounded 2nd-order momentum. When β_2 is small, we further point out a large region of (β_1, β_2) where Adam can diverge to infinity. Our divergence result considers the same setting as our convergence result, indicating a phase transition from divergence to convergence as β_2 increases. These positive and negative results provide suggestions on how to tune the hyperparameters of Adam.
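For reference, the unmodified Adam update rule that the paper analyzes can be sketched as below. The function name and the quadratic toy problem are illustrative, not from the paper; the point is that the default hyperparameters (β_1, β_2) = (0.9, 0.999) already satisfy the paper's convergence condition β_1 < √(β_2) < 1.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One vanilla Adam update (Kingma & Ba, 2015) -- no modification to the rules."""
    m = beta1 * m + (1 - beta1) * grad        # 1st-order momentum
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd-order momentum
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# The paper's convergence regime requires beta1 < sqrt(beta2) < 1;
# the library default (0.9, 0.999) satisfies it: 0.9 < sqrt(0.999) ≈ 0.9995.
assert 0.9 < np.sqrt(0.999) < 1
```

As a quick sanity check, running this update on the (noiseless) objective f(θ) = θ² drives θ toward the critical point 0, consistent with the convergence result in this regime.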

