On Avoiding Local Minima Using Gradient Descent With Large Learning Rates

by Amirkeivan Mohtashami, et al.

It has been widely observed in the training of neural networks that a large step size is essential for obtaining superior models when applying gradient descent (GD). However, the effect of large step sizes on the success of GD is not well understood theoretically. We argue that a complete understanding of the mechanics behind GD's success may indeed require considering the effects of a large step size. To support this claim, we prove, for a certain class of functions, that GD with a large step size follows a different trajectory than GD with a small step size, leading to convergence to the global minimum. We also demonstrate the difference between the two trajectories when GD is applied to a neural network, observing an escape from a local minimum with a large step size, which shows that this behavior is indeed relevant in practice. Finally, through a novel set of experiments, we show that even though stochastic noise is beneficial, it is not enough to explain the success of SGD, and that a large learning rate is essential for obtaining the best performance even in stochastic settings.
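The escape mechanism the abstract describes can be sketched on a toy one-dimensional loss (this construction is ours for illustration, not the class of functions from the paper): a sharp local minimum with curvature L next to a flat global minimum. GD with step size above 2/L cannot settle in the sharp well and is thrown into the flat basin, while a small step size converges to whichever minimum is nearest.

```python
# Toy 1-D loss (illustrative only, not the paper's construction):
# a sharp local minimum at x = 1 (value 0.5, curvature 10) and a
# flat global minimum at x = -2 (value 0, curvature 0.4), glued
# together as a pointwise minimum of two parabolas.

def f(x):
    sharp = 5.0 * (x - 1.0) ** 2 + 0.5   # sharp well, f'' = 10
    flat = 0.2 * (x + 2.0) ** 2          # flat well,  f'' = 0.4
    return min(sharp, flat)

def grad(x):
    # Gradient of whichever branch is active (lower) at x.
    if 5.0 * (x - 1.0) ** 2 + 0.5 < 0.2 * (x + 2.0) ** 2:
        return 10.0 * (x - 1.0)
    return 0.4 * (x + 2.0)

def gd(x, lr, steps=200):
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Same starting point, two step sizes:
x_small = gd(1.4, lr=0.05)  # 0.05 < 2/10: settles in the sharp local min, x ≈ 1
x_large = gd(1.4, lr=0.25)  # 0.25 > 2/10: ejected from the sharp well, x ≈ -2
```

The large step size is unstable only in the sharp well (2/10 = 0.2 < 0.25 < 2/0.4 = 5), so the iterates overshoot out of the local basin and then contract smoothly toward the flat global minimum, reaching a strictly lower loss than the small-step trajectory.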

