On the insufficiency of existing momentum schemes for Stochastic Optimization

03/15/2018
by Rahul Kidambi et al.

Momentum-based stochastic gradient methods such as heavy ball (HB) and Nesterov's accelerated gradient descent (NAG) are widely used in practice for training deep networks and other supervised learning models, as they often provide significant improvements over stochastic gradient descent (SGD). Rigorously speaking, "fast gradient" methods have provable improvements over gradient descent only in the deterministic case, where the gradients are exact. In the stochastic case, the popular explanation for their wide applicability is that these fast gradient methods partially mimic their exact-gradient counterparts, resulting in some practical gain. This work provides a counterpoint to that belief by proving that there exist simple problem instances where these methods cannot outperform SGD despite the best setting of their parameters. These negative problem instances are, in an informal sense, generic; they do not look like carefully constructed pathological instances. These results, along with empirical evidence, suggest that the practical performance gains of HB and NAG are a by-product of mini-batching. Furthermore, this work provides a viable (and provable) alternative, which, on the same set of problem instances, significantly improves over the performance of HB, NAG, and SGD. This algorithm, referred to as Accelerated Stochastic Gradient Descent (ASGD), is a simple-to-implement stochastic algorithm based on a relatively less popular variant of Nesterov's acceleration. Extensive empirical results in this paper show that ASGD offers performance gains over HB, NAG, and SGD.
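
For readers unfamiliar with the update rules being compared, the following is a minimal sketch of the stochastic HB and NAG iterations discussed in the abstract. The ill-conditioned quadratic objective, the noise model, and the step-size and momentum values are illustrative assumptions for this sketch only; they are not the paper's experimental setup, and the sketch does not implement the paper's ASGD algorithm.

```python
# Sketch of stochastic heavy ball (HB) and Nesterov's accelerated gradient (NAG).
# Objective, noise, and hyperparameters below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
A = np.diag([1.0, 0.01])  # ill-conditioned quadratic: f(w) = 0.5 * w^T A w

def stochastic_grad(w):
    # Exact gradient plus additive noise, standing in for a stochastic gradient oracle.
    return A @ w + 0.1 * rng.standard_normal(w.shape)

def heavy_ball(w0, eta=0.1, mu=0.9, steps=500):
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        v = mu * v - eta * stochastic_grad(w)  # HB: momentum on past steps, gradient at w
        w = w + v
    return w

def nesterov(w0, eta=0.1, mu=0.9, steps=500):
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(steps):
        v = mu * v - eta * stochastic_grad(w + mu * v)  # NAG: gradient at look-ahead point
        w = w + v
    return w

w0 = np.array([10.0, 10.0])
print("HB :", heavy_ball(w0))
print("NAG:", nesterov(w0))
```

The two methods differ only in where the (stochastic) gradient is evaluated; the paper's point is that with noisy gradients, neither recursion provably improves over plain SGD on instances like the one sketched above.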

Related research

Optimal Adaptive and Accelerated Stochastic Gradient Descent (10/01/2018)
Stochastic gradient descent (SGD) methods are the most powerful optimiza...

Implicit regularization in Heavy-ball momentum accelerated stochastic gradient descent (02/02/2023)
It is well known that the finite step-size (h) in Gradient Descent (GD) ...

Experiential Robot Learning with Accelerated Neuroevolution (08/16/2018)
Derivative-based optimization techniques such as Stochastic Gradient Des...

Accelerating Stochastic Gradient Descent (04/26/2017)
There is widespread sentiment that it is not possible to effectively uti...

Data Cleansing for Models Trained with SGD (06/20/2019)
Data cleansing is a typical approach used to improve the accuracy of mac...

Fast Diffusion Model (06/12/2023)
Despite their success in real data synthesis, diffusion models (DMs) oft...

Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be (04/27/2023)
The success of the Adam optimizer on a wide array of architectures has m...
