From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent

by   Satyen Kale, et al.

Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. While a general analysis of when SGD works has been elusive, there has been a lot of recent progress in understanding the convergence of Gradient Flow (GF) on the population loss, partly due to the simplicity that a continuous-time analysis buys us. An overarching theme of our paper is providing general conditions under which SGD converges, assuming that GF on the population loss converges. Our main tool to establish this connection is a general converse Lyapunov like theorem, which implies the existence of a Lyapunov potential under mild assumptions on the rates of convergence of GF. In fact, using these potentials, we show a one-to-one correspondence between rates of convergence of GF and geometrical properties of the underlying objective. When these potentials further satisfy certain self-bounding properties, we show that they can be used to provide a convergence guarantee for Gradient Descent (GD) and SGD (even when the paths of GF and GD/SGD are quite far apart). It turns out that these self-bounding assumptions are in a sense also necessary for GD/SGD to work. Using our framework, we provide a unified analysis for GD/SGD not only for classical settings like convex losses, or objectives that satisfy PL / KL properties, but also for more complex problems including Phase Retrieval and Matrix sq-root, and extending the results in the recent work of Chatterjee 2022.


page 1

page 2

page 3

page 4


On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms

Stochastic gradient descent (SGD) algorithm is the method of choice in m...

Bayesian interpretation of SGD as Ito process

The current interpretation of stochastic gradient descent (SGD) as a sto...

On uniform-in-time diffusion approximation for stochastic gradient descent

The diffusion approximation of stochastic gradient descent (SGD) in curr...

Choosing the Sample with Lowest Loss makes SGD Robust

The presence of outliers can potentially significantly skew the paramete...

Global Convergence and Stability of Stochastic Gradient Descent

In machine learning, stochastic gradient descent (SGD) is widely deploye...

SGD Through the Lens of Kolmogorov Complexity

We prove that stochastic gradient descent (SGD) finds a solution that ac...

Stochastic Mirror Descent in Average Ensemble Models

The stochastic mirror descent (SMD) algorithm is a general class of trai...

Please sign up or login with your details

Forgot password? Click here to reset