On SGD's Failure in Practice: Characterizing and Overcoming Stalling

02/01/2017
by Vivak Patel, et al.

Stochastic Gradient Descent (SGD) is widely used in machine learning to perform empirical risk minimization efficiently; yet, in practice, SGD is known to stall before reaching the actual minimizer of the empirical risk. SGD stalling has often been attributed to its sensitivity to the conditioning of the problem; however, as we demonstrate, SGD will stall, under standard learning rates, even when applied to a simple linear regression problem with unity condition number. Thus, in this work, we demonstrate numerically and argue mathematically that stalling is a crippling and generic limitation of SGD and its variants in practice. Having established the problem of stalling, we generalize an existing framework for hedging against its effects, which (1) deters SGD and its variants from stalling, (2) still provides convergence guarantees, and (3) makes SGD and its variants more practical methods for minimization.
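
To make the stalling phenomenon concrete, here is a minimal numerical sketch (an illustration under assumed settings, not the paper's experiment: the synthetic data, the step size eta = 0.05, and the iteration count are all hypothetical choices). It builds a least-squares problem whose empirical Hessian is exactly the identity, so the condition number is one; full-gradient descent then reaches the minimizer in a single step, while constant-step SGD contracts toward the minimizer and stalls in a noise ball around it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5

# Design with X.T @ X = n * I: the empirical risk (1/(2n)) * ||X @ b - y||^2
# then has an identity Hessian, i.e. condition number exactly one.
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = np.sqrt(n) * Q
beta_true = rng.standard_normal(d)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Exact empirical risk minimizer: (X.T @ X)^{-1} @ X.T @ y = X.T @ y / n.
beta_star = X.T @ y / n

# Full-gradient descent with step 1: since the Hessian is the identity,
# a single step from the origin lands exactly on beta_star.
beta_gd = -(X.T @ (X @ np.zeros(d) - y)) / n
print("GD, 1 full step :", np.linalg.norm(beta_gd - beta_star))

# Constant-step SGD: contracts toward beta_star, then stalls in a noise
# ball whose radius is governed by the step size and the gradient noise.
beta, eta = np.zeros(d), 0.05          # eta is a hypothetical choice
for _ in range(200_000):
    i = rng.integers(n)
    beta -= eta * (X[i] @ beta - y[i]) * X[i]  # grad of 0.5*(x_i@b - y_i)^2
print("SGD, 200k steps :", np.linalg.norm(beta - beta_star))
```

With these settings the SGD iterate typically stops a visible distance from beta_star even after 200,000 steps, despite the perfect conditioning; shrinking eta shrinks the noise ball but slows progress, which is the tension that motivates hedging against stalling.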


