Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance

02/20/2021
by Hongjian Wang, et al.

Recent studies have provided both empirical and theoretical evidence that heavy tails can emerge in stochastic gradient descent (SGD) in various scenarios. Such heavy tails potentially result in iterates with diverging variance, which hinders the use of conventional convergence analysis techniques that rely on the existence of second-order moments. In this paper, we provide convergence guarantees for SGD under state-dependent, heavy-tailed noise with potentially infinite variance, for a class of strongly convex objectives. In the case where the p-th moment of the noise exists for some p ∈ [1,2), we first identify a condition on the Hessian, coined 'p-positive (semi-)definiteness', that leads to an interesting interpolation between positive semi-definite matrices (p=2) and diagonally dominant matrices with non-negative diagonal entries (p=1). Under this condition, we then provide a convergence rate for the distance to the global optimum in L^p. Furthermore, we provide a generalized central limit theorem, which shows that the properly scaled Polyak-Ruppert average converges weakly to a multivariate α-stable random vector. Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum without requiring any modification to either the loss function or the algorithm itself, as is typically required in robust statistics. We demonstrate the implications of our results for applications such as linear regression and generalized linear models subject to heavy-tailed data.
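To make the setting concrete, below is a minimal simulation sketch (in Python, not taken from the paper) of SGD on a strongly convex quadratic objective with symmetric heavy-tailed gradient noise of tail index alpha ∈ (1,2), so the noise has a finite mean but infinite variance. It compares the last iterate with the Polyak-Ruppert average of the iterates. The dimension, step-size schedule, noise model (state-independent, additive, symmetrized Pareto), and all names are illustrative assumptions rather than the authors' experimental setup.

```python
# Minimal sketch: SGD under heavy-tailed noise with infinite variance.
# Objective: f(x) = 0.5 (x - x*)' A (x - x*), strongly convex.
# Noise: symmetrized Pareto with tail index alpha in (1, 2) -> infinite variance.
# All parameter choices are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

d = 5                      # dimension (assumed)
alpha = 1.5                # tail index of the noise, in (1, 2)
A = 2.0 * np.eye(d)        # Hessian of the quadratic objective
x_star = np.ones(d)        # global optimum

def noisy_grad(x):
    """Exact gradient of f plus symmetric heavy-tailed noise."""
    # Symmetrized Pareto noise: finite p-th moment only for p < alpha.
    noise = rng.pareto(alpha, size=d) * rng.choice([-1.0, 1.0], size=d)
    return A @ (x - x_star) + noise

n_iter = 200_000
x = np.zeros(d)
running_avg = np.zeros(d)

for k in range(1, n_iter + 1):
    eta_k = 0.5 / k                        # decaying step size (assumed schedule)
    x = x - eta_k * noisy_grad(x)
    running_avg += (x - running_avg) / k   # Polyak-Ruppert average of the iterates

print("last iterate error     :", np.linalg.norm(x - x_star))
print("averaged iterate error :", np.linalg.norm(running_avg - x_star))
```

In such a run, one typically observes that the averaged iterate is much closer to the optimum than the last iterate, in line with the averaging results described above, although individual runs can be strongly affected by rare large noise realizations.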
