On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

11/29/2019
by Umut Şimşekli, et al.

The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large-data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT, which suggests that the GN converges to a heavy-tailed α-stable random vector, where the tail-index α determines the heavy-tailedness of the distribution. Accordingly, we propose to analyze SGD as a discretization of an SDE driven by a Lévy motion. Such SDEs can incur 'jumps', which force the SDE and its discretization to transition from narrow minima to wider minima, as shown by existing metastability theory and the extensions that we recently established. In this study, under the α-stable GN assumption, we further establish an explicit connection between the convergence rate of SGD to a local minimum and the tail-index α. To validate the α-stable assumption, we conduct experiments on common deep learning scenarios and show that in all settings, the GN is highly non-Gaussian and exhibits heavy tails. We investigate the tail behavior across varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
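The Lévy-driven view can be made concrete with a small simulation. The sketch below is a toy illustration, not the paper's experimental setup: it discretizes dX_t = -∇f(X_t) dt + σ dL_t^α with the Euler scheme x_{k+1} = x_k - η ∇f(x_k) + σ η^{1/α} S_α, draws symmetric α-stable noise via the Chambers-Mallows-Stuck method, and compares how often the iterate crosses between the two basins of a toy double-well loss for α = 2 (the Gaussian limit) versus α = 1.6. The loss, step size, noise scale, and clipping range are arbitrary choices made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def symmetric_alpha_stable(alpha, size):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method.
    alpha = 2 recovers a (scaled) Gaussian; smaller alpha means heavier tails."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform phase
    w = rng.exponential(1.0, size)                 # unit-mean exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def run_levy_sgd(grad, x0, eta, sigma, alpha, n_steps):
    """Euler discretization of dX_t = -grad f(X_t) dt + sigma dL_t^alpha:
    x_{k+1} = x_k - eta * grad(x_k) + sigma * eta**(1/alpha) * S_alpha."""
    x, path = float(x0), np.empty(n_steps)
    for k in range(n_steps):
        jump = symmetric_alpha_stable(alpha, 1)[0]
        x = x - eta * grad(x) + sigma * eta ** (1.0 / alpha) * jump
        x = np.clip(x, -10.0, 10.0)  # keep the explicit Euler step stable after very large jumps
        path[k] = x
    return path

# Toy double-well loss f(x) = x**4 / 4 - x**2 / 2 with minima at x = -1 and x = +1.
grad = lambda x: x ** 3 - x

for a in (2.0, 1.6):  # Gaussian limit vs. heavy-tailed noise
    path = run_levy_sgd(grad, x0=1.0, eta=0.01, sigma=0.2, alpha=a, n_steps=20_000)
    crossings = int(np.sum(np.diff(np.sign(path)) != 0))  # transitions between the two basins
    print(f"alpha = {a}: {crossings} basin crossings, final x = {path[-1]:+.2f}")
```

Under these illustrative settings, the heavy-tailed run typically jumps between basins far more often than the Gaussian run, which is the qualitative behavior the metastability results describe.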

Related research

01/18/2019  A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks
06/21/2019  First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise
05/05/2021  Understanding Long Range Memory Effects in Deep Neural Networks
03/05/2023  Revisiting the Noise Model of Stochastic Gradient Descent
05/13/2022  Heavy-Tail Phenomenon in Decentralized SGD
02/08/2021  Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
11/11/2021  Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization
