Heavy-Tail Phenomenon in Decentralized SGD

05/13/2022
by Mert Gurbuzbalaban, et al.

Recent theoretical studies have shown that heavy tails can emerge in stochastic optimization due to "multiplicative noise", even in surprisingly simple settings such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy tails in decentralized stochastic gradient descent (DE-SGD) and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To obtain more explicit control over the tail exponent, we then consider the case where the loss at each node is a quadratic, and show that the tail-index can be estimated as a function of the step-size, the batch-size, and the topological properties of the network of computational nodes. Next, we provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD, where nodes distribute the data but do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (step-size and network size) in which DE-SGD can have lighter or heavier tails than disconnected SGD, depending on the regime. Finally, to support our theoretical results, we provide numerical experiments conducted on both synthetic data and neural networks.
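
To make the setup concrete, below is a minimal Python sketch of DE-SGD in the quadratic (least-squares) case with synthetic Gaussian data, together with a Hill estimator for the tail-index of the stationary iterates. The ring topology, mixing weights, step-size, batch-size, and the use of the Hill estimator are illustrative assumptions for this sketch, not parameters or methods taken from the paper.

```python
import numpy as np


def de_sgd_quadratic(n_nodes=4, dim=5, eta=0.05, batch=1, n_iters=20000, seed=0):
    """Sketch of decentralized SGD (DE-SGD) on a synthetic least-squares problem.

    Each node holds its own iterate, takes a local stochastic gradient step on a
    fresh Gaussian mini-batch, then averages with its ring neighbours through a
    doubly stochastic mixing matrix W (gossip step).
    """
    rng = np.random.default_rng(seed)

    # Ring-topology mixing matrix: self-weight 1/2, each neighbour 1/4 (doubly stochastic).
    W = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        W[i, i] = 0.5
        W[i, (i - 1) % n_nodes] = 0.25
        W[i, (i + 1) % n_nodes] = 0.25

    x_true = np.ones(dim)                   # ground-truth regression parameter
    x = np.zeros((n_nodes, dim))            # one iterate per node
    norms = []
    for _ in range(n_iters):
        grads = np.empty_like(x)
        for i in range(n_nodes):
            A = rng.standard_normal((batch, dim))        # Gaussian features
            b = A @ x_true + rng.standard_normal(batch)  # noisy labels
            grads[i] = A.T @ (A @ x[i] - b) / batch      # least-squares gradient
        x = W @ (x - eta * grads)           # local SGD step followed by gossip averaging
        norms.append(np.linalg.norm(x[0]))  # track the norm of one node's iterate
    return np.array(norms)


def hill_tail_index(samples, k=500):
    """Hill estimator of the tail-index from the k largest order statistics."""
    s = np.sort(np.asarray(samples))
    top, ref = s[-k:], s[-k - 1]
    return 1.0 / np.mean(np.log(top / ref))


if __name__ == "__main__":
    norms = de_sgd_quadratic()
    burn_in = len(norms) // 2               # discard the transient before estimating the tail
    print("estimated tail-index:", hill_tail_index(norms[burn_in:]))
```

Heavier tails correspond to a smaller estimated tail-index; rerunning the sketch with a larger step-size or smaller batch, or replacing the mixing matrix by the identity (disconnected SGD) or by uniform averaging (closer to centralized SGD), gives a rough feel for the comparisons studied in the paper.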


