Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks

06/07/2021
by Melih Barsbey, et al.

Neural network compression techniques have become increasingly popular as they can drastically reduce the storage and computation requirements of very large networks. Recent empirical studies have illustrated that even simple pruning strategies can be surprisingly effective, and several theoretical studies have shown that compressible networks (in specific senses) should achieve a low generalization error. Yet, a theoretical characterization of the underlying cause that makes networks amenable to such simple compression schemes is still missing. In this study, we address this fundamental question and reveal that the dynamics of the training algorithm play a key role in obtaining such compressible networks. Focusing on stochastic gradient descent (SGD), our main contribution is to link compressibility to two recently established properties of SGD: (i) as the network size goes to infinity, the system can converge to a mean-field limit, where the network weights behave independently, and (ii) for a large step-size/batch-size ratio, the SGD iterates can converge to a heavy-tailed stationary distribution. When these two phenomena occur simultaneously, we prove that the networks are guaranteed to be 'ℓ_p-compressible', and the compression errors of different pruning techniques (magnitude, singular value, or node pruning) become arbitrarily small as the network size increases. We further prove generalization bounds adapted to our theoretical framework, which confirm that the generalization error will be lower for more compressible networks. Our theory and numerical study on various neural networks show that large step-size/batch-size ratios introduce heavy tails, which, in combination with overparametrization, result in compressibility.
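
As a rough illustration of the ℓ_p-compressibility argument, the sketch below (not the authors' code; the Student-t and Gaussian distributions, the dimension, and the 10% keep ratio are illustrative assumptions) compares the relative ℓ_2 error of magnitude pruning on a heavy-tailed versus a light-tailed weight vector. A heavy-tailed vector concentrates most of its ℓ_2 norm in a few large coordinates, so discarding the vast majority of its entries removes only a small fraction of its energy.

```python
# Minimal sketch: magnitude pruning error for heavy-tailed vs. Gaussian weights.
# All distributions and constants below are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)

def magnitude_prune_error(w: np.ndarray, keep_ratio: float) -> float:
    """Zero out all but the largest-magnitude entries of w and return the
    relative l_2 error ||w - w_pruned||_2 / ||w||_2."""
    k = max(1, int(keep_ratio * w.size))
    pruned = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-k:]        # indices of the k largest-magnitude entries
    pruned[idx] = w[idx]
    return np.linalg.norm(w - pruned) / np.linalg.norm(w)

d, keep = 100_000, 0.10                      # "network width" and fraction of weights kept

light = rng.standard_normal(d)               # light-tailed (Gaussian) weights
heavy = rng.standard_t(df=1.5, size=d)       # heavy-tailed weights (infinite variance for df < 2)

print(f"Gaussian:     relative pruning error at {keep:.0%} kept = {magnitude_prune_error(light, keep):.3f}")
print(f"Heavy-tailed: relative pruning error at {keep:.0%} kept = {magnitude_prune_error(heavy, keep):.3f}")
```

With these settings, the heavy-tailed vector typically shows a much smaller relative error than the Gaussian one at the same keep ratio, mirroring the paper's conclusion that heavy tails combined with overparametrization yield compressibility.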


Related research

06/13/2023  Implicit Compressibility of Overparametrized Neural Networks Trained with Heavy-Tailed SGD
    Neural network compression has been an increasingly important subject, d...

06/08/2020  The Heavy-Tail Phenomenon in SGD
    In recent years, various notions of capacity and complexity have been pr...

05/13/2022  Heavy-Tail Phenomenon in Decentralized SGD
    Recent theoretical studies have shown that heavy-tails can emerge in sto...

04/18/2018  A Mean Field View of the Landscape of Two-Layers Neural Networks
    Multi-layer neural networks are among the most powerful models in machin...

06/09/2021  Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms
    Understanding generalization in deep learning has been one of the major ...

04/27/2023  Noise Is Not the Main Factor Behind the Gap Between SGD and Adam on Transformers, but Sign Descent Might Be
    The success of the Adam optimizer on a wide array of architectures has m...

02/01/2021  Painless step size adaptation for SGD
    Convergence and generalization are two crucial aspects of performance in...
