On the Distributional Properties of Adaptive Gradients

05/15/2021
by   Zhang Zhiyi, et al.

Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, little is known about the mathematical and statistical properties of this family of methods. This work provides a series of theoretical analyses of their statistical properties, supported by experiments. In particular, we show that when the underlying gradient follows a normal distribution, the variance of the magnitude of the update is an increasing, bounded function of time and does not diverge. This suggests that divergence of the variance is not the reason the Adam optimizer needs warm-up, contrary to what is believed in the current literature.
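The central claim lends itself to a quick numerical check. The sketch below is a minimal Monte Carlo illustration, not the paper's analysis: it draws i.i.d. standard normal gradients, runs Adam's bias-corrected moment updates, and tracks the empirical variance of the update magnitude across many independent runs. Under the stated assumption, this variance should increase with the step count but stay bounded. The function name and all parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

def adam_update_magnitude_variance(steps=200, runs=10000,
                                   beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Empirical variance of |m_hat / (sqrt(v_hat) + eps)| at each step,
    estimated over many independent runs with gradients drawn from N(0, 1)."""
    rng = np.random.default_rng(seed)
    m = np.zeros(runs)   # first-moment estimates, one per independent run
    v = np.zeros(runs)   # second-moment estimates
    var_by_step = []
    for t in range(1, steps + 1):
        g = rng.standard_normal(runs)      # gradient ~ N(0, 1), the assumed distribution
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)         # Adam's bias-corrected moments
        v_hat = v / (1 - beta2**t)
        update = np.abs(m_hat / (np.sqrt(v_hat) + eps))
        var_by_step.append(update.var())   # empirical variance of the update magnitude
    return np.array(var_by_step)

if __name__ == "__main__":
    var_t = adam_update_magnitude_variance()
    # The claim under test: the variance grows with t but remains bounded.
    print("variance at t=1:  ", var_t[0])
    print("variance at t=200:", var_t[-1])
    print("max over trajectory:", var_t.max())
```

Plotting `var_t` against the step index should show a curve that rises and then flattens rather than blowing up, which is the behavior the abstract describes.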


Related research

03/03/2022 · AdaFamily: A family of Adam-like adaptive gradient methods
We propose AdaFamily, a novel method for training deep neural networks. ...

10/27/2020 · A Statistical Framework for Low-bitwidth Training of Deep Neural Networks
Fully quantized training (FQT), which uses low-bitwidth hardware by quan...

07/09/2020 · A Study of Gradient Variance in Deep Learning
The impact of gradient noise on training deep models is widely acknowled...

02/07/2018 · Gradient conjugate priors and deep neural networks
The paper deals with learning the probability distribution of the observ...

04/27/2023 · Convergence of Adam Under Relaxed Assumptions
In this paper, we provide a rigorous proof of convergence of the Adaptiv...

10/11/2022 · Divergence Results and Convergence of a Variance Reduced Version of ADAM
Stochastic optimization algorithms using exponential moving averages of ...
