On the Distributional Properties of Adaptive Gradients
Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, not much is known about the mathematical and statistical properties of this family of methods. This work aims at providing a series of theoretical analyses of its statistical properties justified by experiments. In particular, we show that when the underlying gradient obeys a normal distribution, the variance of the magnitude of the update is an increasing and bounded function of time and does not diverge. This work suggests that the divergence of variance is not the cause of the need for warm up of the Adam optimizer, contrary to what is believed in the current literature.
READ FULL TEXT