A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta

06/22/2022
by Maksim Velikanov, et al.

Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models. In this paper we develop a new analytic framework to analyze mini-batch SGD for linear models at different momenta and batch sizes. Our key idea is to describe the sequence of loss values in terms of its generating function, which can be written in a compact form under a diagonal approximation for the second moments of the model weights. By analyzing this generating function, we deduce various conclusions on the convergence conditions, the phase structure of the model, and the optimal learning settings. As a few examples, we show that 1) the optimization trajectory can generally switch from the "signal-dominated" to the "noise-dominated" phase, at a time scale that can be predicted analytically; 2) in the "signal-dominated" (but not the "noise-dominated") phase it is favorable to choose a large effective learning rate, but its value must be limited for any finite batch size to avoid divergence; 3) the optimal convergence rate can be achieved with a negative momentum. We verify our theoretical predictions by extensive experiments with MNIST and synthetic problems, and find good quantitative agreement.
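To make the setting concrete, here is a minimal sketch of the algorithm the paper studies: mini-batch SGD with heavy-ball momentum on a linear least-squares model, where the momentum coefficient may be negative. This is an illustrative assumption of the setup, not the paper's analytic framework or experimental configuration; the data, hyperparameter values, and function names are placeholders.

    import numpy as np

    # Minimal sketch: mini-batch SGD with heavy-ball momentum on a linear
    # least-squares problem. The momentum coefficient `beta` is allowed to be
    # negative, as the paper argues can be optimal. All values are illustrative.
    rng = np.random.default_rng(0)
    n, d = 1000, 20
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)

    def sgd_momentum(X, y, lr=0.05, beta=-0.2, batch_size=32, steps=2000):
        n, d = X.shape
        w = np.zeros(d)            # model weights
        v = np.zeros(d)            # momentum buffer (heavy-ball velocity)
        losses = []
        for _ in range(steps):
            idx = rng.choice(n, size=batch_size, replace=False)
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
            v = beta * v - lr * grad   # heavy-ball update; beta < 0 is allowed
            w = w + v
            losses.append(0.5 * np.mean((X @ w - y) ** 2))
        return w, losses

    w_hat, losses = sgd_momentum(X, y)
    print(f"final loss: {losses[-1]:.4f}")

Sweeping `beta` over negative and positive values (and varying `lr` and `batch_size`) and comparing the resulting loss curves gives an empirical counterpart to the kind of comparison the paper carries out analytically via the generating function.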


Related research

06/02/2022
Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions
We analyze the dynamics of large batch stochastic gradient descent with ...

12/02/2021
On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective
We study the statistical properties of the dynamic trajectory of stochas...

05/03/2020
Adaptive Learning of the Optimal Mini-Batch Size of SGD
Recent advances in the theoretical understanding of SGD (Qian et al., 201...

04/26/2019
Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources
With an increasing demand for training powers for deep learning algorith...

01/27/2019
SGD: General Analysis and Improved Rates
We propose a general yet simple theorem describing the convergence of SG...

06/02/2017
Streaming Bayesian inference: theoretical limits and mini-batch approximate message-passing
In statistical learning for real-world large-scale data problems, one mu...
