Batch Size Matters: A Diffusion Approximation Framework on Nonconvex Stochastic Gradient Descent

05/22/2017
by Chris Junchi Li et al.

In this paper, we study the stochastic gradient descent (SGD) method for nonconvex statistical optimization problems from a diffusion approximation point of view. Using the theory of large deviations for random dynamical systems, we prove the following in the small-stepsize regime and in the presence of omnidirectional noise: starting from a local minimizer (resp. saddle point), the SGD iteration escapes in a number of iterations that depends exponentially (resp. linearly) on the inverse stepsize. We take deep neural networks as an example to study this phenomenon. Based on a new analysis of the mixing rate of multidimensional Ornstein-Uhlenbeck processes, our theory substantiates the very recent empirical results of Keskar et al. (2016), suggesting that large batch sizes in synchronous training of deep neural networks lead to poor generalization error.
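The linear-in-inverse-stepsize escape behavior near a saddle point is easy to probe numerically. Below is a minimal sketch (not the paper's code) that runs SGD with isotropic Gaussian gradient noise on the toy objective f(x, y) = x^4/4 - x^2/2 + y^2/2, which has a saddle at the origin and minimizers at (±1, 0); the threshold, noise level, and stepsizes are illustrative choices, not values from the paper.

```python
# Illustrative sketch: SGD escaping a saddle point under omnidirectional noise.
# Objective f(x, y) = x^4/4 - x^2/2 + y^2/2 has a saddle at the origin and
# minimizers at (+-1, 0). We count the iterations SGD started at the saddle
# needs before |x| exceeds a threshold, for several stepsizes eta.

import numpy as np

def grad(z):
    x, y = z
    return np.array([x**3 - x, y])

def escape_time(eta, sigma=0.1, threshold=0.5, max_iter=10**7, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    z = np.zeros(2)                                  # start exactly at the saddle
    for t in range(1, max_iter + 1):
        noise = sigma * rng.standard_normal(2)       # omnidirectional gradient noise
        z = z - eta * (grad(z) + noise)
        if abs(z[0]) > threshold:                    # left the saddle neighborhood
            return t
    return max_iter

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    for eta in [0.1, 0.05, 0.02, 0.01, 0.005]:
        times = [escape_time(eta, rng=rng) for _ in range(20)]
        # The theory predicts escape in roughly eta^{-1} * log(eta^{-1}) iterations,
        # i.e. essentially linear in the inverse stepsize up to a log factor.
        print(f"eta={eta:6.3f}  mean escape iterations ~ {np.mean(times):10.1f}")
```

Escape from a local minimizer, by contrast, takes a number of iterations exponential in the inverse stepsize, which is far too long to observe in such a cheap simulation; the sketch therefore only illustrates the saddle-point case.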


