Logarithmic landscape and power-law escape rate of SGD

05/20/2021 · by Takashi Mori, et al.

Stochastic gradient descent (SGD) is subject to complicated multiplicative noise for the mean-square loss. We use this property of the SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a non-uniform transformation of the time variable. In the SDE, the gradient of the loss is replaced by that of the logarithmized loss. Consequently, we show that, near a local or global minimum, the stationary distribution P_ss(θ) of the network parameters θ follows a power law with respect to the loss function L(θ), i.e., P_ss(θ) ∝ L(θ)^{-ϕ}, with the exponent ϕ determined by the mini-batch size, the learning rate, and the Hessian at the minimum. We derive a formula for the escape rate from a local minimum, which is governed not by the loss barrier height ΔL = L(θ^s) - L(θ^*) between a minimum θ^* and a saddle θ^s, but by the logarithmized loss barrier height Δlog L = log[L(θ^s)/L(θ^*)]. Our escape-rate formula explains the empirical fact that SGD prefers flat minima with low effective dimensions.

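The power-law claim can be probed numerically. Below is a minimal sketch, not the paper's code: it runs plain SGD on a toy 1-D linear regression with mean-square loss and fits the slope of a log-log histogram of the sampled losses. The data size, noise level, learning rate, and batch size are illustrative choices, and the fitted slope is only a rough proxy for the exponent ϕ, since changing variables from θ to L introduces a Jacobian factor.

```python
import numpy as np

# Minimal sketch (not the paper's code): plain SGD on a 1-D mean-square loss
# L(theta) = (1/n) * sum_i (theta * x_i - y_i)^2, used to probe the claimed
# power-law behaviour of the stationary distribution near the minimum.
# Data size, noise level, learning rate, and batch size are illustrative.

rng = np.random.default_rng(0)

n, theta_true = 2_000, 1.0
x = rng.normal(size=n)
y = theta_true * x + 0.1 * rng.normal(size=n)   # labels with small noise

def full_loss(theta):
    """Full-batch mean-square loss."""
    return np.mean((theta * x - y) ** 2)

def minibatch_grad(theta, batch_size=8):
    """Gradient of the mean-square loss on a random mini-batch."""
    idx = rng.integers(0, n, size=batch_size)
    xb, yb = x[idx], y[idx]
    return 2.0 * np.mean((theta * xb - yb) * xb)

eta, burn_in, steps = 0.05, 10_000, 100_000
theta, losses = 0.0, []
for t in range(steps):
    theta -= eta * minibatch_grad(theta)
    if t >= burn_in:                             # discard the transient
        losses.append(full_loss(theta))

# If the stationary distribution has a power-law form, the log-log histogram
# of the sampled losses should be close to a straight line; its slope is only
# a rough proxy for phi because of the Jacobian of the theta -> L change of
# variables.
hist, edges = np.histogram(
    losses, bins=np.geomspace(min(losses), max(losses), 40), density=True
)
centers = np.sqrt(edges[:-1] * edges[1:])
good = hist > 0
slope, _ = np.polyfit(np.log(centers[good]), np.log(hist[good]), 1)
print(f"log-log slope of the loss histogram: {slope:.2f}")
```

Mini-batches are drawn with replacement for simplicity; near the minimum the gradient noise scales with the loss itself, which is the multiplicative (state-dependent) character of SGD noise that the abstract exploits.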

Related research

07/06/2022 · When does SGD favor flat minima? A quantitative characterization via linear stability
The observation that stochastic gradient descent (SGD) favors flat minim...

06/02/2022 · Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions
Generalization is one of the most important problems in deep learning (D...

05/27/2021 · The Sobolev Regularization Effect of Stochastic Gradient Descent
The multiplicative structure of parameters and input data in the first l...

12/07/2020 · Stochastic Gradient Descent with Large Learning Rate
As a simple and efficient optimization method in deep learning, stochast...

08/21/2021 · How Can Increased Randomness in Stochastic Gradient Descent Improve Generalization?
Recent works report that increasing the learning rate or decreasing the ...

05/24/2023 · Local SGD Accelerates Convergence by Exploiting Second Order Information of the Loss Function
With multiple iterations of updates, local statistical gradient descent ...

12/28/2018 · A continuous-time analysis of distributed stochastic gradient
Synchronization in distributed networks of nonlinear dynamical systems p...
