PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization

by Zhize Li, et al.

In this paper, we propose a novel stochastic gradient estimator, the ProbAbilistic Gradient Estimator (PAGE), for nonconvex optimization. PAGE is easy to implement because it is a small modification of vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability p, or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability 1-p. We give a simple formula for the optimal choice of p. We prove tight lower bounds for nonconvex problems, which are of independent interest. Moreover, we prove matching upper bounds in both the finite-sum and online regimes, which establish that PAGE is an optimal method. In addition, we show that for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, PAGE automatically switches to a faster linear convergence rate. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch. The results demonstrate that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating our theoretical results and confirming the practical superiority of PAGE.
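The update rule described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the abstract does not spell out the "small adjustment", so the sketch assumes it is a gradient-difference correction computed on a smaller minibatch, and it assumes the default p = b_small / (b + b_small). The toy least-squares objective and the names `minibatch_grad`, `page`, `b`, and `b_small` are hypothetical and chosen only for this example.

```python
import numpy as np


def minibatch_grad(x, A, y, idx):
    """Average gradient of 0.5 * (a_i @ x - y_i)**2 over the rows in idx."""
    residual = A[idx] @ x - y[idx]
    return A[idx].T @ residual / len(idx)


def page(A, y, x0, lr=0.02, b=64, b_small=8, p=None, steps=1000, seed=0):
    """Toy PAGE-style loop on a least-squares finite sum (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    if p is None:
        # Assumed default; the abstract only says a simple formula for p exists.
        p = b_small / (b + b_small)

    x = x0.copy()
    # Start from an ordinary minibatch gradient.
    g = minibatch_grad(x, A, y, rng.choice(n, size=b, replace=False))

    for _ in range(steps):
        x_new = x - lr * g
        if rng.random() < p:
            # With probability p: vanilla minibatch SGD gradient at the new point.
            idx = rng.choice(n, size=b, replace=False)
            g = minibatch_grad(x_new, A, y, idx)
        else:
            # With probability 1 - p: reuse the previous gradient, corrected by
            # a gradient difference on a smaller (cheaper) minibatch.
            # This form of the adjustment is an assumption of the sketch.
            idx = rng.choice(n, size=b_small, replace=False)
            g = g + minibatch_grad(x_new, A, y, idx) - minibatch_grad(x, A, y, idx)
        x = x_new
    return x
```

The cheaper branch evaluates gradients on only `b_small` samples, at both the new and the previous iterate, which is where the lower per-iteration cost would come from under these assumptions. Calling `page(A, y, np.zeros(A.shape[1]))` on a data matrix `A` and targets `y` returns the final iterate.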



Linear Convergence of Accelerated Stochastic Gradient Descent for Nonconvex Nonsmooth Optimization

In this paper, we study the stochastic gradient descent (SGD) method for...

Online Bootstrap Inference with Nonconvex Stochastic Gradient Descent Estimator

In this paper, we investigate the theoretical properties of stochastic g...

Better Theory for SGD in the Nonconvex World

Large-scale nonconvex optimization problems are ubiquitous in modern mac...

A Short Note of PAGE: Optimal Convergence Rates for Nonconvex Optimization

In this note, we first recall the nonconvex problem setting and introduc...

DASHA: Distributed Nonconvex Optimization with Communication Compression, Optimal Oracle Complexity, and No Client Synchronization

We develop and analyze DASHA: a new family of methods for nonconvex dist...

K-SAM: Sharpness-Aware Minimization at the Speed of SGD

Sharpness-Aware Minimization (SAM) has recently emerged as a robust tech...
