Making the Last Iterate of SGD Information Theoretically Optimal

04/29/2019
by Prateek Jain, et al.

Stochastic gradient descent (SGD) is one of the most widely used algorithms for large-scale optimization problems. While the classical theoretical analysis of SGD for convex problems studies (suffix) averages of the iterates and obtains information-theoretically optimal bounds on their suboptimality, the last iterate of SGD is, by far, the most preferred choice in practice. The best known results for the last iterate of SGD [Shamir and Zhang, 2013], however, are suboptimal compared to information-theoretic lower bounds by a log T factor, where T is the number of iterations. Harvey et al. [2018] show that this additional log T factor is in fact tight for the standard step-size sequences 1/√t and 1/t in the non-strongly convex and strongly convex settings, respectively. Similarly, even for subgradient descent (GD) applied to non-smooth convex functions, the best known step-size sequences still lead to O(log T)-suboptimal convergence rates on the final iterate. The main contribution of this work is to design new step-size sequences that enjoy information-theoretically optimal bounds on the suboptimality of the last iterate of SGD as well as GD. We achieve this by designing a modification scheme that converts one sequence of step sizes into another, so that the last iterate of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of SGD/GD with the original sequence. We also show that our result holds with high probability. We validate our results through simulations, which demonstrate that the new step-size sequence indeed improves the final iterate significantly compared to the standard step-size sequences.
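To make the idea of a modified step-size sequence concrete, the sketch below runs plain SGD with a phase-wise, geometrically decaying schedule of the kind the abstract alludes to: the horizon T is split into roughly log2(T) phases of shrinking length, and the step size is held constant within a phase and halved between phases. This is a minimal illustration only; the helper names (phase_wise_step_sizes, sgd_last_iterate), the base step size eta0, and the exact phase lengths are assumptions for this sketch and not the authors' precise construction.

```python
# Minimal sketch (not the paper's exact construction): SGD on a non-smooth
# convex problem with a phase-wise "step-decay" schedule. The horizon T is
# split into ~log2(T) phases of geometrically shrinking length; within each
# phase the step size is constant and is halved from one phase to the next.
# The constants (eta0, phase lengths) are illustrative assumptions.

import numpy as np

def phase_wise_step_sizes(T, eta0):
    """Length-T array of step sizes: phase k has length ~T/2^(k+1) and uses
    the constant step size eta0 / 2^k (assumed form, for illustration)."""
    etas = np.empty(T)
    start, k = 0, 0
    while start < T:
        length = max(1, T // 2 ** (k + 1))
        end = min(T, start + length)
        etas[start:end] = eta0 / 2 ** k
        start, k = end, k + 1
    return etas

def sgd_last_iterate(grad_fn, x0, T, eta0, rng):
    """Plain SGD that returns the last iterate; grad_fn(x, rng) is a
    stochastic (sub)gradient oracle."""
    x = np.array(x0, dtype=float)
    for eta in phase_wise_step_sizes(T, eta0):
        x -= eta * grad_fn(x, rng)
    return x

# Example: minimize E|x - z| with z ~ N(0, 1); the optimum is the median 0.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = lambda x, rng: np.sign(x - rng.normal())  # stochastic subgradient
    x_T = sgd_last_iterate(grad, x0=[5.0], T=10_000, eta0=1.0, rng=rng)
    print("last iterate:", x_T)
```

In the non-strongly convex setting the base step size would typically scale like 1/√T; comparing this schedule against the standard 1/√t sequence on the toy problem above mirrors the kind of simulation the abstract describes.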


