Function Value Learning: Adaptive Learning Rates Based on the Polyak Stepsize and Function Splitting in ERM

by Guillaume Garrigos, et al.

Here we develop variants of SGD (stochastic gradient descent) with an adaptive step size that makes use of the sampled loss values. In particular, we focus on solving a finite sum-of-terms problem, also known as empirical risk minimization. We first detail an idealized adaptive method called SPS_+ that makes use of the sampled loss values and assumes knowledge of the sampled loss at optimality. SPS_+ is a minor modification of the SPS (Stochastic Polyak Stepsize) method, where the step size is enforced to be positive. We then show that SPS_+ achieves the best known rates of convergence for SGD in the Lipschitz non-smooth setting. We then move on to develop FUVAL, a variant of SPS_+ in which the loss values at optimality are gradually learned, as opposed to being given. We give three viewpoints of FUVAL: as a projection-based method, as a variant of the prox-linear method, and as a particular online SGD method. We then present a convergence analysis of FUVAL and experimental results. One shortcoming of our work is that the convergence analysis of FUVAL shows no advantage over SGD. Another shortcoming is that currently only the full-batch version of FUVAL shows a minor advantage over GD (Gradient Descent), in terms of sensitivity to the step size; the stochastic version shows no clear advantage over SGD. We conjecture that large mini-batches are required to make FUVAL competitive. At present, the new method studied in this paper does not offer any clear theoretical or practical advantage. We have nonetheless chosen to make this draft available online because of some of the analysis techniques we use, such as the non-smooth analysis of SPS_+, and to document an apparently interesting approach that currently does not work.
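The SPS_+ update described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it assumes a toy least-squares ERM problem in which interpolation holds, so each sampled loss at optimality is zero, and it adds an upper cap `gamma_max` on the step size purely as a stabilizer (an assumption of this sketch, not part of the idealized method).

```python
import numpy as np

def sps_plus_step(x, grad, loss, loss_star, gamma_max=1.0):
    """One SGD step with the stochastic Polyak stepsize,
    clipped below at zero (the "+" in SPS_+) and capped above
    by gamma_max for stability (a choice of this sketch)."""
    g2 = grad @ grad
    if g2 == 0.0:
        return x  # stationary point of the sampled loss
    step = max(0.0, (loss - loss_star) / g2)
    return x - min(step, gamma_max) * grad

# Toy ERM: a consistent least-squares problem, so the loss of
# every individual sample is zero at the solution (f_i^* = 0).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
x_true = rng.normal(size=5)
b = A @ x_true

x = np.zeros(5)
for _ in range(2000):
    i = rng.integers(50)                 # sample one term of the sum
    r = A[i] @ x - b[i]
    loss = 0.5 * r ** 2                  # sampled loss f_i(x)
    grad = r * A[i]                      # its gradient
    x = sps_plus_step(x, grad, loss, loss_star=0.0)

print(np.linalg.norm(x - x_true))
```

On this interpolated problem the sampled loss at optimality is exactly zero, so passing `loss_star=0.0` recovers the idealized setting in which SPS_+ is analyzed; FUVAL's point of departure is precisely that these `loss_star` values are learned rather than supplied.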


