The Benefits of Implicit Regularization from SGD in Least Squares Problems

08/10/2021
by Difan Zou, et al.

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which have been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than those provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally tuned ridge regression requires quadratically more samples than SGD in order to achieve the same generalization performance. Taken together, our results show that, up to logarithmic factors, the generalization performance of SGD is never worse than that of ridge regression across a wide range of overparameterized problems, and, in fact, can be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.
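To make the comparison concrete, below is a minimal illustrative sketch (not the paper's construction or experiments): it runs one pass of unregularized, constant-stepsize SGD with iterate averaging and closed-form ridge regression on a synthetic overparameterized least squares instance and compares their excess risks. The dimensions, covariance spectrum, sample sizes, stepsize grid, and ridge parameters are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) problem sizes: overparameterized, d > n.
d, n_ridge = 500, 200
n_sgd = int(n_ridge * np.log(n_ridge))  # "logarithmically more samples" for SGD

# Synthetic instance with a fast-decaying covariance spectrum, the kind of
# high-dimensional least squares problem discussed in the abstract.
eigs = 1.0 / (np.arange(1, d + 1) ** 2)
w_star = rng.normal(size=d) / np.sqrt(d)
noise_std = 0.1

def sample(n):
    X = rng.normal(size=(n, d)) * np.sqrt(eigs)      # features with covariance diag(eigs)
    y = X @ w_star + noise_std * rng.normal(size=n)  # noisy linear responses
    return X, y

def excess_risk(w):
    # Population excess risk E[(x^T (w - w*))^2] under the diagonal covariance.
    return float(np.sum(eigs * (w - w_star) ** 2))

def ridge(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def averaged_sgd(X, y, stepsize):
    # One pass of unregularized SGD with a constant stepsize, returning the
    # average of the iterates (iterate averaging, as in the abstract).
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(len(y)):
        x_t = X[t]
        w -= stepsize * (x_t @ w - y[t]) * x_t
        w_bar += (w - w_bar) / (t + 1)
    return w_bar

X_r, y_r = sample(n_ridge)
X_s, y_s = sample(n_sgd)

# Illustrative hyperparameter grids; the paper assumes both methods are tuned.
risk_ridge = min(excess_risk(ridge(X_r, y_r, lam)) for lam in [0.01, 0.1, 1.0, 10.0])
risk_sgd = min(excess_risk(averaged_sgd(X_s, y_s, eta)) for eta in [0.05, 0.1, 0.25, 0.5])

print(f"ridge excess risk (n={n_ridge}): {risk_ridge:.4f}")
print(f"averaged SGD excess risk (n={n_sgd}): {risk_sgd:.4f}")
```

This only mirrors the qualitative setup of the comparison on one synthetic instance; the paper's results are finite-sample excess-risk bounds, not empirical measurements.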

Related research

03/23/2021 · Benign Overfitting of Constant-Stepsize SGD for Linear Regression
There is an increasing realization that algorithmic inductive biases are...

09/29/2020 · Benign overfitting in ridge regression
Classical learning theory suggests that strong regularization is needed ...

12/01/2022 · Regularization with Fake Features
Recent successes of massively overparameterized models have inspired a n...

05/28/2018 · Implicit ridge regularization provided by the minimum-norm least squares estimator when n ≪ p
A conventional wisdom in statistical learning is that large models requi...

03/07/2022 · Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime
Stochastic gradient descent (SGD) has achieved great success due to its ...

08/26/2021 · Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models?
Modern methods for learning from data depend on many tuning parameters, ...

02/10/2021 · On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes
The noise in stochastic gradient descent (SGD), caused by minibatch samp...
