Is Stochastic Gradient Descent Near Optimal?

09/18/2022
by Yifan Zhu, et al.

The success of neural networks over the past decade has established them as effective models for many relevant data-generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon and Van Roy (arXiv:2203.00246) demonstrate that, when data is generated by a ReLU teacher network with W parameters, an optimal learner needs only Õ(W/ϵ) samples to attain expected error ϵ. However, existing computational theory suggests that, even for single-hidden-layer teacher networks, achieving this sample complexity while attaining small error for all such teacher networks requires intractable computation. In this work, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that stochastic gradient descent (SGD) with automated width selection attains small expected error using a number of samples and a total number of queries that are both nearly linear in the input dimension and width. This suggests that SGD nearly achieves the information-theoretic sample complexity bounds of Jeon and Van Roy (arXiv:2203.00246) in a computationally efficient manner. An important difference between our positive empirical results and the negative theoretical results is that the latter address the worst-case error of deterministic algorithms, while our analysis centers on the expected error of a stochastic algorithm.
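Below is a minimal sketch of the teacher-student experiment the abstract describes: a single-hidden-layer ReLU teacher with randomly drawn parameters generates labels, and a student of the same architecture is fit with plain SGD on squared error. The Gaussian input distribution, fixed student width, squared-error loss, and hyperparameters are illustrative assumptions, and the paper's automated width-selection procedure is not reproduced here.

import numpy as np

rng = np.random.default_rng(0)

d, teacher_width, student_width = 20, 10, 32   # input dimension and hidden widths (illustrative)
n_samples, lr, batch = 20_000, 0.05, 32        # hand-picked, not the paper's settings

# Teacher network y = a^T relu(W x), parameters drawn from a simple random prior.
W_t = rng.normal(size=(teacher_width, d)) / np.sqrt(d)
a_t = rng.normal(size=teacher_width) / np.sqrt(teacher_width)

def teacher(x):
    # x has shape (batch, d); returns labels of shape (batch,)
    return np.maximum(W_t @ x.T, 0.0).T @ a_t

# Student network with the same single-hidden-layer ReLU architecture.
W_s = rng.normal(size=(student_width, d)) / np.sqrt(d)
a_s = rng.normal(size=student_width) / np.sqrt(student_width)

for step in range(n_samples // batch):
    x = rng.normal(size=(batch, d))            # fresh i.i.d. Gaussian inputs (assumption)
    y = teacher(x)
    h = np.maximum(x @ W_s.T, 0.0)             # student hidden activations, shape (batch, m)
    err = h @ a_s - y                           # residuals, shape (batch,)
    # Gradients of the loss 0.5 * mean(err**2) w.r.t. student parameters.
    grad_a = h.T @ err / batch
    grad_W = ((err[:, None] * (h > 0)) * a_s).T @ x / batch
    a_s -= lr * grad_a
    W_s -= lr * grad_W

# Estimate expected squared error on held-out inputs from the same teacher.
x_test = rng.normal(size=(5_000, d))
test_err = np.mean((np.maximum(x_test @ W_s.T, 0.0) @ a_s - teacher(x_test)) ** 2)
print(f"held-out squared error: {test_err:.4f}")

Running this with a student wider than the teacher and fresh samples at every step typically drives the held-out error down; the paper's point is that the total sample and query counts needed for this scale only nearly linearly in the input dimension and width.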


Related research

06/18/2019 · Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup
Deep neural networks achieve stellar generalisation even when they have ...

07/11/2023 · Fundamental limits of overparametrized shallow neural networks for supervised learning
We carry out an information-theoretical analysis of a two-layer neural n...

07/14/2017 · On the Complexity of Learning Neural Networks
The stunning empirical successes of neural networks currently lack rigor...

08/04/2022 · Feature selection with gradient descent on two-layer networks in low-rotation regimes
This work establishes low test error of gradient flow (GF) and stochasti...

03/07/2019 · Limiting Network Size within Finite Bounds for Optimization
Largest theoretical contribution to Neural Networks comes from VC Dimens...

12/02/2022 · An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws
We study the compute-optimal trade-off between model and training data s...

05/29/2023 · Escaping mediocrity: how two-layer networks learn hard single-index models with SGD
This study explores the sample complexity for two-layer neural networks ...
