Feature selection with gradient descent on two-layer networks in low-rotation regimes

08/04/2022
by Matus Telgarsky, et al.

This work establishes low test error for gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), using margins as the core analytic technique. The first regime is near initialization, specifically until the weights have moved by 𝒪(√m), where m denotes the network width; this is in sharp contrast to the 𝒪(1) weight motion allowed by the Neural Tangent Kernel (NTK). Here it is shown that GF and SGD need only a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself, which suffices to establish escape from bad KKT points of the margin objective, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lie in extremely well-separated groups and the sample complexity scales with the number of groups; here the contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner-layer weights are constrained to change in norm only and cannot rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast to prior work, which required infinite width and a tricky dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions and other tools which will hopefully aid future work.
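To make the training setup concrete, the following is a minimal sketch, not drawn from the paper itself, of a width-m two-layer ReLU network with standard Gaussian initialization trained by minibatch SGD on the logistic loss, while tracking how far the hidden-layer weights travel from initialization (the quantity compared against √m versus 𝒪(1) in the first regime). The synthetic data, width, step size, and fixed random outer layer are illustrative assumptions, not the paper's exact construction.

# Minimal sketch (assumed setup, not the paper's algorithm or constants):
# width-m two-layer ReLU network, standard Gaussian initialization,
# minibatch SGD on logistic loss, tracking ||W - W0||_F against sqrt(m).
import torch

torch.manual_seed(0)
n, d, m = 200, 10, 512                                      # samples, input dim, width (assumed)
X = torch.randn(n, d)
y = (X[:, 0] > 0).float() * 2 - 1                           # toy +/-1 labels (assumption)

W = (torch.randn(m, d) / d ** 0.5).requires_grad_(True)     # inner layer, Gaussian init
a = torch.sign(torch.randn(m)) / m ** 0.5                   # fixed random outer signs
W0 = W.detach().clone()

def f(x, W):
    # Two-layer ReLU network: f(x) = a . relu(W x)
    return torch.relu(x @ W.t()) @ a

opt = torch.optim.SGD([W], lr=0.1)
for step in range(2000):
    idx = torch.randint(0, n, (32,))                        # minibatch
    margin = y[idx] * f(X[idx], W)
    loss = torch.nn.functional.softplus(-margin).mean()     # logistic loss log(1 + e^{-y f})
    opt.zero_grad()
    loss.backward()
    opt.step()

moved = (W.detach() - W0).norm().item()
print(f"final loss {loss.item():.4f}, ||W - W0||_F = {moved:.2f}, sqrt(m) = {m ** 0.5:.1f}")

In this kind of experiment one can vary m and observe whether the hidden-layer movement stays well below √m, the scale under which the paper's first (near-initialization) regime applies.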

Related research:
- Neural Networks Efficiently Learn Low-Dimensional Representations with SGD (09/29/2022)
- Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks (09/26/2019)
- Is Stochastic Gradient Descent Near Optimal? (09/18/2022)
- Empirical Phase Diagram for Three-layer Neural Networks with Infinite Width (05/24/2022)
- Escaping mediocrity: how two-layer networks learn hard single-index models with SGD (05/29/2023)
- On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks (03/31/2023)
- Bounding the Width of Neural Networks via Coupled Initialization – A Worst Case Analysis (06/26/2022)
