Neural Networks Efficiently Learn Low-Dimensional Representations with SGD

09/29/2022
by Alireza Mousavi-Hosseini, et al.

We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD) where the input x∈ℝ^d is Gaussian and the target y ∈ℝ follows a multiple-index model, i.e., y=g(⟨u_1,x⟩,...,⟨u_k,x⟩) with a noisy link function g. We prove that the first-layer weights of the NN converge to the k-dimensional principal subspace spanned by the vectors u_1,...,u_k of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when k ≪ d. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of 𝒪(√(kd/T)) after T iterations of SGD, which is independent of the width of the NN. We further demonstrate that SGD-trained ReLU NNs can learn a single-index target of the form y=f(⟨u,x⟩) + ϵ by recovering the principal direction, with a sample complexity linear in d (up to log factors), where f is a monotonic function with at most polynomial growth, and ϵ is the noise. This is in contrast to the known d^Ω(p) sample requirement to learn any degree-p polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization. Finally, we provide compressibility guarantees for NNs using the approximate low-rank structure produced by SGD.
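Below is a minimal numpy sketch of the setting the abstract describes, assuming a single-index target (k = 1): it trains the first layer of a two-layer ReLU network with online SGD plus weight decay, then checks how much of the first-layer weight mass lies in span(u) and how concentrated the spectrum of W is (the low-rank structure behind the compressibility claim). All hyperparameters (d, m, T, the step size, the decay strength, the noise level) and the link f = tanh are illustrative assumptions, not values from the paper; the second layer is frozen purely to isolate first-layer feature learning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (assumptions, not from the paper):
d, m, T = 32, 64, 20000        # input dim, width, online SGD steps
lr, wd = 0.05, 1e-3            # step size, weight decay strength

# Single-index target y = f(<u, x>) + eps, with a monotone link (assumption).
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
f = np.tanh

# Two-layer ReLU network x -> a^T relu(W x); second layer fixed for simplicity.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

def alignment(W, u):
    """Fraction of first-layer weight mass lying in span(u)."""
    proj = W @ u
    return np.sum(proj**2) / np.sum(W**2)

print(f"alignment at init: {alignment(W, u):.3f}")

for t in range(T):
    # Online SGD: one fresh Gaussian sample per step.
    x = rng.standard_normal(d)
    y = f(u @ x) + 0.1 * rng.standard_normal()
    h = W @ x
    pred = a @ np.maximum(h, 0.0)
    # Gradient of 0.5 * (pred - y)^2 w.r.t. W, plus weight decay.
    gW = (pred - y) * np.outer(a * (h > 0), x)
    W -= lr * (gW + wd * W)

print(f"alignment after SGD: {alignment(W, u):.3f}")

# Compressibility: with k = 1 the trained W is approximately rank one,
# so the top singular value carries most of its energy.
s = np.linalg.svd(W, compute_uv=False)
print(f"top singular value share: {s[0]**2 / np.sum(s**2):.3f}")
```

In this sketch, weight decay shrinks the components of each row of W orthogonal to u while the label signal sustains the component along u, so the alignment ratio grows toward 1 and the spectrum of W concentrates, mirroring the subspace-convergence and low-rank claims above.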


Related research

05/29/2023  Escaping mediocrity: how two-layer networks learn hard single-index models with SGD
This study explores the sample complexity for two-layer neural networks ...

08/04/2022  Feature selection with gradient descent on two-layer networks in low-rotation regimes
This work establishes low test error of gradient flow (GF) and stochasti...

06/24/2020  When Do Neural Networks Outperform Kernel Methods?
For a certain scaling of the initialization of stochastic gradient desce...

04/28/2020  Learning Polynomials of Few Relevant Dimensions
Polynomial regression is a basic primitive in learning and statistics. I...

02/27/2017  SGD Learns the Conjugate Kernel Class of the Network
We show that the standard stochastic gradient descent (SGD) algorithm is ...

10/27/2022  Learning Single-Index Models with Shallow Neural Networks
Single-index models are a class of functions given by an unknown univari...
