Disentangling feature and lazy learning in deep neural networks: an empirical study

06/19/2019
by Mario Geiger, et al.

Two distinct limits for deep learning as the net width h→∞ have been proposed, depending on how the weights of the last layer scale with h. In the "lazy-learning" regime, the dynamics becomes linear in the weights and is described by a Neural Tangent Kernel Θ. By contrast, in the "feature-learning" regime, the dynamics can be expressed in terms of the density distribution of the weights. Understanding which regime accurately describes practical architectures, and which one leads to better performance, remains a challenge. We answer these questions and produce new characterizations of these regimes on the MNIST data set by considering deep nets f whose last layer of weights scales as α/√(h) at initialization, where α is a parameter we vary. We perform systematic experiments on two setups: (A) fully-connected Softplus networks trained with full-batch gradient descent with momentum, and (B) convolutional ReLU networks trained with stochastic gradient descent with momentum. We find that (1) α^*=1/√(h) separates the two regimes. (2) For both (A) and (B), feature learning outperforms lazy learning; the performance gap decreases with h, becoming hardly detectable asymptotically for (A) but remaining very significant for (B). (3) In both regimes, the fluctuations δf of the learned function induced by the initial conditions follow δf∼1/√(h), leading to a performance that increases with h. This improvement can instead be obtained at intermediate h by ensemble-averaging different networks. (4) In the feature regime there exists a time scale t_1∼α√(h) such that for t≪t_1 the dynamics is linear. At t∼t_1, the output has grown by a magnitude √(h) and the changes of the tangent kernel, ΔΘ, become significant. Ultimately ΔΘ∼(√(h)α)^-a for the ReLU and Softplus activations, with a<2 and a→2 as the depth grows.
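To make the setup concrete, the sketch below builds a width-h fully-connected Softplus network whose readout layer is initialized at scale α/√(h), trains it with momentum, and monitors the relative change of the empirical tangent kernel Θ. This is a minimal sketch under stated assumptions, not the authors' protocol: the depth, learning rate, probe points, random placeholder data, the helper names make_net and tangent_kernel, and the choice of implementing the α/√(h) scaling through the readout initialization are all illustrative assumptions.

```python
# Minimal sketch (PyTorch): a width-h fully-connected Softplus net whose readout
# layer is initialized at scale alpha/sqrt(h), plus the empirical tangent kernel
# Theta_ij = <grad_w f(x_i), grad_w f(x_j)> on a few probe points, so that the
# relative change ||Delta Theta|| / ||Theta_0|| can be tracked during training.
# Architecture, hyper-parameters and data are illustrative, not the paper's setup.

import torch
import torch.nn as nn

def make_net(d_in=784, h=512, depth=3, alpha=1.0):
    layers, width_in = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(width_in, h), nn.Softplus()]
        width_in = h
    readout = nn.Linear(h, 1, bias=False)
    # last layer of weights scales as alpha / sqrt(h) at initialization
    nn.init.normal_(readout.weight, std=alpha / h**0.5)
    layers.append(readout)
    return nn.Sequential(*layers)

def tangent_kernel(net, xs):
    """Empirical tangent kernel on probe inputs xs (one gradient per input)."""
    grads = []
    for x in xs:
        net.zero_grad()
        net(x.unsqueeze(0)).squeeze().backward()
        grads.append(torch.cat([p.grad.flatten() for p in net.parameters()]))
    G = torch.stack(grads)
    return G @ G.T

torch.manual_seed(0)
h, alpha = 512, 1.0
net = make_net(h=h, alpha=alpha)

# a handful of random probe points stand in for MNIST digits here
probes = torch.randn(8, 784)
theta0 = tangent_kernel(net, probes)

# toy full-batch training with momentum on placeholder regression data
x_train = torch.randn(128, 784)
y_train = torch.randn(128, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
for step in range(200):
    opt.zero_grad()
    loss = ((net(x_train) - y_train) ** 2).mean()
    loss.backward()
    opt.step()

theta = tangent_kernel(net, probes)
rel_change = (theta - theta0).norm() / theta0.norm()
print(f"||Delta Theta|| / ||Theta_0|| = {rel_change:.3f}")
```

In such a sketch, increasing √(h)α should suppress the printed ratio, roughly in line with the ΔΘ∼(√(h)α)^-a scaling reported above, while decreasing it should push the network toward the feature-learning regime where the kernel changes appreciably.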


