A Principle of Least Action for the Training of Neural Networks

by Skander Karkar, et al.

Neural networks achieve high generalization performance on many tasks despite being heavily over-parameterized. Since classical statistical learning theory struggles to explain this behavior, much recent effort has focused on uncovering the mechanisms behind it, in the hope of developing a more adequate theoretical framework and gaining better control over trained models. In this work, we adopt an alternative perspective, viewing the neural network as a dynamical system that displaces input particles over time. We conduct a series of experiments and, by analyzing the network's behavior through its displacements, show the presence of a low-kinetic-energy displacement bias in the network's transport map, and link this bias to generalization performance. From this observation, we reformulate the learning problem as follows: find a neural network that solves the task while transporting the data as efficiently as possible. This novel formulation allows us to provide regularity results for the solution network, based on optimal transport theory. From a practical viewpoint, it leads us to propose a new learning algorithm that automatically adapts to the complexity of the given task and yields networks with high generalization ability even in low-data regimes.
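The least-action idea in the abstract can be illustrated with a minimal numerical sketch (the function names and the exact regularized objective below are illustrative assumptions, not the authors' implementation): treating a sample's layer-wise activations as a particle trajectory, the discrete kinetic energy is the summed squared displacement between consecutive layers, and it can be added to the task loss as a transport-cost penalty.

```python
import numpy as np

def kinetic_energy(trajectory):
    """Discrete kinetic energy of a trajectory x_0, ..., x_K (shape (K+1, d)):
    the sum of squared displacements between consecutive layers."""
    diffs = np.diff(trajectory, axis=0)  # per-layer displacements, shape (K, d)
    return float(np.sum(diffs ** 2))

def regularized_loss(task_loss, trajectories, lam=0.1):
    """Hypothetical least-action objective: task_loss is the scalar data-fitting
    loss; trajectories is a list of per-sample trajectories, each of shape
    (K+1, d); lam trades off accuracy against transport cost."""
    transport_cost = sum(kinetic_energy(t) for t in trajectories) / len(trajectories)
    return task_loss + lam * transport_cost

# Example: one particle moved in two unit steps has kinetic energy 1 + 1 = 2.
traj = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
print(kinetic_energy(traj))  # 2.0
```

Among all transport maps that fit the data, this penalty favors the one that moves each particle the least, which is the low-kinetic-energy bias the paper links to generalization.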


