Towards Understanding Learning in Neural Networks with Linear Teachers

by Roei Sarussi, et al.

Can a neural network minimizing cross-entropy learn linearly separable data? Despite progress in the theory of deep learning, this question remains unresolved. Here we prove that SGD globally optimizes this learning problem for a two-layer network with Leaky ReLU activations. The learned network can in principle be very complex; empirically, however, it often turns out to be approximately linear. We provide theoretical support for this phenomenon by proving that if the network weights converge to two weight clusters, the decision boundary is approximately linear. Finally, we show a condition on the optimization that leads to weight clustering, and we provide empirical results that validate our theoretical analysis.
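The setting described above can be reproduced in a few lines of NumPy. The sketch below is illustrative only, not the paper's construction: it uses full-batch gradient descent with a fixed second layer (rather than SGD on both layers), and all hyperparameters (width, learning rate, margin, Leaky ReLU slope) are arbitrary choices for the demo. It trains a two-layer Leaky ReLU network on linearly separable data with the logistic (binary cross-entropy) loss and reports the resulting training accuracy and the alignment of the hidden neurons with the true separating direction, where the two-cluster tendency can be observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linearly separable data in 2D: labels given by a ground-truth direction,
# with points too close to the boundary discarded to enforce a margin.
n, d = 200, 2
X = rng.normal(size=(n, d))
w_star = np.array([1.0, -0.5])
y = np.sign(X @ w_star)
keep = np.abs(X @ w_star) > 0.2
X, y = X[keep], y[keep]

alpha = 0.1                                         # Leaky ReLU negative slope
k = 20                                              # hidden width
W = rng.normal(size=(k, d)) * 0.1                   # small random init
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)    # fixed-sign outer layer

def lrelu(z):  return np.where(z > 0, z, alpha * z)
def dlrelu(z): return np.where(z > 0, 1.0, alpha)

lr = 0.5
for step in range(3000):
    Z = X @ W.T                                     # (n, k) pre-activations
    f = lrelu(Z) @ v                                # network output
    # dL/df for logistic loss log(1 + exp(-y f)); stable form of -y*sigmoid(-y f).
    g = -y * 0.5 * (1.0 - np.tanh(0.5 * y * f))
    # Gradient w.r.t. hidden weights (outer layer v is kept fixed here).
    GW = ((g[:, None] * dlrelu(Z)) * v[None, :]).T @ X / len(X)
    W -= lr * GW

acc = np.mean(np.sign(lrelu(X @ W.T) @ v) == y)
print(f"train accuracy: {acc:.2f}")

# Neuron directions: positive-v neurons tend toward +w_star, negative-v
# neurons toward -w_star, i.e. two weight clusters (cf. the paper's condition).
dirs = W / np.linalg.norm(W, axis=1, keepdims=True)
cos = dirs @ (w_star / np.linalg.norm(w_star))
print("mean cosine with w_star, v>0 neurons:", cos[v > 0].mean())
print("mean cosine with w_star, v<0 neurons:", cos[v < 0].mean())
```

When the neurons collapse into two opposite clusters like this, the network computes a function whose decision boundary is close to a single hyperplane, which is the approximately-linear behavior the abstract refers to.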




