When Expressivity Meets Trainability: Fewer than n Neurons Can Work

by   Jiawei Zhang, et al.

Modern neural networks are often quite wide, causing large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can narrow networks have as strong expressivity as wide ones? If so, does the loss function exhibit a benign optimization landscape? In this work, we provide partially affirmative answers to both questions for 1-hidden-layer networks with fewer than n (sample size) neurons when the activation is smooth. First, we prove that as long as the width m ≥ 2n/d (where d is the input dimension), its expressivity is strong, i.e., there exists at least one global minimizer with zero training loss. Second, we identify a nice local region with no local-min or saddle points. Nevertheless, it is not clear whether gradient descent can stay in this nice region. Third, we consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer. It is expected that projected gradient methods converge to KKT points under mild technical conditions, but we leave the rigorous convergence analysis to future work. Thorough numerical results show that projected gradient methods on this constrained formulation significantly outperform SGD for training narrow neural nets.


page 9

page 34

page 35


Landscape Connectivity and Dropout Stability of SGD Solutions for Over-parameterized Neural Networks

The optimization of multilayer neural networks typically leads to a solu...

Collapse of Deep and Narrow Neural Nets

Recent theoretical work has demonstrated that deep neural networks have ...

Stronger Convergence Results for Deep Residual Networks: Network Width Scales Linearly with Training Data Size

Deep neural networks are highly expressive machine learning models with ...

Penalized Projected Kernel Calibration for Computer Models

Projected kernel calibration is known to be theoretically superior, its ...

On Convergence of Training Loss Without Reaching Stationary Points

It is a well-known fact that nonconvex optimization is computationally i...

Global Convergence Analysis of Deep Linear Networks with A One-neuron Layer

In this paper, we follow Eftekhari's work to give a non-local convergenc...

Please sign up or login with your details

Forgot password? Click here to reset