Collegial Ensembles

06/13/2020
by Etai Littwin, et al.

Modern neural network performance typically improves as model size increases. A recent line of research on the Neural Tangent Kernel (NTK) of over-parameterized networks indicates that the improvement with size is a product of a better-conditioned loss landscape. In this work, we investigate a form of over-parameterization achieved through ensembling, where we define collegial ensembles (CE) as the aggregation of multiple independent models with identical architectures, trained as a single model. We show that the optimization dynamics of CE simplify dramatically when the number of models in the ensemble is large, resembling the dynamics of wide models yet scaling much more favorably. We use recent theoretical results on the finite-width corrections of the NTK to perform efficient architecture search in a space of finite-width CE that aims to either minimize capacity or maximize trainability under a set of constraints. The resulting ensembles can be efficiently implemented in practical architectures using group convolutions and block-diagonal layers. Finally, we show how our framework can be used to analytically derive optimal group convolution modules originally found using expensive grid searches, without having to train a single model.
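The abstract notes that a collegial ensemble can be realized with group convolutions, whose block-diagonal weight structure keeps the ensemble members' parameters independent while the aggregate is trained as a single model. Below is a minimal sketch of that idea in PyTorch; the class name CollegialEnsembleBlock, the replication of the input per member, the averaging used for aggregation, and all hyperparameters are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch of a collegial-ensemble style block (not the paper's code).
# Each of the `num_models` branches has an identical architecture with independent
# weights; a grouped convolution runs all branches in parallel, and their outputs
# are aggregated into a single prediction.
import torch
import torch.nn as nn


class CollegialEnsembleBlock(nn.Module):  # hypothetical name
    def __init__(self, in_channels: int, out_channels: int,
                 num_models: int, kernel_size: int = 3):
        super().__init__()
        self.num_models = num_models
        self.out_channels = out_channels
        # groups=num_models gives a block-diagonal weight matrix: each member
        # sees only its own copy of the input and uses its own filters.
        self.grouped_conv = nn.Conv2d(
            in_channels * num_models,
            out_channels * num_models,
            kernel_size,
            padding=kernel_size // 2,
            groups=num_models,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, H, W) -> replicate channels once per member.
        x_rep = x.repeat(1, self.num_models, 1, 1)
        y = self.grouped_conv(x_rep)
        # Reshape to (batch, num_models, out_channels, H, W) and average the
        # members, so the whole ensemble trains end to end as one model.
        b, _, h, w = y.shape
        y = y.view(b, self.num_models, self.out_channels, h, w)
        return y.mean(dim=1)


# Usage example (shapes only):
# block = CollegialEnsembleBlock(in_channels=16, out_channels=32, num_models=4)
# out = block(torch.randn(8, 16, 28, 28))  # -> (8, 32, 28, 28)
```

Averaging is only one possible aggregation; the point of the sketch is that the grouped convolution keeps the members architecturally identical and parameter-independent while exposing them to the optimizer as a single wide model.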


Related research

Embedded Ensembles: Infinite Width Limit and Operating Regimes (02/24/2022)
A memory efficient approach to ensembling neural networks is to share mo...

Bayesian Deep Ensembles via the Neural Tangent Kernel (07/11/2020)
We explore the link between deep ensembles and Gaussian processes (GPs) ...

Fast Finite Width Neural Tangent Kernel (06/17/2022)
The Neural Tangent Kernel (NTK), defined as Θ_θ^f(x_1, x_2) = [∂ f(θ, x_...

On the infinite width limit of neural networks with a standard parameterization (01/21/2020)
There are currently two parameterizations used to derive fixed kernels c...

Deep Ensembles on a Fixed Memory Budget: One Wide Network or Several Thinner Ones? (05/14/2020)
One of the generally accepted views of modern deep learning is that incr...

GroSS: Group-Size Series Decomposition for Whole Search-Space Training (12/02/2019)
We present Group-size Series (GroSS) decomposition, a mathematical formu...

Bayesian Deep Learning Hyperparameter Search for Robust Function Mapping to Polynomials with Noise (06/23/2021)
Advances in neural architecture search, as well as explainability and in...
