A Random Matrix Perspective on Mixtures of Nonlinearities for Deep Learning

by Ben Adlam, et al.

One of the distinguishing characteristics of modern deep learning systems is that they typically employ neural network architectures with enormous numbers of parameters, often in the millions and sometimes even in the billions. While this paradigm has inspired significant research on the properties of large networks, relatively little work has been devoted to the fact that these networks are often used to model large, complex datasets, which may themselves contain millions or even billions of constraints. In this work, we focus on this high-dimensional regime, in which both the dataset size and the number of features tend to infinity. We analyze the performance of a simple regression model trained on the random features F = f(WX + B) for a random weight matrix W and random bias vector B, obtaining an exact formula for the asymptotic training error on a noisy autoencoding task. The role of the bias can be understood as parameterizing a distribution over activation functions, and our analysis directly generalizes to such distributions, even those not expressible with a traditional additive bias. Intriguingly, we find that a mixture of nonlinearities can outperform the best single nonlinearity on the noisy autoencoding task, suggesting that mixtures of nonlinearities might be useful for approximate kernel methods or neural network architecture design.
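To make the setup in the abstract concrete, the following is a minimal finite-size sketch of ridge regression on random features F = f(WX + B) for a noisy autoencoding task. It illustrates the kind of model being analyzed and does not reproduce the paper's asymptotic formula or its experiments. Everything beyond F = f(WX + B) is an assumption made for illustration: the Gaussian data, weights, and biases, the 1/sqrt(n0) scaling, the ridge penalty, the choice of tanh and ReLU, and the helper names `random_features`, `ridge_training_error`, and `run`. A "mixture" here simply means that different blocks of features use different activation functions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def random_features(X, W, b, activations):
    """Compute F = f(W X / sqrt(n0) + b), using one activation per block of rows.

    Passing a single activation gives ordinary random features; passing several
    gives a mixture, with the features split into equal-sized blocks (assumed).
    """
    n0 = X.shape[0]
    Z = W @ X / np.sqrt(n0) + b[:, None]
    blocks = np.array_split(np.arange(W.shape[0]), len(activations))
    F = np.empty_like(Z)
    for idx, f in zip(blocks, activations):
        F[idx] = f(Z[idx])
    return F

def ridge_training_error(F, Y, ridge):
    """Fit Y ~ A F by ridge regression and return the mean squared training error."""
    n1, m = F.shape
    # Closed-form ridge solution: A = Y F^T (F F^T + ridge * m * I)^{-1}
    G = F @ F.T + ridge * m * np.eye(n1)
    A = np.linalg.solve(G, F @ Y.T).T
    return float(np.mean((Y - A @ F) ** 2))

def run(activations, n0=50, n1=100, m=200, noise=0.5, ridge=1e-2, seed=0):
    """Noisy autoencoding: reconstruct clean X from features of X + noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n0, m))            # clean data, one column per sample
    X_noisy = X + noise * rng.standard_normal((n0, m))
    W = rng.standard_normal((n1, n0))           # random, untrained weights
    b = rng.standard_normal(n1)                 # random, untrained biases
    F = random_features(X_noisy, W, b, activations)
    return ridge_training_error(F, X, ridge)

if __name__ == "__main__":
    print("tanh only:     ", run([np.tanh]))
    print("relu only:     ", run([relu]))
    print("tanh + relu:   ", run([np.tanh, relu]))
```

Which configuration attains the lowest error in this toy setting depends on the assumed sizes, noise level, and ridge penalty, so the printed numbers should not be read as confirming or refuting the paper's finding.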

