ReLU soothes the NTK condition number and accelerates optimization for wide neural networks

by Chaoyue Liu, et al.

The rectified linear unit (ReLU), as a non-linear activation function, is well known to improve the expressive power of neural networks: any continuous function can be approximated to arbitrary precision by a sufficiently wide ReLU network. In this work, we present another interesting and important property of the ReLU activation function. We show that ReLU leads to better separation of similar data and better conditioning of the neural tangent kernel (NTK), two effects that are closely related. Compared with a linear neural network, a ReLU-activated wide neural network at random initialization has a larger angle separation between similar data points in the feature space of the model gradient, and a smaller NTK condition number. Note that for a linear neural network, both the data separation and the NTK condition number remain exactly the same as for a linear model. Furthermore, we show that a deeper ReLU network (i.e., one with more ReLU activation operations) has a smaller NTK condition number than a shallower one. Our results imply that ReLU activation, as well as the depth of a ReLU network, improves the gradient descent convergence rate, which is closely tied to the NTK condition number.
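The comparison in the abstract can be sketched numerically. The following NumPy snippet (an illustrative sketch, not the paper's code) computes the empirical NTK of a randomly initialized two-layer network on two nearly parallel unit inputs, once with ReLU and once with the identity activation (a linear network), and compares the resulting condition numbers. The two-layer architecture, the width, the angle, and all variable names here are assumptions chosen for illustration.

```python
import numpy as np

def empirical_ntk(X, act, act_grad, m=16384, seed=0):
    """Empirical NTK of a two-layer net f(x) = v^T act(W x) / sqrt(m),
    with gradients taken w.r.t. both layers (W, v) at random init."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d))    # first-layer weights ~ N(0, 1)
    v = rng.standard_normal(m)         # second-layer weights ~ N(0, 1)
    Z = X @ W.T                        # (n, m) pre-activations
    Gv = act(Z) / np.sqrt(m)           # gradient of f w.r.t. v
    B = act_grad(Z) * v / np.sqrt(m)   # per-unit coefficient of x in dW-gradient
    # K_ij = <grad_i, grad_j> = Gv_i . Gv_j + (B_i . B_j) * (x_i . x_j)
    return Gv @ Gv.T + (B @ B.T) * (X @ X.T)

def condition_number(K):
    eig = np.linalg.eigvalsh(K)
    return eig[-1] / eig[0]

# Two nearly parallel unit vectors (angle 0.2 rad): "similar data".
theta = 0.2
X = np.array([[1.0, 0.0], [np.cos(theta), np.sin(theta)]])

relu = lambda z: np.maximum(z, 0.0)
relu_grad = lambda z: (z > 0).astype(z.dtype)
ident = lambda z: z
ident_grad = lambda z: np.ones_like(z)

cond_relu = condition_number(empirical_ntk(X, relu, relu_grad))
cond_lin = condition_number(empirical_ntk(X, ident, ident_grad))
print(f"ReLU NTK condition number:   {cond_relu:.1f}")
print(f"Linear NTK condition number: {cond_lin:.1f}")
```

For similar inputs the linear NTK is close to the data Gram matrix and nearly singular, while the ReLU NTK's off-diagonal entry shrinks linearly in the angle, so the ReLU condition number comes out markedly smaller, consistent with the abstract's claim.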




E-swish: Adjusting Activations to Different Network Depths

Activation functions have a notorious impact on neural networks on both ...

Neural Characteristic Activation Value Analysis for Improved ReLU Network Feature Learning

We examine the characteristic activation values of individual ReLU units...

On the relationship between multivariate splines and infinitely-wide neural networks

We consider multivariate splines and show that they have a random featur...

Random Bias Initialization Improving Binary Neural Network Training

Edge intelligence especially binary neural network (BNN) has attracted c...

Global Convergence of Sobolev Training for Overparametrized Neural Networks

Sobolev loss is used when training a network to approximate the values a...

Traversing the Local Polytopes of ReLU Neural Networks: A Unified Approach for Network Verification

Although neural networks (NNs) with ReLU activation functions have found...

Activate or Not: Learning Customized Activation

Modern activation layers use non-linear functions to activate the neuron...
