Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients?
We give a rigorous analysis of the statistical behavior of gradients in randomly initialized feed-forward networks with ReLU activations. Our results show that a fully connected ReLU net of depth $d$ with hidden layer widths $n_j$ will have exploding and vanishing gradients if and only if $\sum_{j=1}^{d-1} 1/n_j$ is large. The point of view of this article is that whether a given neural net will suffer from exploding or vanishing gradients is mainly a function of the net's architecture, and hence can be tested at initialization. Our results imply that a fully connected network which produces manageable gradients at initialization must have many hidden layers that are roughly as wide as the network is deep. This work is related to the mean field theory approach to random neural nets; from that point of view, we give a rigorous computation of the $1/n_j$ corrections to the propagation of gradients at the so-called edge of chaos.
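The claim that exploding and vanishing gradients can be tested at initialization suggests a simple numerical experiment. Below is a minimal sketch (not code from the paper): it samples He-initialized ReLU nets with no biases and measures the normalized fluctuation of a squared entry of the input-output Jacobian, comparing two architectures of equal depth whose values of $\sum_{j=1}^{d-1} 1/n_j$ differ. All function names, the choice of the $(0,0)$ Jacobian entry, and the specific widths are illustrative assumptions.

```python
# Sketch: probe gradient stability at initialization by measuring how much
# squared input-output Jacobian entries fluctuate across random ReLU nets.
# He initialization (W_ij ~ N(0, 2/fan_in)), no biases; illustrative only.
import numpy as np

def jacobian_entry_sq(widths, rng):
    """Squared (0,0) entry of the input-output Jacobian of one random ReLU net.

    widths = [n_0, n_1, ..., n_d]: layer widths, with n_0 the input dimension.
    """
    x = rng.standard_normal(widths[0])
    J = np.eye(widths[0])                    # running Jacobian d(layer)/d(input)
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
        x = W @ x
        mask = (x > 0).astype(float)         # ReLU derivative at this layer
        J = (W * mask[:, None]) @ J          # chain rule: diag(mask) @ W @ J
        x = np.maximum(x, 0.0)
    return J[0, 0] ** 2

def empirical_fluctuation(widths, trials=2000, seed=0):
    """Variance of the squared Jacobian entry, normalized by its mean squared."""
    rng = np.random.default_rng(seed)
    samples = np.array([jacobian_entry_sq(widths, rng) for _ in range(trials)])
    return samples.var() / samples.mean() ** 2

# Same depth, very different sums of reciprocal hidden-layer widths:
wide   = [10] + [100] * 20   # sum of 1/n_j = 20/100 = 0.2
narrow = [10] + [5]   * 20   # sum of 1/n_j = 20/5   = 4.0
print("wide  :", empirical_fluctuation(wide))
print("narrow:", empirical_fluctuation(narrow))
```

Under these assumptions, the narrow architecture (larger sum of reciprocal widths) should show fluctuations that are orders of magnitude larger across random draws, consistent with the stated criterion: gradient behavior is governed by $\sum_j 1/n_j$, not by depth or parameter count alone.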