Differentially private training of residual networks with scale normalisation
We investigate the optimal choice of replacement layer for Batch Normalisation (BN) in residual networks (ResNets) trained with Differentially Private Stochastic Gradient Descent (DP-SGD), and study the phenomenon of scale mixing in residual blocks, whereby the activations on the two branches are scaled differently. Our experimental evaluation indicates that a hyperparameter search over 1 to 64 Group Normalisation (GN) groups considerably improves the accuracy of ResNet-9 and ResNet-50 on both a benchmark task (CIFAR-10) and a large-image task (ImageNette). Moreover, Scale Normalisation, a simple architectural modification that introduces an additional normalisation layer after the residual block's addition operation, further improves the utility of ResNets, allowing us to achieve state-of-the-art results on CIFAR-10.
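To make the described modification concrete, the following is a minimal sketch of a residual block with GroupNorm replacing BatchNorm and an extra normalisation layer placed after the residual addition (Scale Normalisation). It assumes a PyTorch-style implementation; the class name, layer arrangement, and group count are illustrative and not necessarily the authors' exact configuration.

```python
import torch
import torch.nn as nn


class ScaleNormResidualBlock(nn.Module):
    """Residual block with GroupNorm in place of BatchNorm and an
    additional normalisation layer after the addition (Scale Normalisation).
    Hypothetical sketch; not the authors' reference implementation."""

    def __init__(self, channels: int, groups: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.gn1 = nn.GroupNorm(groups, channels)  # BN -> GN (compatible with DP-SGD)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.gn2 = nn.GroupNorm(groups, channels)
        # Scale Normalisation: normalise after the residual addition so the
        # two branches are not combined at mismatched scales.
        self.scale_norm = nn.GroupNorm(groups, channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.gn1(self.conv1(x)))
        out = self.gn2(self.conv2(out))
        out = out + x                           # residual addition
        return self.act(self.scale_norm(out))   # normalise the summed activations
```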