Don't Decay the Learning Rate, Increase the Batch Size

11/01/2017
by Samuel L. Smith, et al.

It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate ϵ and scaling the batch size B ∝ ϵ. Finally, one can increase the momentum coefficient m and scale B ∝ 1/(1−m), although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train Inception-ResNet-V2 on ImageNet to 77% validation accuracy in under 2500 parameter updates, efficiently utilizing training batches of 65536 images.
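The substitution described above maps directly onto a scheduling rule: wherever a conventional recipe would multiply the learning rate by a decay factor, increase the batch size by the same factor instead, until a hardware-imposed maximum is reached, after which the remaining decay is applied to the learning rate as usual. The following is a minimal Python sketch of that rule, not the authors' code; the function name, schedule boundaries, base learning rate, and decay factor are illustrative assumptions, with only the 65536 batch-size ceiling taken from the abstract.

```python
# Minimal sketch (not the authors' code): replace step learning-rate decay
# with batch-size increases, keeping the ratio of learning rate to batch
# size roughly constant. Schedule values below are illustrative assumptions.

def batch_size_schedule(epoch, base_lr=0.1, base_batch=128,
                        decay_epochs=(30, 60, 80), decay_factor=0.1,
                        max_batch=65536):
    """Return (learning_rate, batch_size) for a given epoch.

    Wherever a conventional schedule would multiply the learning rate by
    `decay_factor`, we instead divide the batch size by it (i.e. increase B),
    until `max_batch` is reached; any remaining decay then falls back to the
    learning rate.
    """
    lr, batch = base_lr, base_batch
    for boundary in decay_epochs:
        if epoch >= boundary:
            proposed = batch / decay_factor      # e.g. a 10x larger batch
            if proposed <= max_batch:
                batch = proposed                 # grow the batch instead
            else:
                lr *= decay_factor               # batch capped: decay lr
    return lr, int(batch)

if __name__ == "__main__":
    for epoch in (0, 30, 60, 80):
        print(epoch, batch_size_schedule(epoch))
```

Under these assumed settings the batch grows from 128 to 1280 and then 12800 at the first two boundaries, and only at the third, once the cap would be exceeded, is the learning rate itself decayed.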

Related research

06/02/2022 · Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions
We analyze the dynamics of large batch stochastic gradient descent with ...

08/13/2017 · Large Batch Training of Convolutional Networks
A common way to speed up training of large convolutional networks is to ...

03/26/2018 · A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay
Although deep learning has produced dazzling successes for applications ...

02/26/2020 · Stagewise Enlargement of Batch Size for SGD-based Learning
Existing research shows that the batch size can seriously affect the per...

04/26/2019 · Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources
With an increasing demand for training powers for deep learning algorith...

08/03/2022 · Empirical Study of Overfitting in Deep FNN Prediction Models for Breast Cancer Metastasis
Overfitting is defined as the fact that the current model fits a specifi...

10/11/2019 · Decaying momentum helps neural network training
Momentum is a simple and popular technique in deep learning for gradient...
