Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks

12/08/2017
by Shankar Krishnan, et al.

Progress in deep learning is slowed by the days or weeks it takes to train large models. The natural solution of using more hardware is limited by diminishing returns, and leads to inefficient use of additional resources. In this paper, we present a large batch, stochastic optimization algorithm that is both faster than widely used algorithms for fixed amounts of computation, and also scales up substantially better as more computational resources become available. Our algorithm implicitly computes the inverse Hessian of each mini-batch to produce descent directions; we do so without either an explicit approximation to the Hessian or Hessian-vector products. We demonstrate the effectiveness of our algorithm by successfully training large ImageNet models (Inception-V3, Resnet-50, Resnet-101 and Inception-Resnet-V2) with mini-batch sizes of up to 32000 with no loss in validation error relative to current baselines, and no increase in the total number of steps. At smaller mini-batch sizes, our optimizer improves the validation error in these models by 0.8-0.9%. Alternatively, we can trade off this accuracy to reduce the number of training steps needed by roughly 10-30%. Our work is practical and easily usable by others -- only one hyperparameter (learning rate) needs tuning, and furthermore, the algorithm is as computationally cheap as the commonly used Adam optimizer.
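
The descent direction referred to above is the Newton-type step -H^{-1} g, where H is the mini-batch Hessian and g the gradient. The optimizer's name comes from the Neumann series A^{-1} = sum_{k>=0} (I - A)^k, valid when ||I - A|| < 1, which lets an inverse be applied to a vector without ever forming or inverting the matrix. The sketch below is only a generic NumPy illustration of that identity under assumed names (neumann_solve, matvec, eta are mine, not the paper's); the actual Neumann Optimizer, per the abstract, goes further and avoids even explicit Hessian-vector products.

```python
import numpy as np

# Illustrative sketch only (not the paper's algorithm): apply an approximate
# inverse Hessian to a gradient via the truncated Neumann series
#   H^{-1} g  ~=  eta * sum_{k=0}^{K-1} (I - eta*H)^k g,
# using only products H @ v, never forming or inverting H.

def neumann_solve(matvec, g, eta, num_terms=100):
    """matvec: callable computing H @ v for the (implicit) Hessian H.
    eta: scaling chosen so that ||I - eta*H|| < 1 (e.g. eta ~ 1/||H||)."""
    d = np.zeros_like(g)
    term = g.copy()
    for _ in range(num_terms):
        d += term
        term = term - eta * matvec(term)  # multiply running term by (I - eta*H)
    return eta * d

# Toy check on a small symmetric positive-definite "Hessian".
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
H = M @ M.T + 5.0 * np.eye(5)           # eigenvalues bounded away from zero
g = rng.normal(size=5)

d = neumann_solve(lambda v: H @ v, g, eta=1.0 / np.linalg.norm(H, 2))
print(np.allclose(d, np.linalg.solve(H, g)))  # True: matches the direct solve
```

In a real optimizer the Hessian would never be materialized as a matrix; the point of the sketch is only the series identity behind the method's name, with each extra term refining the plain gradient step toward a curvature-corrected one.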

Related research

11/29/2018 · Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs
Large-scale distributed training of deep neural networks suffers from the...

02/13/2020 · Scalable and Practical Natural Gradient for Large-Scale Deep Learning
Large-scale distributed training of deep neural networks results in mode...

12/07/2018 · Nonlinear Conjugate Gradients For Scaling Synchronous Distributed DNN Training
Nonlinear conjugate gradient (NLCG) based optimizers have shown superior...

01/24/2023 · Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization
We propose ActiveLR, an optimization meta algorithm that localizes the l...

08/26/2021 · The Number of Steps Needed for Nonconvex Optimization of a Deep Learning Optimizer is a Rational Function of Batch Size
Recently, convergence as well as convergence rate analyses of deep learn...

12/30/2020 · Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability
Distributed deep learning is an effective way to reduce the training tim...

06/25/2021 · Ranger21: a synergistic deep learning optimizer
As optimizers are critical to the performances of neural networks, every...
