Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for Deep Learning

05/23/2023, by Achraf Bahamou, et al.

We propose a new per-layer adaptive step-size procedure for stochastic first-order optimization methods for minimizing empirical loss functions in deep learning, eliminating the need for the user to tune the learning rate (LR). The proposed approach exploits the layer-wise stochastic curvature information contained in the diagonal blocks of the Hessian of deep neural networks (DNNs) to compute an adaptive step-size (i.e., LR) for each layer. The method has memory requirements comparable to those of first-order methods, while its per-iteration time complexity is increased only by an amount roughly equivalent to an additional gradient computation. Numerical experiments show that SGD with momentum and AdamW, combined with the proposed per-layer step-sizes, choose effective LR schedules and outperform fine-tuned LR versions of these methods, as well as popular first-order and second-order algorithms, for training DNNs on Autoencoder, Convolutional Neural Network (CNN) and Graph Convolutional Network (GCN) models. Finally, we prove that an idealized version of SGD with the proposed layer-wise step-sizes converges linearly when full-batch gradients are used.
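To make the idea concrete, below is a minimal PyTorch sketch of one way to derive per-layer step-sizes from layer-wise curvature. This is not the paper's exact procedure: the function name `layerwise_step_sizes`, the Rayleigh-quotient-style formula lr_l = ||g_l||^2 / (g_l^T H_ll g_l), and the use of one Hessian-vector product per layer are our assumptions for illustration (the paper's method achieves a per-iteration overhead of roughly one extra gradient computation, whereas this naive sketch is costlier).

```python
import torch


def layerwise_step_sizes(model, loss, eps=1e-8):
    """Hypothetical sketch: estimate a step-size for each layer (parameter
    tensor) from the curvature of the diagonal Hessian block H_ll along the
    layer's gradient direction g_l, via lr_l = ||g_l||^2 / (g_l^T H_ll g_l)."""
    params = [p for p in model.parameters() if p.requires_grad]
    # First-order gradients, with the graph kept so Hessian-vector products
    # can be taken afterwards.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    step_sizes = []
    for p, g in zip(params, grads):
        # Probing with a vector that equals g_l in block l and zero elsewhere
        # means differentiating g_l(theta) . g_l w.r.t. the layer's own
        # parameters yields H_ll g_l, the diagonal Hessian block times g_l.
        gv = (g * g.detach()).sum()
        hv = torch.autograd.grad(gv, p, retain_graph=True)[0]
        curvature = (g.detach() * hv).sum()               # g_l^T H_ll g_l
        lr = g.detach().pow(2).sum() / (curvature.abs() + eps)
        step_sizes.append(lr.item())
    return step_sizes
```

In practice, values of this kind could be assigned as per-parameter-group learning rates (via `param_groups`) in `torch.optim.SGD` or `torch.optim.AdamW`, which is the spirit in which the paper pairs its per-layer step-sizes with SGD with momentum and AdamW.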


