Scalable Adaptive Stochastic Optimization Using Random Projections

11/21/2016
by Gabriel Krummenacher, et al.

Adaptive stochastic gradient methods such as AdaGrad have gained popularity, in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal-matrix approximation to second-order information by accumulating past gradients, which are used to tune the step size adaptively. In certain situations the full-matrix variant of AdaGrad is expected to attain better performance; however, in high dimensions it is computationally impractical. We present Ada-LR and RadaGrad, two computationally efficient approximations to full-matrix AdaGrad based on randomized dimensionality reduction. They are able to capture dependencies between features and achieve performance similar to full-matrix AdaGrad, but at a much smaller computational cost. We show that the regret of Ada-LR is close to the regret of full-matrix AdaGrad, which can have an up to exponentially smaller dependence on the dimension than the diagonal variant. Empirically, we show that Ada-LR and RadaGrad perform similarly to full-matrix AdaGrad. On the tasks of training convolutional and recurrent neural networks, RadaGrad achieves faster convergence than diagonal AdaGrad.
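For intuition, the sketch below contrasts the standard diagonal AdaGrad update with a random-projection approximation of the full-matrix preconditioner: the gradient is projected with a Gaussian matrix to k dimensions, a k-by-k outer-product matrix is accumulated there instead of the full d-by-d matrix, and the preconditioned step is mapped back to the original space. This is a minimal illustrative sketch, not the authors' Ada-LR or RadaGrad implementation (which use more refined sketching and corrections); the names `diag_adagrad_step`, `projected_adagrad_step`, `Pi`, and `B` are assumptions made here for illustration.

```python
import numpy as np

def diag_adagrad_step(x, g, G_diag, eta=0.1, eps=1e-8):
    """Standard diagonal AdaGrad: accumulate squared gradients coordinate-wise."""
    G_diag += g * g
    x -= eta * g / (np.sqrt(G_diag) + eps)
    return x, G_diag

def projected_adagrad_step(x, g, Pi, B, eta=0.1, eps=1e-8):
    """Illustrative random-projection variant (not the paper's exact algorithm):
    accumulate outer products of the projected gradient (k x k instead of d x d)
    and precondition in the low-dimensional subspace.
    Pi is a k x d Gaussian random projection; B is the accumulated k x k matrix."""
    gp = Pi @ g                       # project the gradient to k dimensions
    B += np.outer(gp, gp)             # low-dimensional second-moment accumulator
    # Inverse square root of B via eigendecomposition of the small k x k matrix.
    w, V = np.linalg.eigh(B)
    inv_sqrt = V @ np.diag(1.0 / (np.sqrt(np.maximum(w, 0.0)) + eps)) @ V.T
    # Precondition the projected gradient and map the step back to d dimensions.
    x -= eta * (Pi.T @ (inv_sqrt @ gp))
    return x, B

# Hypothetical usage on a d-dimensional problem, projecting to k dimensions.
d, k = 1000, 20
rng = np.random.default_rng(0)
Pi = rng.standard_normal((k, d)) / np.sqrt(k)   # assumed Gaussian projection
x, G_diag = np.zeros(d), np.zeros(d)
B = 1e-3 * np.eye(k)                            # small ridge for numerical stability
```

The point of the sketch is the cost trade-off: the eigendecomposition is performed on a k x k matrix rather than a d x d one, so the per-step cost scales with the projection dimension k instead of the ambient dimension d.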


