Scaling up Stochastic Gradient Descent for Non-convex Optimisation

by Saad Mohamad, et al.

Stochastic gradient descent (SGD) is a widely adopted iterative method for optimizing differentiable objective functions. In this paper, we propose and discuss a novel approach to scale up SGD in applications involving non-convex functions and large datasets. We address the bottlenecks that arise when using shared memory and when using distributed memory. Typically, the former is bounded by limited computation resources and bandwidth, whereas the latter suffers from communication overheads. We propose a unified distributed and parallel implementation of SGD (named DPSGD) that relies on both asynchronous distribution and lock-free parallelism. By combining the two strategies into a unified framework, DPSGD is able to strike a better trade-off between local computation and communication. The convergence properties of DPSGD are studied for non-convex problems such as those arising in statistical modelling and machine learning. Our theoretical analysis shows that DPSGD leads to speed-up with respect to the number of cores and the number of workers while guaranteeing an asymptotic convergence rate of O(1/√T), given that the number of cores is bounded by T^(1/4) and the number of workers is bounded by T^(1/2), where T is the number of iterations. The potential gains that can be achieved by DPSGD are demonstrated empirically on a stochastic variational inference problem (Latent Dirichlet Allocation) and on a deep reinforcement learning (DRL) problem (advantage actor-critic, A2C), resulting in two algorithms: DPSVI and HSA2C. Empirical results validate our theoretical findings. Comparative studies are conducted to compare the proposed DPSGD against state-of-the-art DRL algorithms.
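
To give a feel for the scaling conditions in the abstract, the following minimal sketch computes the largest core and worker counts permitted when the number of cores is capped at the fourth root of T and the number of workers at the square root of T. It assumes the bounds hold with constant factor 1, which the abstract does not state (asymptotic bounds normally hide constants), and the function name `dpsgd_scaling_bounds` is hypothetical rather than taken from the paper.

```python
import math


def dpsgd_scaling_bounds(num_iterations: int) -> tuple[int, int]:
    """Return (max_cores, max_workers) for a given iteration count T.

    Assumption: the abstract's bounds are applied with constant factor 1,
    i.e. cores <= T^(1/4) and workers <= T^(1/2). The paper's actual
    constants are not given in the abstract.
    """
    if num_iterations < 1:
        raise ValueError("num_iterations must be a positive integer")
    # Integer square root gives floor(T^(1/2)) without floating-point error.
    max_workers = math.isqrt(num_iterations)
    # floor(sqrt(floor(sqrt(T)))) equals floor(T^(1/4)) for positive integers.
    max_cores = math.isqrt(max_workers)
    return max_cores, max_workers


if __name__ == "__main__":
    for t in (10**4, 10**6, 10**8):
        cores, workers = dpsgd_scaling_bounds(t)
        print(f"T={t}: cores <= {cores}, workers <= {workers}")
```

Under these assumptions the admissible parallelism grows slowly with the iteration budget: multiplying T by 10,000 only multiplies the core bound by 10 and the worker bound by 100.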


Sparsification as a Remedy for Staleness in Distributed Asynchronous SGD

Large scale machine learning is increasingly relying on distributed opti...

On the Convergence of Memory-Based Distributed SGD

Distributed stochastic gradient descent (DSGD) has been widely used for ...

Convergence Analysis of Homotopy-SGD for non-convex optimization

First-order stochastic methods for solving large-scale non-convex optimi...

Global Momentum Compression for Sparse Communication in Distributed SGD

With the rapid growth of data, distributed stochastic gradient descent (...

Asynchronous Stochastic Variational Inference

Stochastic variational inference (SVI) employs stochastic optimization t...

IntSGD: Floatless Compression of Stochastic Gradients

We propose a family of lossy integer compressions for Stochastic Gradien...

Asynchronous Fully-Decentralized SGD in the Cluster-Based Model

This paper presents fault-tolerant asynchronous Stochastic Gradient Desc...
