Consistent Lock-free Parallel Stochastic Gradient Descent for Fast and Stable Convergence

02/17/2021
by   Karl Bäckström, et al.

Stochastic gradient descent (SGD) is an essential element in Machine Learning (ML) algorithms. Asynchronous parallel shared-memory SGD (AsyncSGD), including synchronization-free algorithms such as HOGWILD!, has received interest in certain contexts due to its reduced overhead compared to synchronous parallelization. Although these algorithms induce staleness and inconsistency, they have shown speedup for problems with smooth, strongly convex targets and sparse gradients. Recent works take important steps towards understanding the potential of parallel SGD for problems not conforming to these strong assumptions, in particular for deep learning (DL). There is, however, a gap in the current literature in understanding when AsyncSGD algorithms are useful in practice, and in particular what role mechanisms for synchronization and consistency play. We focus on the impact of consistency-preserving non-blocking synchronization on SGD convergence and on sensitivity to hyper-parameter tuning. We propose Leashed-SGD, an extensible algorithmic framework of consistency-preserving implementations of AsyncSGD, employing lock-free synchronization, effectively balancing throughput and latency. We argue analytically about the dynamics of the algorithms, memory consumption, the threads' progress over time, and the expected contention. We provide a comprehensive empirical evaluation, validating the analytical claims, benchmarking the proposed Leashed-SGD framework, and comparing to baselines for training multilayer perceptrons (MLP) and convolutional neural networks (CNN). We observe the crucial impact of contention, staleness, and consistency, and show how Leashed-SGD provides significant improvements in stability as well as wall-clock time to convergence (a 20-80% reduction compared to the standard lock-based AsyncSGD algorithm and HOGWILD!), while reducing the overall memory footprint.
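To illustrate the kind of consistency-preserving, lock-free synchronization the abstract describes, below is a minimal sketch (not the paper's actual Leashed-SGD implementation) of asynchronous shared-memory SGD where threads publish a fresh parameter snapshot via atomic compare-and-swap. Each worker reads a consistent snapshot, computes a local gradient step on a toy objective f(x) = (x - 3)^2, and retries on a CAS conflict, so no thread ever observes a half-updated parameter vector and no thread blocks. All names and the toy objective are illustrative assumptions.

```java
import java.util.concurrent.atomic.AtomicReference;

public class LockFreeSgdSketch {
    // Shared model: a consistent snapshot is published atomically by
    // swapping the array reference, so readers never see a partial update.
    static final AtomicReference<double[]> params =
        new AtomicReference<>(new double[]{0.0});

    // Gradient of the toy objective f(x) = (x - 3)^2.
    static double grad(double x) { return 2.0 * (x - 3.0); }

    static void worker(int steps, double lr) {
        for (int i = 0; i < steps; i++) {
            while (true) {
                double[] snap = params.get();            // consistent read
                double[] next = snap.clone();
                next[0] = snap[0] - lr * grad(snap[0]);  // local SGD step
                // Publish only if no other thread won the race in between;
                // otherwise retry on a fresh snapshot (lock-free, non-blocking).
                if (params.compareAndSet(snap, next)) break;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> worker(500, 0.05));
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        System.out.printf("x = %.4f%n", params.get()[0]);
    }
}
```

A HOGWILD!-style variant would instead write into the shared array directly, without the CAS retry loop, trading consistency for throughput; the retry loop here is what makes every applied gradient be computed on a consistent snapshot.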

Related research

01/16/2020
Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent
Machine learning has made tremendous progress in recent years, with mode...

11/08/2019
MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is very useful in optimization problem...

03/23/2018
The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory
Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine ...

01/17/2021
Guided parallelized stochastic gradient descent for delay compensation
Stochastic gradient descent (SGD) algorithm and its variations have been...

06/23/2015
On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants
We study optimization algorithms based on variance reduction for stochas...

04/16/2021
Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning
Stochastic Gradient Descent (SGD) has become the de facto way to train d...

01/16/2013
Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients
Recent work has established an empirically successful framework for adap...
