Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning

06/16/2022
by   Anastasia Koloskova, et al.

We study the asynchronous stochastic gradient descent algorithm for distributed training over n workers whose computation and communication frequencies vary over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return them to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum gradient delay τ_max and show that an ϵ-stationary point is reached after 𝒪(σ^2ϵ^-2 + τ_maxϵ^-1) iterations, where σ denotes the variance of the stochastic gradients. In this work, (i) we obtain a tighter convergence rate of 𝒪(σ^2ϵ^-2 + √(τ_maxτ_avg)ϵ^-1) without any change to the algorithm, where τ_avg is the average delay, which can be significantly smaller than τ_max. We also provide (ii) a simple delay-adaptive learning rate scheme under which asynchronous SGD achieves a convergence rate of 𝒪(σ^2ϵ^-2 + τ_avgϵ^-1) and requires neither extra hyperparameter tuning nor extra communication. Our result allows us to show for the first time that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions, motivated by federated learning applications, and improve the convergence rate by proving a weaker dependence on the maximum delay compared to prior works. In particular, we show that the heterogeneity term in the convergence rate is affected only by the average delay within each worker.
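
To make the setting concrete, below is a minimal sketch (not the authors' implementation) that simulates asynchronous SGD with a delay-adaptive step size on a toy quadratic objective. The specific rule η_t = min(η_base, c/τ_t), the toy objective, and all constants are illustrative assumptions rather than the exact scheme analyzed in the paper.

# Minimal sketch of asynchronous SGD with a delay-adaptive step size,
# simulated sequentially on a toy quadratic objective f(x) = 0.5 ||x||^2.
# Illustrative assumptions: workers finish in a random order, and the server
# scales the step size as eta_t = min(eta_base, c / tau_t), where tau_t is
# the delay of the gradient being applied.

import numpy as np

rng = np.random.default_rng(0)
d, n_workers, n_steps = 10, 4, 2000
eta_base, c, sigma = 0.1, 0.5, 0.1           # base step size, delay constant, gradient noise

x = rng.normal(size=d)                       # model parameters held by the server

def stochastic_grad(x_snapshot):
    """Noisy gradient of the toy objective, evaluated at a (possibly stale) snapshot."""
    return x_snapshot + sigma * rng.normal(size=d)

# Each worker remembers the iteration at which it last read the model.
read_iter = np.zeros(n_workers, dtype=int)
snapshots = [x.copy() for _ in range(n_workers)]

for t in range(n_steps):
    w = rng.integers(n_workers)              # some worker finishes; no synchronization
    tau = t - read_iter[w]                   # delay of the gradient it returns
    g = stochastic_grad(snapshots[w])        # gradient computed at the stale snapshot
    eta_t = min(eta_base, c / max(tau, 1))   # delay-adaptive step size (illustrative rule)
    x -= eta_t * g                           # server applies the update immediately
    read_iter[w] = t + 1                     # worker re-reads the current model and restarts
    snapshots[w] = x.copy()

print("final ||x|| =", np.linalg.norm(x))    # close to 0 on the toy objective

The key point the sketch illustrates is that large delays only shrink the step size of the corresponding stale update, so the rate is governed by the average delay rather than the worst-case delay.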


