Nested Gradient Codes for Straggler Mitigation in Distributed Machine Learning

12/16/2022
by Luis Maßny, et al.

We consider distributed learning in the presence of slow and unresponsive worker nodes, referred to as stragglers. To mitigate their effect, gradient coding redundantly assigns partial computations to the workers so that the overall result can be recovered from the non-straggling workers alone. Gradient codes are designed to tolerate a fixed number of stragglers. Since the number of stragglers in practice is random and unknown a priori, tolerating a fixed number can yield a sub-optimal computation load and higher latency. We propose a gradient coding scheme that tolerates a flexible number of stragglers by carefully concatenating gradient codes designed for different straggler tolerances. Through proper task scheduling and a small amount of additional signaling, our scheme adapts the computation load of the workers to the actual number of stragglers. We analyze the latency of the proposed scheme and show that it is significantly lower than that of fixed gradient codes.
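
To make the fixed-tolerance baseline concrete, below is a minimal sketch of a classical gradient code, the fractional repetition scheme of Tandon et al. (2017), applied to a least-squares gradient. It is not the nested construction proposed in this paper; all names and parameters (n_workers, s, worker_task) are illustrative assumptions.

# A minimal sketch of gradient coding via the fractional repetition scheme
# (Tandon et al., 2017) for a least-squares gradient. This is NOT the paper's
# nested construction; it is the fixed-tolerance baseline the nesting builds
# on. All names and parameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_workers = 6   # number of workers
s = 2           # fixed straggler tolerance; requires (s + 1) | n_workers
n_groups = n_workers // (s + 1)

# Toy least-squares problem: the full gradient is X.T @ (X @ w - y).
X = rng.normal(size=(120, 5))
y = rng.normal(size=120)
w = np.zeros(5)

# Split the data into n_groups chunks; chunk g is replicated on the s + 1
# workers of group g, so each worker carries a (s + 1)/n fraction of the data.
chunks = np.array_split(np.arange(len(y)), n_groups)
group_of_worker = [i // (s + 1) for i in range(n_workers)]

def worker_task(i):
    """Partial gradient of worker i over its assigned chunk."""
    idx = chunks[group_of_worker[i]]
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi)

# Simulate s stragglers: their results never arrive at the master.
stragglers = set(rng.choice(n_workers, size=s, replace=False))
received = {i: worker_task(i) for i in range(n_workers) if i not in stragglers}

# Decode: with at most s stragglers, every group of s + 1 workers has at
# least one responder, so each chunk's partial gradient is recovered once.
partials = {}
for i, g_i in received.items():
    partials.setdefault(group_of_worker[i], g_i)
assert len(partials) == n_groups, "too many stragglers to decode"
grad = sum(partials.values())

assert np.allclose(grad, X.T @ (X @ w - y))  # matches the full gradient

Note the inefficiency that the nested scheme targets: in this baseline each worker always carries a (s + 1)/n fraction of the data, even when fewer than s workers actually straggle. Concatenating codes of different straggler tolerances lets the realized computation load shrink with the realized number of stragglers.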


Related research

04/17/2023  Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load
11/24/2022  Sequential Gradient Coding For Straggler Mitigation
07/12/2017  Gradient Coding from Cyclic MDS Codes and Expander Graphs
03/23/2023  Trading Communication for Computation in Byzantine-Resilient Gradient Coding
06/08/2020  Adaptive Gradient Coding
02/07/2018  Minimizing Latency for Secure Coded Computing Using Secret Sharing via Staircase Codes
05/16/2022  Two-Stage Coded Federated Edge Learning: A Dynamic Partial Gradient Coding Perspective
