Straggler Mitigation at Scale

06/25/2019
by Mehmet Fatih Aktas, et al.

Runtime performance variability at the servers has been a major issue, hindering predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be effective for mitigating variability, both in theory and in practice. Systems that employ redundancy have drawn significant attention, and numerous papers have analyzed the pain and gain of redundancy under various service models and assumptions on the runtime variability. This paper presents a cost (pain) vs. latency (gain) analysis of executing jobs of many tasks by employing replicated or erasure-coded redundancy. The tail heaviness of the service time distribution largely determines the pain and gain of redundancy, and we quantify its effect by deriving expressions for the cost and latency. Specifically, we try to answer four questions: 1) How do replicated and coded redundancy compare in the cost vs. latency tradeoff? 2) Can we introduce redundancy after waiting some time and expect to reduce the cost? 3) Can relaunching the tasks that appear to be straggling after some time help to reduce cost and/or latency? 4) Is it effective to use redundancy and relaunching together? We validate our answers to each of these questions via simulations that use empirical distributions extracted from Google cluster data.
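The first question above, replicated vs. coded redundancy, can be illustrated with a small Monte Carlo sketch. This is not the paper's analysis or simulator; it is a minimal illustration under stated assumptions. Task service times are assumed i.i.d. Pareto with minimum 1 and tail index `alpha`, latency is the job completion time, and cost is the total server time consumed, assuming outstanding copies are cancelled the moment they are no longer needed. The function names and parameters (`replicated_job`, `coded_job`, `average`, `k`, `c`, `n`, `alpha`) are hypothetical.

```python
import random


def pareto(rng, alpha, x_min=1.0):
    # Assumed service-time model: Pareto with minimum x_min and tail index alpha.
    # A smaller alpha means a heavier tail.
    return x_min * rng.paretovariate(alpha)


def replicated_job(rng, k, c, alpha):
    # Each of the k tasks runs as c + 1 copies. A task finishes when its
    # fastest copy finishes; the job finishes when every task has finished.
    latency, cost = 0.0, 0.0
    for _ in range(k):
        first = min(pareto(rng, alpha) for _ in range(c + 1))
        latency = max(latency, first)
        # All c + 1 copies occupy a server until the fastest one completes.
        cost += (c + 1) * first
    return latency, cost


def coded_job(rng, k, n, alpha):
    # The job is encoded into n >= k tasks and finishes as soon as any k finish.
    times = sorted(pareto(rng, alpha) for _ in range(n))
    latency = times[k - 1]
    # The k fastest tasks run to completion; the remaining n - k are
    # cancelled at the job completion time.
    cost = sum(times[:k]) + (n - k) * latency
    return latency, cost


def average(job, trials, *args, seed=0):
    # Average latency and cost of `job` over independent trials.
    rng = random.Random(seed)
    total_latency = total_cost = 0.0
    for _ in range(trials):
        latency, cost = job(rng, *args)
        total_latency += latency
        total_cost += cost
    return total_latency / trials, total_cost / trials


if __name__ == "__main__":
    k, alpha, trials = 10, 2.0, 5000
    # Both schemes occupy 20 servers: one extra replica per task vs. n = 2k.
    print("replicated (latency, cost):", average(replicated_job, trials, k, 1, alpha))
    print("coded      (latency, cost):", average(coded_job, trials, k, 2 * k, alpha))
```

Varying `alpha` in this sketch is one way to see how tail heaviness changes the tradeoff. Any numbers it produces reflect the assumed Pareto model and cancellation rule, not the empirical Google cluster distributions used in the paper.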

research
10/01/2017

Straggler Mitigation by Delayed Relaunch of Tasks

Redundancy for straggler mitigation, originally in data download and mor...
research
10/02/2017

Effective Straggler Mitigation: Which Clones Should Attack and When?

Redundancy for straggler mitigation, originally in data download and mor...
research
03/24/2021

Comparison of the FCFS and PS discipline in Redundancy Systems

We consider the c.o.c. redundancy system with N parallel servers where i...
research
07/06/2018

Faster Data-access in Large-scale Systems: Network-scale Latency Analysis under General Service-time Distributions

In cloud storage systems with a large number of servers, files are typic...
research
06/12/2019

Optimizing Redundancy Levels in Master-Worker Compute Clusters for Straggler Mitigation

Runtime variability in computing systems causes some tasks to straggle a...
research
03/17/2021

A Survey of Stability Results for Redundancy Systems

Redundancy mechanisms consist in sending several copies of a same job to...
research
06/05/2019

Collage Inference: Achieving low tail latency during distributed image classification using coded redundancy models

Reducing the latency variance in machine learning inference is a key req...
