Locally Random Alloy Codes with Channel Coding Theorems for Distributed Matrix Multiplication
Matrix multiplication is a fundamental operation in machine learning and is commonly distributed into multiple parallel tasks for large datasets. Stragglers and other failures can severely impact the overall completion time. Recent works in coded computing provide a novel strategy to mitigate stragglers with coded tasks, with the objective of minimizing the number of tasks needed to recover the overall result, known as the recovery threshold. However, we demonstrate that this combinatorial definition does not directly optimize the probability of failure. In this paper, we introduce a novel analytical metric, which focuses on the most likely event and measures the optimality of a coding scheme by its probability of decoding. Our general framework encompasses many other computational schemes and metrics as special cases. Far from being a purely theoretical construction, these definitions lead us to a practical construction of random codes for matrix multiplication, i.e., locally random alloy codes, which are optimal with respect to these measures. We present experimental results on Amazon EC2 which empirically demonstrate the improvement in terms of running time and numerical stability relative to well-established benchmarks.
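To make the notions of coded tasks and recovery threshold concrete, here is a minimal sketch of a generic random linear code for distributed matrix multiplication. It is not the locally random alloy code construction of the paper. It assumes a dense Gaussian generator matrix over the reals, a row-wise split of A, and hypothetical helper names (`encode`, `decode`), chosen only for illustration. Each of the n workers multiplies one coded block by B, and the product is recovered from any set of finished workers whose generator rows have rank k.

```python
import numpy as np

def encode(A, n_workers, k_blocks, rng):
    """Split A into k row blocks and form n random linear combinations of them."""
    blocks = np.split(A, k_blocks, axis=0)          # rows of A must be divisible by k
    G = rng.standard_normal((n_workers, k_blocks))  # illustrative Gaussian generator matrix
    coded = [sum(G[i, j] * blocks[j] for j in range(k_blocks))
             for i in range(n_workers)]
    return G, coded

def decode(G, results, finished, k_blocks):
    """Recover A @ B from the results of the workers listed in `finished`."""
    G_s = G[finished, :]
    if np.linalg.matrix_rank(G_s) < k_blocks:
        raise ValueError("not enough independent results to decode")
    stacked = np.stack([results[i] for i in finished])   # shape (m, r, c)
    m, r, c = stacked.shape
    # Solve G_s @ X = stacked for the k uncoded block products.
    sol, *_ = np.linalg.lstsq(G_s, stacked.reshape(m, r * c), rcond=None)
    return sol.reshape(k_blocks * r, c)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((4, 3))

n, k = 5, 3                       # 5 workers, any 3 suffice with probability 1
G, coded = encode(A, n, k, rng)

finished = [0, 2, 4]              # workers 1 and 3 are treated as stragglers
results = {i: coded[i] @ B for i in finished}

C_hat = decode(G, results, finished, k)
max_err = float(np.max(np.abs(C_hat - A @ B)))
print("max abs reconstruction error:", max_err)
```

In this sketch the recovery threshold is k, since a Gaussian generator makes every k-by-k submatrix invertible almost surely. The conditioning of that submatrix governs the numerical error of the decoded product, which is the kind of stability issue the abstract refers to.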