Communication Scheduling as a First-Class Citizen in Distributed Machine Learning Systems

by Sayed Hadi Hashemi, et al.

State-of-the-art machine learning systems rely on graph-based models, and distributed training of these models is the norm in AI-powered production pipelines. The performance of these communication-heavy systems depends on effectively overlapping communication with computation. While this overlap challenge has been addressed in systems with simpler model representations, it remains an open problem in graph-based models. In this work, we develop a system for communication scheduling that realizes near-optimal overlap of communication and computation in graph-based models. Our system is implemented over TensorFlow and requires no changes to the model or developer inputs. It improves throughput by up to 82% in inference and 20% in training, while also reducing the straggler effect by up to 2.8x. Part of our implementation has already been merged into the TensorFlow codebase; the rest is publicly available.
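The core idea of overlapping communication with computation can be illustrated with a toy sketch (this is not the paper's implementation, and the layer/priority names are hypothetical): parameter transfers are issued on a background thread in the order the forward pass will consume them, so the layer computed next is rarely blocked waiting on a later transfer.

```python
import threading
import time

# Toy sketch of communication scheduling: fetch each layer's parameters in
# the order the computation needs them, on a background "network" thread,
# so transfers for later layers overlap with computation of earlier ones.

def ordered_transfers(layers):
    """Order transfers by the step at which each layer needs its weights."""
    return sorted(layers, key=lambda layer: layer["needed_at"])

def run_iteration(layers, transfer_time=0.01, compute_time=0.02):
    ready = {layer["name"]: threading.Event() for layer in layers}

    def communicator():
        # Background thread simulating the network: transfers are issued in
        # consumption order, which is what enables the overlap.
        for layer in ordered_transfers(layers):
            time.sleep(transfer_time)       # simulated parameter transfer
            ready[layer["name"]].set()

    threading.Thread(target=communicator, daemon=True).start()

    executed = []
    for layer in sorted(layers, key=lambda l: l["needed_at"]):
        ready[layer["name"]].wait()         # blocks only if transfer lags
        time.sleep(compute_time)            # simulated layer computation
        executed.append(layer["name"])
    return executed

layers = [{"name": "fc2", "needed_at": 1}, {"name": "fc1", "needed_at": 0}]
print(run_iteration(layers))  # → ['fc1', 'fc2']
```

With a poorly chosen transfer order (e.g. fetching `fc2` before `fc1`), the first layer's computation would stall behind an unrelated transfer; ordering transfers by when their consumers run is the scheduling intuition the abstract refers to.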


