SparCML: High-Performance Sparse Communication for Machine Learning

by Cedric Renggli, et al.

One of the main drivers behind the rapid recent advances in machine learning has been the availability of efficient system support. This support comes both through hardware acceleration and in the form of efficient software frameworks and programming models. Despite significant progress, scaling compute-intensive machine learning workloads to a large number of compute nodes is still a challenging task, with significant latency and bandwidth demands. In this paper, we address this challenge by proposing SparCML, a general, scalable communication layer for machine learning applications. SparCML is built on the observation that many distributed machine learning algorithms either have naturally sparse communication patterns, or have updates which can be sparsified in a structured way for improved performance, without any convergence or accuracy loss. To exploit this insight, we design and implement a set of communication-efficient protocols for sparse input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations by allowing processes to contribute sparse input data vectors of heterogeneous sizes. We call these operations sparse-input collectives, and present efficient practical algorithms with strong theoretical bounds on their running time and communication cost. Our generic communication layer is enriched with additional features, such as support for non-blocking (asynchronous) operations and support for low-precision data representations. We validate our algorithmic results experimentally on a range of large-scale machine learning applications and target architectures, showing that we can leverage sparsity for order-of-magnitude runtime savings, compared to state-of-the-art methods and frameworks.
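
To make the idea of a sparse-input collective concrete, the following is a minimal single-process sketch of a sum-allreduce in which each simulated rank contributes a sparse vector of a different size. It is not the SparCML implementation or its API: the names `SparseVec`, `top_k_sparsify`, `sparse_sum`, and `sparse_allreduce_sum` are hypothetical, the index-to-value dictionary representation is an assumption, and top-k magnitude selection is only one possible sparsification scheme, chosen here for illustration.

```python
from typing import Dict, List

# Hypothetical representation: index -> nonzero value.
SparseVec = Dict[int, float]


def top_k_sparsify(dense: List[float], k: int) -> SparseVec:
    """Keep the k entries of largest magnitude (an assumed sparsification scheme)."""
    order = sorted(range(len(dense)), key=lambda i: abs(dense[i]), reverse=True)
    return {i: dense[i] for i in sorted(order[:k]) if dense[i] != 0.0}


def sparse_sum(a: SparseVec, b: SparseVec) -> SparseVec:
    """Sum two sparse vectors; indices present in only one input are copied over."""
    out = dict(a)
    for i, v in b.items():
        out[i] = out.get(i, 0.0) + v
    return out


def sparse_allreduce_sum(contributions: List[SparseVec]) -> List[SparseVec]:
    """Simulate an allreduce: every rank ends up with the sum of all inputs.

    A real communication layer would exchange data between processes;
    here the ranks are just list positions in a single process.
    """
    total: SparseVec = {}
    for vec in contributions:
        total = sparse_sum(total, vec)
    return [dict(total) for _ in contributions]


if __name__ == "__main__":
    # Three simulated ranks with inputs of heterogeneous sizes.
    ranks = [{0: 1.0, 5: 2.0}, {5: 3.0}, {0: 0.5, 2: -1.0, 7: 4.0}]
    for rank, result in enumerate(sparse_allreduce_sum(ranks)):
        print(rank, sorted(result.items()))
```

The sketch only shows the semantics of the operation (heterogeneous sparse inputs, identical reduced output on every rank); it says nothing about the communication cost or running-time bounds discussed in the abstract.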




Hoplite: Efficient Collective Communication for Task-Based Distributed Systems

Collective communication systems such as MPI offer high performance grou...

Relaxed Scheduling for Scalable Belief Propagation

The ability to leverage large-scale hardware parallelism has been one of...

Addressing Algorithmic Bottlenecks in Elastic Machine Learning with Chicle

Distributed machine learning training is one of the most common and impo...

Multiplierless and Sparse Machine Learning based on Margin Propagation Networks

The new generation of machine learning processors have evolved from mult...

Quantizing data for distributed learning

We consider machine learning applications that train a model by leveragi...

BrainSlug: Transparent Acceleration of Deep Learning Through Depth-First Parallelism

Neural network frameworks such as PyTorch and TensorFlow are the workhor...

Towards Geo-Distributed Machine Learning

Latency to end-users and regulatory requirements push large companies to...
