Speeding up Deep Learning with Transient Servers

by Shijian Li, et al.
Institute of Software, Chinese Academy of Sciences
Worcester Polytechnic Institute

Distributed training frameworks, such as TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable (e.g., for rapidly evaluating new model designs), they often come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high costs. We conduct the first large-scale empirical analysis, launching more than a thousand GPU servers of various capacities, aimed at understanding the characteristics of transient GPU servers and their impact on distributed training performance. Our study demonstrates the potential of transient servers, with a speedup of 7.7X and more than 62.9% monetary savings for some cluster configurations. We also identify a number of important challenges and opportunities for redesigning distributed training frameworks to be transient-aware. For example, the dynamic cost and availability characteristics of transient servers suggest the need for frameworks to dynamically change cluster configurations to best take advantage of current conditions.
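To make the cost argument concrete, the following minimal sketch shows why sublinear scalability raises the bill and how a lower per-server price can offset it. The function name `training_cost` and every number below are hypothetical and chosen only to keep the arithmetic easy to follow; none of them come from the paper's measurements. The sketch also assumes servers are billed for the full wall-clock duration and ignores revocations of transient servers.

```python
def training_cost(single_server_hours, n_servers, speedup, price_per_server_hour):
    """Total monetary cost of one training run on a cluster (illustrative model).

    The run takes single_server_hours / speedup wall-clock hours, and each of
    the n_servers is billed for that whole duration.
    """
    wall_clock_hours = single_server_hours / speedup
    return n_servers * wall_clock_hours * price_per_server_hour


# Hypothetical inputs: a 10-hour single-server job at 1.00 per server-hour.
baseline = training_cost(10.0, 1, 1.0, 1.00)   # one on-demand server
on_demand = training_cost(10.0, 4, 3.0, 1.00)  # 4 servers, sublinear 3x speedup
transient = training_cost(10.0, 4, 3.0, 0.30)  # same cluster at an assumed transient price

print(f"baseline:  {baseline:.2f}")
print(f"on-demand: {on_demand:.2f}")
print(f"transient: {transient:.2f}")
```

Under these assumed inputs, the four-server on-demand cluster finishes faster but costs more than the single server, because four servers are paid for while only a 3x speedup is obtained. The same cluster at the assumed transient price costs less than the baseline. A realistic model would also need to account for revocations and changing prices, which this sketch leaves out.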




