Dependability in a Multi-tenant Multi-framework Deep Learning as-a-Service Platform

05/17/2018
by Scott Boag, et al.

Deep learning (DL), a form of machine learning, is becoming increasingly popular in several application domains. As a result, cloud-based Deep Learning as a Service (DLaaS) platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage, and execute DL training jobs at scale. This paper explores dependability in the context of a DLaaS platform used at IBM. We begin by explaining what distinguishes DL training workloads and which features ensure dependability in this context. We then describe the architecture, design, and implementation of a cloud-based orchestration system for DL training. We show how this system has been architected with dependability in mind while remaining horizontally scalable, elastic, flexible, and efficient. We also present an initial empirical evaluation of the overheads introduced by our platform and discuss the tradeoffs between efficiency and dependability.


