Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

10/16/2019
by Davit Buniatyan, et al.

Training and deploying deep learning models in real-world applications requires processing large amounts of data. This becomes challenging when the data grows to a hundred terabytes or even petabyte scale. We introduce a hybrid distributed cloud framework that provides a unified view of multiple clouds and on-premise infrastructure for processing tasks at scale using both CPU and GPU compute instances. The system implements a distributed file system and a failure-tolerant task processing scheduler, both independent of the programming language and deep learning framework used. It makes it possible to use cheap but unstable cloud resources, significantly reducing costs. We demonstrate the scalability of the framework by running pre-processing, distributed training, hyperparameter search, and large-scale inference tasks on 10,000 CPU cores and 300 GPU instances with an overall processing power of 30 petaflops.
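
To make the idea of failure-tolerant scheduling on unstable instances concrete, here is a minimal Python sketch. It is not the paper's scheduler: the names `Preempted`, `run_with_retries`, and `make_flaky_worker` are hypothetical, the workers run in-process, and preemption is simulated deterministically. It shows only the general pattern of re-queueing a task when the instance running it is reclaimed.

```python
from collections import deque


class Preempted(Exception):
    """Raised when a (simulated) cheap instance is reclaimed mid-task."""


def run_with_retries(tasks, worker, max_attempts=3):
    """Run each task, re-queueing it whenever the worker is preempted.

    tasks: iterable of (task_id, payload) pairs.
    worker: callable(payload) -> result; may raise Preempted.
    Returns (results, failed), where failed lists the ids of tasks
    that were preempted on every one of their max_attempts tries.
    """
    # Each queue entry carries its attempt counter, starting at 1.
    queue = deque((task_id, payload, 1) for task_id, payload in tasks)
    results, failed = {}, []
    while queue:
        task_id, payload, attempt = queue.popleft()
        try:
            results[task_id] = worker(payload)
        except Preempted:
            if attempt < max_attempts:
                # Put the task back at the end of the queue for another try.
                queue.append((task_id, payload, attempt + 1))
            else:
                failed.append(task_id)
    return results, failed


def make_flaky_worker(fail_every):
    """Hypothetical worker that is preempted on every fail_every-th call."""
    calls = {"n": 0}

    def worker(x):
        calls["n"] += 1
        if calls["n"] % fail_every == 0:
            raise Preempted()
        return x * x

    return worker


# Five tasks on a worker that loses every third call: all still complete.
results, failed = run_with_retries(
    [(i, i) for i in range(5)], make_flaky_worker(3)
)
```

Because a preempted task simply returns to the queue, the loss of an instance costs only the work of that one attempt, which is what makes cheap, interruptible capacity usable for long batch jobs. A real system would also have to persist the queue and detect lost workers, which this sketch leaves out.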

research
10/12/2018

GPU-Accelerated Robotic Simulation for Distributed Reinforcement Learning

Most Deep Reinforcement Learning (Deep RL) algorithms require a prohibit...
research
08/30/2022

Analysis of Distributed Deep Learning in the Cloud

We aim to resolve this problem by introducing a comprehensive distribute...
research
08/13/2021

Quantifying and Improving Performance of Distributed Deep Learning with Cloud Storage

Cloud computing provides a powerful yet low-cost environment for distrib...
research
06/05/2023

How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study

Training deep learning models in the cloud or on dedicated hardware is e...
research
04/29/2021

Distributed Multigrid Neural Solvers on Megavoxel Domains

We consider the distributed training of large-scale neural networks that...
research
06/19/2021

A Generic Distributed Clustering Framework for Massive Data

In this paper, we introduce a novel Generic distributEd clustEring frame...
research
04/17/2018

Deep Learning on Operational Facility Data Related to Large-Scale Distributed Area Scientific Workflows

Distributed computing platforms provide a robust mechanism to perform la...
