ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted resource clusterS

05/11/2021
by Federica Filippini, et al.

Artificial Intelligence (AI) and Deep Learning (DL) algorithms are currently applied to a wide range of products and solutions. DL training jobs are highly resource-demanding and benefit greatly from AI accelerators (e.g., GPUs). However, the effective management of GPU-powered clusters comes with great challenges. Among these, efficient scheduling and resource allocation solutions are crucial to maximize performance and minimize data center operational costs. In this paper we propose ANDREAS, an advanced scheduling solution that tackles these problems jointly, aiming at optimizing DL training runtime workloads and their energy consumption in accelerated clusters. Experiments based on simulation demonstrate that we can achieve a cost reduction between 30% and 62%, while the validation on a real cluster shows a worst-case deviation below 13% between actual and predicted costs, proving the effectiveness of the ANDREAS solution in practical scenarios.


