COUNTDOWN Slack: a Run-time Library to Reduce Energy Footprint in Large-scale MPI Applications

by Daniele Cesarini, et al.

The power consumption of supercomputers is a major challenge for system owners, users, and society. It limits the capacity of system installations, requires large cooling infrastructures, and causes a large carbon footprint. Reducing power during application execution without changing the application source code or increasing time-to-completion is highly desirable in real-life high-performance computing scenarios. The power management run-time frameworks proposed in the last decade assume that the duration of communication and application phases in an MPI application can be predicted and used at run-time to trade off communication slack against power consumption. In this manuscript, we first show that this assumption is too general and leads to mispredictions that slow down applications, thereby jeopardizing the claimed benefits. We then propose a new approach based on (i) the separation of communication phases and slack during MPI calls and (ii) a timeout algorithm to cope with the hardware power management latency. Together, these make it possible to achieve performance-neutral power saving in MPI applications without requiring labor-intensive and risky application source code modifications. We validate our approach in a tier-1 production environment with widely adopted scientific applications. Our approach has a time-to-completion overhead lower than 1% [...] in communication phases to achieve an average energy saving of 10% [...] on large-scale application runs, the proposed approach achieves 22% [...] saving with an overhead of only 0.4% [...] approaches, COUNTDOWN Slack is the only one that always leads to an energy saving with negligible overhead (<3%).
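The timeout idea in point (ii) can be illustrated with a toy model. The sketch below is not the COUNTDOWN Slack implementation: the function name, the `PhaseStats` type, the call durations, and the timeout value are all hypothetical, and the model ignores the separation of communication and slack in point (i). It only shows why a timeout filters out short MPI calls, for which a frequency change would cost more than it saves.

```python
from dataclasses import dataclass


@dataclass
class PhaseStats:
    low_power_s: float = 0.0   # time spent at reduced frequency
    transitions: int = 0       # number of frequency-down requests issued


def apply_timeout_policy(mpi_call_durations_s, timeout_s):
    """Toy model: lower frequency only in MPI calls that outlast the timeout."""
    stats = PhaseStats()
    for duration in mpi_call_durations_s:
        if duration > timeout_s:
            # The timer fired while still inside the call: the remainder
            # of the call is treated as slack and spent in low power.
            stats.low_power_s += duration - timeout_s
            stats.transitions += 1
        # Shorter calls return before the timer fires, so no request is made.
    return stats


# Hypothetical MPI call durations in seconds (assumed values, not measurements).
calls = [0.0001, 0.0003, 0.02, 0.15]
stats = apply_timeout_policy(calls, timeout_s=0.0005)
print(stats.transitions, round(stats.low_power_s, 4))
```

In this model, raising `timeout_s` reduces the number of frequency transitions at the cost of less time in the low-power state, which is the trade-off a timeout has to balance against the hardware power management latency.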




