MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

04/20/2019
by Rohan Garg, et al.

Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for every combination of MPI implementation and network interconnect. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base that supports all such combinations. These agnostic properties mean that one can checkpoint an MPI application under one MPI implementation, perhaps over TCP, and then restart it under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. The technique is based on a novel "split-process" approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications for each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant in two settings: checkpoint-restart within a single host, and a local MPI computation migrated to a remote cluster, compared against an ordinary MPI computation running natively on that same remote cluster.
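
The agnostic restart described above can be pictured with a toy model. The following is a conceptual sketch only, not MANA's implementation: it assumes that only application-side state is saved at checkpoint time and that the communication layer is simply replaced on restart. Every name in it (`UpperHalf`, `BackendA`, `BackendB`, `allreduce_sum`, `demo`) is hypothetical, and the two "backends" are plain Python stand-ins rather than real MPI libraries or interconnects.

```python
import json


class BackendA:
    """Hypothetical stand-in for 'MPI implementation A over TCP'."""
    name = "impl-A/tcp"

    def allreduce_sum(self, values):
        # A real MPI library would combine values across ranks here.
        return sum(values)


class BackendB:
    """Hypothetical stand-in for 'MPI implementation B over InfiniBand'."""
    name = "impl-B/infiniband"

    def allreduce_sum(self, values):
        return sum(values)


class UpperHalf:
    """Application state: the only part this toy model checkpoints."""

    def __init__(self, step=0, total=0):
        self.step = step
        self.total = total

    def run_step(self, backend, contributions):
        # The application reaches the communication layer only through
        # this narrow interface, so the backend can be swapped freely.
        self.total += backend.allreduce_sum(contributions)
        self.step += 1

    def checkpoint(self):
        # Assumption: no backend state is written to the checkpoint image.
        return json.dumps({"step": self.step, "total": self.total})

    @classmethod
    def restart(cls, image):
        state = json.loads(image)
        return cls(step=state["step"], total=state["total"])


def demo():
    app = UpperHalf()
    app.run_step(BackendA(), [1, 2, 3])      # compute under backend A
    image = app.checkpoint()                 # save application state only
    resumed = UpperHalf.restart(image)       # restore in a fresh "process"
    resumed.run_step(BackendB(), [4, 5])     # continue under backend B
    return resumed.step, resumed.total, image


if __name__ == "__main__":
    step, total, image = demo()
    print(step, total)  # 2 15
```

Because the checkpoint image in this sketch holds no reference to either backend, the resumed computation does not depend on which one produced it. That independence is the property the abstract calls "agnostic". How MANA achieves it inside a single real process is described in the full paper.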

