Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications

11/05/2019
by Li Tan, et al.

Growing resilience concerns in today's large-scale computing systems call not only for generic fault tolerance approaches but also for application-level resilience, because of demanding efficiency targets and varied domain-specific requirements. Scientific applications within a particular domain generally obey that domain's conservation laws, which can serve as an error detection criterion for studying the resilience of applications that share similar program characteristics. Achieving application resilience nevertheless poses two challenges: (a) how to identify the invariants of a given domain of applications from the known conservation laws, and (b) how to use those invariants to efficiently detect and recover from failures during application runs. In this work, we target the continuum dynamics software packages FleCSALE [1] and CODY [2], both of which maintain intrinsic invariants during computation. We study their resilience to soft errors online, with errors injected by an open-source fault injector, and investigate opportunities for non-intrusive, lightweight failure recovery based on checksum-based invariant checking. We propose a checksum-retry approach to meet these goals; experimental results on a virtualized platform with extensive fault injection campaigns demonstrate its effectiveness and efficiency.
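To make the checksum-retry idea concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation and not code from FleCSALE or CODY. It assumes a toy 1-D periodic diffusion step whose conserved quantity is the sum of all cells, uses that sum as the checksum invariant, and recomputes a step from the last verified state when the invariant is violated. All names (`diffuse_step`, `flip_bit`, `checked_step`) and the tolerance and retry parameters are hypothetical.

```python
import math
import struct


def diffuse_step(u, alpha=0.1):
    """One explicit diffusion step on a periodic 1-D grid.

    The update only moves quantity between neighbouring cells, so the
    sum of all cells is conserved up to floating-point rounding.
    """
    n = len(u)
    return [
        u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n])
        for i in range(n)
    ]


def flip_bit(x, bit):
    """Flip one bit of a float64 to emulate a soft error (illustrative only)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def checked_step(u, step, tol=1e-9, max_retries=3):
    """Run `step(u)` and accept the result only if the checksum is preserved.

    The checksum is the conserved sum of the state. On a mismatch the step
    is recomputed from the unchanged input `u` (the last verified state).
    Returns the accepted state and the number of retries that were needed.
    """
    expected = math.fsum(u)
    for attempt in range(max_retries + 1):
        candidate = step(u)
        drift = abs(math.fsum(candidate) - expected)
        # A NaN drift compares False here, so NaN corruption is rejected too.
        if drift <= tol * max(1.0, abs(expected)):
            return candidate, attempt
    raise RuntimeError("invariant still violated after retries")
```

In this sketch the retry is cheap because only the failed step is recomputed, and the check is non-intrusive because it wraps the step function rather than modifying it. A checksum of this kind cannot catch corruptions that happen to preserve the sum, which is one reason the tolerance and the choice of invariant matter in practice.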


