MOARD: Modeling Application Resilience to Transient Faults on Data Objects

02/13/2021
by Luanzheng Guo, et al.

Understanding application resilience (or error tolerance) in the presence of hardware transient faults on data objects is critical to ensuring computing integrity and enabling efficient application-level fault tolerance mechanisms. However, we lack a method and a tool to quantify application resilience to transient faults on data objects. The traditional method, random fault injection, falls short because it loses data semantics and provides insufficient information on how and where errors are tolerated. In this paper, we introduce a method and a tool (called MOARD) to model and quantify application resilience to transient faults on data objects. Our method is based on systematically quantifying error masking events caused by application-inherent semantics and program constructs. We use MOARD to study how and why errors in data objects can be tolerated by the application. We demonstrate tangible benefits of using MOARD to direct a fault tolerance mechanism to protect data objects.
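To make the idea of an error masking event concrete, here is a minimal Python sketch. It is not MOARD's model or tool; the helper names (`flip_bit`, `masked_fraction`), the 8-bit width, and the three example operations are all assumptions chosen for illustration. It flips each bit of a single input value in turn and reports the fraction of flips whose effect disappears after one operation, which is one simple way a program construct can mask a transient fault.

```python
from typing import Callable


def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (a single-bit transient fault)."""
    return value ^ (1 << bit)


def masked_fraction(op: Callable[[int], int], value: int, width: int = 8) -> float:
    """Fraction of single-bit flips in `value` that leave op's result unchanged.

    Hypothetical helper: it enumerates every bit position exhaustively
    instead of sampling faults at random.
    """
    golden = op(value)  # fault-free result
    masked = sum(
        1 for bit in range(width) if op(flip_bit(value, bit)) == golden
    )
    return masked / width


def shift_right_4(x: int) -> int:
    # Discards the low 4 bits, so flips there cannot reach the result.
    return x >> 4


def keep_low_nibble(x: int) -> int:
    # Clears the high 4 bits, so flips there cannot reach the result.
    return x & 0x0F


def identity(x: int) -> int:
    # Propagates every bit, so no flip is masked.
    return x


if __name__ == "__main__":
    sample = 0b10110101
    for name, op in [
        ("shift_right_4", shift_right_4),
        ("keep_low_nibble", keep_low_nibble),
        ("identity", identity),
    ]:
        print(name, masked_fraction(op, sample))
```

Enumerating bit positions per operation, rather than injecting faults at random, keeps the link between a masked fault and the construct that masked it, which is the kind of information the abstract says random fault injection does not provide.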

research
03/30/2020

Ranger: Boosting Error Resilience of Deep Neural Networks through Range Restriction

With the emerging adoption of deep neural networks (DNNs) in the HPC dom...
research
12/07/2018

PARIS: Predicting Application Resilience Using Machine Learning

Extreme-scale scientific applications can be more vulnerable to soft err...
research
03/14/2023

ISimDL: Importance Sampling-Driven Acceleration of Fault Injection Simulations for Evaluating the Robustness of Deep Learning

Deep Learning (DL) systems have proliferated in many applications, requi...
research
08/03/2018

Characterization and Comparison of Application Resilience for Serial and Parallel Executions

Soft error of exascale application is a challenge problem in modern HPC....
research
03/24/2023

On the Susceptibility of QDI Circuits to Transient Faults

By design, quasi delay-insensitive (QDI) circuits exhibit higher resilie...
research
03/04/2021

Enabling Software Resilience in GPGPU Applications via Partial Thread Protection

Graphics Processing Units (GPUs) are widely used by various applications...
research
05/16/2018

Verifying Programs Under Custom Application-Specific Execution Models

Researchers have recently designed a number of application-specific faul...
