Estimating Silent Data Corruption Rates Using a Two-Level Model

04/28/2020
by   Siva Kumar Sastry Hari, et al.

Architects of high-performance and safety-critical systems must accurately evaluate the application-level silent data corruption (SDC) rates that soft errors cause in processors. Such an evaluation requires tracking error propagation all the way from particle strikes on low-level state up to the program output. Existing approaches that rely on low-level simulations with fault injection are too slow to evaluate full applications, while application-level accelerated testing in particle beams is often impractical. We present a new two-level methodology for application resilience evaluation that overcomes these challenges. The proposed approach decomposes application failure rate estimation into (1) identifying how particle strikes in low-level unprotected state manifest at the architecture level, and (2) measuring how such architecture-level manifestations propagate to the program output. We demonstrate the effectiveness of this approach on GPU architectures. We also show that using just one of the two steps can overestimate SDC rates and produce different trends; the composition of the two is needed for accurate reliability modeling.
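
The two-step decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that step (1) yields a rate for each kind of architecture-level manifestation and that step (2) yields the probability that each kind reaches the program output as an SDC. The function name, the manifestation labels, the FIT units, and all numbers are hypothetical.

```python
from typing import Dict


def estimate_sdc_rate(manifestation_rates: Dict[str, float],
                      sdc_probability: Dict[str, float]) -> float:
    """Compose the two levels (illustrative assumption, not the paper's code).

    manifestation_rates: step (1), rate at which low-level particle strikes
        show up as each kind of architecture-level error (assumed unit: FIT).
    sdc_probability: step (2), probability that each kind of
        architecture-level error propagates to the program output as an SDC.
    """
    total = 0.0
    for kind, rate in manifestation_rates.items():
        # Kinds with no measured propagation probability contribute nothing.
        total += rate * sdc_probability.get(kind, 0.0)
    return total


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
manifestation_rates = {"wrong_register_value": 10.0, "wrong_memory_address": 2.0}
sdc_probability = {"wrong_register_value": 0.3, "wrong_memory_address": 0.05}

composed = estimate_sdc_rate(manifestation_rates, sdc_probability)
# Using step (1) alone treats every manifestation as an SDC.
level_one_only = sum(manifestation_rates.values())
```

Under these assumed inputs, the step-(1)-only figure is an upper bound on the composed estimate, which is one way to read the abstract's remark that a single step can overestimate SDC rates.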


research
06/04/2022

Fast and Accurate Error Simulation for CNNs against Soft Errors

The great quest for adopting AI-based computation for safety-/mission-cr...
research
11/23/2022

Characterizing a Neutron-Induced Fault Model for Deep Neural Networks

The reliability evaluation of Deep Neural Networks (DNNs) executed on Gr...
research
11/05/2019

Soft Error Resilience and Failure Recovery for Continuum Dynamics Applications

The persistently growing resilience concerns of large-scale computing sy...
research
12/07/2018

PARIS: Predicting Application Resilience Using Machine Learning

Extreme-scale scientific applications can be more vulnerable to soft err...
research
06/17/2022

Experimental evaluation of neutron-induced errors on a multicore RISC-V platform

RISC-V architectures have gained importance in the last years due to the...
research
05/03/2020

Behind the Last Line of Defense – Surviving SoC Faults and Intrusions

Today, leveraging the enormous modular power, diversity and flexibility ...
research
10/17/2021

Characterizing and Improving the Resilience of Accelerators in Autonomous Robots

Motion planning is a computationally intensive and well-studied problem ...
