FINJ: A Fault Injection Tool for HPC Systems

07/26/2018
by Alessio Netti, et al.
We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ supports custom workloads and generates anomalous conditions by running fault-triggering executable programs. FINJ can also be integrated seamlessly with most lower-level fault injection tools, allowing users to create and monitor a variety of highly complex and diverse fault conditions in HPC systems that would be difficult to recreate in practice. FINJ is suitable for experiments involving many, potentially interacting nodes, which makes it a versatile design and evaluation tool.

research
07/27/2020

A Machine Learning Approach to Online Fault Classification in HPC Systems

As High-Performance Computing (HPC) systems strive towards the exascale ...
research
10/24/2020

LCFI: A Fault Injection Tool for Studying Lossy Compression Error Propagation in HPC Programs

Error-bounded lossy compression is becoming more and more important to t...
research
01/24/2020

Efficient Fault Injection based on Dynamic HDL Slicing Technique

This work proposes a fault injection methodology where Hardware Descript...
research
06/20/2023

MRFI: An Open Source Multi-Resolution Fault Injection Framework for Neural Network Processing

To ensure resilient neural network processing on even unreliable hardwar...
research
06/22/2019

ZOFI: Zero-Overhead Fault Injection Tool for Fast Transient Fault Coverage Analysis

The experimental evaluation of fault-tolerance studies relies on tools t...
research
10/26/2018

Online Fault Classification in HPC Systems through Machine Learning

As High-Performance Computing (HPC) systems strive towards exascale goal...
research
08/03/2018

Characterization and Comparison of Application Resilience for Serial and Parallel Executions

Soft error of exascale application is a challenge problem in modern HPC....
