RepTFD: Replay Based Transient Fault Detection

by Lei Li, et al.

Advances in IC process technology make future chip multiprocessors (CMPs) increasingly vulnerable to transient faults. To detect transient faults, previous core-level schemes provide redundancy for each core separately. As a result, transient faults in the uncore parts, which consume over 50% of the area of a modern CMP, may escape detection. This paper proposes RepTFD, the first core-level transient fault detection scheme with 100% coverage. Instead of providing redundancy for each core separately, RepTFD provides redundancy for a group of cores as a whole. Specifically, it replays the execution of the checked group of cores on a redundant group of cores. By comparing the execution results of the two groups, all malignant transient faults can be caught. Moreover, RepTFD adopts a novel pending-period-based record-replay approach, which greatly reduces the number of execution orders that need to be enforced in the replay run. As a result, RepTFD incurs only 4.76% performance overhead compared with normal execution without fault tolerance, according to our experiments on the RTL design of an industrial CMP named Godson-3. In addition, RepTFD consumes only about 0.83% of the area of Godson-3 and needs only trivial modifications to its existing components.
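The replay-and-compare idea can be illustrated with a toy software model. This is only a minimal sketch under stated assumptions, not RepTFD's hardware mechanism: the two-core programs, the `run` and `detect` helpers, the shared variables `x` and `y`, and the single-bit fault model are all hypothetical, and the pending-period optimization is not modeled. The checked run executes under a recorded interleaving, the replay run enforces the same interleaving, and any mismatch in the final state signals a fault.

```python
from typing import Callable, Dict, List, Optional

Memory = Dict[str, int]
Op = Callable[[Memory], None]


def inc_x(mem: Memory) -> None:
    mem["x"] += 1


def double_x(mem: Memory) -> None:
    mem["x"] *= 2


def copy_x_to_y(mem: Memory) -> None:
    mem["y"] = mem["x"]


# Hypothetical per-core instruction streams sharing one memory.
programs: List[List[Op]] = [[inc_x, copy_x_to_y], [double_x, inc_x]]

# Recorded global order of the checked run: the core id that executes each step.
recorded_order: List[int] = [0, 1, 1, 0]


def run(progs: List[List[Op]], order: List[int],
        fault_at: Optional[int] = None) -> Memory:
    """Execute the per-core programs in the given global order.

    If fault_at is set, a single bit of x is flipped after that step
    (an assumed, simplified transient-fault model).
    """
    mem: Memory = {"x": 0, "y": 0}
    pcs = [0] * len(progs)  # per-core program counters
    for step, core in enumerate(order):
        progs[core][pcs[core]](mem)
        pcs[core] += 1
        if step == fault_at:
            mem["x"] ^= 1 << 3  # hypothetical single-bit upset
    return mem


def detect(fault_at: Optional[int] = None) -> bool:
    """Return True if the checked run and the fault-free replay disagree."""
    checked = run(programs, recorded_order, fault_at)
    replayed = run(programs, recorded_order)  # replay enforces the same order
    return checked != replayed
```

In this sketch, `detect()` returns `False` for a fault-free run and `detect(fault_at=1)` returns `True`. Running the same programs under a different interleaving, such as `[1, 0, 0, 1]`, yields a different final state even without a fault, which is why the replay run must enforce the recorded execution order before results can be compared.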




