Publication Details
Fault Recovery for Coarse-Grained TMR Soft-Core Processor Using Partial Reconfiguration and State Synchronization
Kotásek Zdeněk, doc. Ing., CSc.
TMR, fault recovery, state synchronization, processor, FPGA reconfiguration
SRAM FPGAs are increasingly being integrated into safety-critical systems.
These digital circuits can provide a suitable platform for implementing
fault-tolerant systems that balance the trade-offs between performance,
reliability, cost and hardware resources. However, SRAM technology is vulnerable
to radiation-induced faults, mainly to the Single Event Upset (SEU) effect. The
SEU can cause "bitflip" faults in SRAM memory cells, which may affect internal
FPGA routing (clock and reset signals), user memory (flip-flops, block RAM) and
the functionality of the implemented circuits. SEU mitigation must therefore be
built into a safety-critical design to achieve the required system reliability
in a harsh environment. An SEU mitigation strategy may combine hardware
redundancy and Partial Dynamic Reconfiguration (PDR) in order to implement error
detection, self-repair and fault recovery mechanisms in the system. With respect to
the compromise between system reliability and resource overhead, various
hardware redundancy schemes can be used. The most widely used form is Triple
Modular Redundancy (TMR), which can be applied at different granularity levels
in the system design. Coarse-grained TMR and PDR are often combined in one
reconfigurable architecture. The time between an SEU occurrence and the
completion of fault recovery becomes a crucial parameter because the reliability
of a TMR system with one failed replica is worse than the reliability of an
unprotected system.
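
The comparison with an unprotected system can be illustrated with a short derivation. It relies on textbook assumptions that the abstract does not state: replica failures are independent, every replica has the same reliability R with 0 < R < 1, and the voter is ideal. The symbol names are chosen here for illustration only.

```latex
% Assumptions (not from the source): independent failures, identical replica
% reliability R with 0 < R < 1, ideal (fault-free) voter.
\begin{align*}
R_{\text{TMR}} &= R^3 + 3R^2(1-R) = 3R^2 - 2R^3
  && \text{(at least two of three replicas work)} \\
R_{\text{degraded}} &= R^2
  && \text{(one replica failed, both remaining must work)} \\
R_{\text{degraded}} - R &= -R(1-R) < 0
  && \text{(so } R_{\text{degraded}} < R \text{)}
\end{align*}
```

For example, with R = 0.9 the intact TMR reaches 0.972, while the degraded TMR drops to 0.81, below the 0.9 of a single unprotected replica. Under these assumptions, the longer a failed replica stays unrepaired, the longer the system operates in the weaker configuration.
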
The fault recovery process can generally be divided into three phases: 1) fault
detection, 2) fault removal by reconfiguration of the region containing the
replica identified as faulty, and 3) state synchronization, which brings the
reconfigured replica into an operating state consistent with the other correctly operating
replicas. Combining TMR and PDR is also an approach often adopted by
fault mitigation methods designed for soft-core processors. The processor state
is stored in internal memories and various architectural registers. After
a faulty processor replica is reconfigured, its internal registers holding the
processor state need to be synchronized with their up-to-date copies from the
other processor replicas that were operating correctly. In this paper, we propose
a fault recovery mechanism for the NEO430 soft-core processor and demonstrate
that a fault recovery mechanism for a soft-core processor can be implemented
with the state synchronization logic embedded into the processor architecture
and with non-blocking CPU execution that is aware of the fault recovery phases.
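
The three recovery phases described above can be sketched as a small software model. This is a toy illustration only, not the NEO430 implementation from the paper: the `Replica` class, its four 16-bit registers, the `step` "instruction" and all function names are hypothetical. For simplicity the model performs recovery between execution steps, so it does not capture the non-blocking CPU execution the paper proposes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three register values."""
    return (a & b) | (a & c) | (b & c)


@dataclass
class Replica:
    """Hypothetical processor replica with four 16-bit registers."""
    regs: List[int] = field(default_factory=lambda: [0] * 4)

    def step(self) -> None:
        # Toy "instruction": increment r0 and accumulate it into r1.
        self.regs[0] = (self.regs[0] + 1) & 0xFFFF
        self.regs[1] = (self.regs[1] + self.regs[0]) & 0xFFFF


def voted_state(replicas: List[Replica]) -> List[int]:
    """Register-by-register majority of the three replica states."""
    n = len(replicas[0].regs)
    return [majority(*(r.regs[i] for r in replicas)) for i in range(n)]


def detect_faulty(replicas: List[Replica]) -> Optional[int]:
    """Phase 1: index of the replica that disagrees with the vote, if any."""
    voted = voted_state(replicas)
    for idx, replica in enumerate(replicas):
        if replica.regs != voted:
            return idx
    return None


def reconfigure(replicas: List[Replica], idx: int) -> None:
    """Phase 2: model reconfiguration as a fresh replica whose state is reset."""
    replicas[idx] = Replica()


def synchronize(replicas: List[Replica], idx: int) -> None:
    """Phase 3: load the voted state into the reconfigured replica.

    The two healthy replicas agree, so the majority equals their value
    regardless of what the reconfigured replica currently holds.
    """
    replicas[idx].regs = voted_state(replicas)


def run_demo():
    replicas = [Replica() for _ in range(3)]
    for _ in range(5):
        for r in replicas:
            r.step()
    replicas[1].regs[1] ^= 0x0004  # inject an SEU-like bit flip into replica 1
    faulty = detect_faulty(replicas)
    if faulty is not None:
        reconfigure(replicas, faulty)
        synchronize(replicas, faulty)
    for _ in range(3):  # all three replicas continue in lockstep
        for r in replicas:
            r.step()
    return faulty, [r.regs for r in replicas]
```

Calling `run_demo()` injects a single bit flip into one replica, identifies it by voting, resets it, and restores its registers from the voted state, after which all three replicas hold identical state again. In a real design the voter and the synchronization path would be hardware, and the reset in phase 2 would be a partial bitstream write to the faulty region.
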
@inproceedings{BUT164064,
author="Karel {Szurman} and Zdeněk {Kotásek}",
title="Fault Recovery for Coarse-Grained TMR Soft-Core Processor Using Partial Reconfiguration and State Synchronization",
booktitle="Proceedings of the 7th Prague Embedded Systems Workshop",
year="2019",
pages="6--7",
publisher="Faculty of Information Technology, Czech Technical University",
address="Roztoky u Prahy",
isbn="978-80-01-06607-2",
url="https://www.fit.vut.cz/research/publication/12002/"
}