ISSN:
1573-7640
Keywords:
MULTIPROCESSOR; DATA FLOW; FAULT DETECTION; FAULT LOCATION; ALGORITHMS
Source:
Springer Online Journal Archives 1860-2000
Topics:
Computer Science
Notes:
Abstract: Algorithm-Based Fault Tolerance (ABFT) is a well-known technique for achieving fault and error detection in multiprocessor systems. We examine several issues concerning ABFT systems when the data flow information for the underlying multiprocessor computation is available. Our results show that this finer-grained information can be exploited to obtain test schemes involving fewer checks, and in some cases dramatically fewer. We address both the analysis and the design of ABFT systems when the data flow information is available. The analysis problem for a given ABFT system is to determine the fault detectability and the fault locatability of the system, that is, the maximum number of faulty processors that can be detected and located, respectively. We show that the analysis problem can be solved efficiently when the number of faults is fixed. We also address the computational difficulty of this problem when the number of faults is not fixed. The design problem is concerned with constructing a minimal collection of checks that can detect or locate a specified number of faults for a given multiprocessor computation. We examine some special classes of data flow graphs and establish upper and lower bounds on the number of checks needed to detect or locate a given number of faults. We also address the computational difficulty of the design problem for several cases.
Type of Medium:
Electronic Resource
URL:
http://dx.doi.org/10.1023/A:1018793714426