Availability of a distributed computer system with failures

Gelenbe, Erol; Finkel, David; Tripathi, Satish K.

doi:10.1007/BF00264311


Published: November 1986

Volume 23, pages 643–655, (1986)

Acta Informatica


Summary

A model for distributed systems with failing components is presented. Each node may fail and during its recovery the load is distributed to other nodes that are up. The model assumes periodic checkpointing for error recovery and testing of the status of other nodes for the distribution of load. We consider the availability of a node, which is the proportion of time a node is available for processing, as the performance measure. A methodology for optimizing the availability of a node with respect to the checkpointing and testing intervals is given. A decomposition approach that uses the steady-state flow balance condition to estimate the load at a node is proposed. Numerical examples are presented to demonstrate the usefulness of the technique. For the case in which all nodes are identical, closed form solutions are obtained.
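As a rough illustration of what optimizing availability with respect to the checkpointing interval can look like, the sketch below uses a textbook first-order approximation for a single node. It is not the model of this article: the formula, the function names, and the parameters (`checkpoint_cost`, `failure_rate`, `restart_cost`) are all assumptions made for the example, and load redistribution and the testing interval are ignored. Under these assumptions the estimate is maximized at an interval of sqrt(2 × checkpoint cost / failure rate), which the grid search should approximately reproduce.

```python
import math


def availability(interval, checkpoint_cost, failure_rate, restart_cost):
    """First-order estimate of the fraction of time a node does useful work.

    Assumed toy model (not taken from the article): one checkpoint of
    duration `checkpoint_cost` per `interval` of work, failures at rate
    `failure_rate`, and each failure costing `restart_cost` plus, on
    average, half an interval of lost work.
    """
    checkpoint_overhead = checkpoint_cost / interval
    failure_overhead = failure_rate * (interval / 2.0 + restart_cost)
    return 1.0 - checkpoint_overhead - failure_overhead


def best_interval_grid(checkpoint_cost, failure_rate, restart_cost,
                       lo=0.1, hi=1000.0, steps=100_000):
    """Return the grid point in [lo, hi] that maximizes the estimate."""
    candidates = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(
        candidates,
        key=lambda t: availability(t, checkpoint_cost, failure_rate, restart_cost),
    )


def best_interval_closed_form(checkpoint_cost, failure_rate):
    """Maximizer of the toy estimate above: sqrt(2 * C / lambda)."""
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    c, lam, r = 2.0, 0.001, 5.0
    t_grid = best_interval_grid(c, lam, r)
    t_exact = best_interval_closed_form(c, lam)
    print(f"grid search interval:  {t_grid:.3f}")
    print(f"closed-form interval:  {t_exact:.3f}")
    print(f"estimated availability: {availability(t_exact, c, lam, r):.4f}")
```

The restart cost shifts the achievable availability but not the location of the optimum in this approximation, which is why the closed form does not depend on it.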


References

Baccelli, F.: Analysis of a service facility with periodic checkpointing. Acta Inf. 15, 67–81 (1981)
Bouchet, P.: Procédures de reprise dans les systèmes de gestion de base de données réparties. Acta Inf. 11, 305–340 (1979)
Chandy, K.M., Ramamoorthy, C.V.: Rollback and recovery strategies for computer programs. IEEE Trans. Comput. 6, 546–556 (1972)
Chandy, K.M.: A survey of analytic models of rollback and recovery strategies. Computer 5, 40–47 (1975)
Chandy, K.M., Browne, J.C., Dissly, C.W., Uhrig, W.R.: Analytical models for rollback and recovery strategies in data base systems. IEEE Trans. Software Eng. 1, 100–110 (1975)
Gelenbe, E., Derochette, D.: Performance of rollback recovery systems under intermittent failures. Commun. ACM 21, 493–499 (1978)
Gelenbe, E.: On the optimum checkpoint interval. J. ACM 26, 259–270 (1979)
Krishna, C.M., Shin, K.G., Lee, Y.-H.: Optimization criteria for checkpoint placement. Commun. ACM 27, 1008–1012 (1984)
Tripathi, S.K., Finkel, D., Gelenbe, E.: Load Sharing in Distributed Systems with Failures. ISEM Research Report no. 30, Université de Paris-Sud (1985)

Author information

Authors and Affiliations

Université de Paris-Sud, 91405, Orsay, France
Erol Gelenbe
Department of Mathematics, Bucknell University, Lewisburg, PA, USA
David Finkel
Systems Design and Analysis Group, Department of Computer Science, University of Maryland, College Park, MD, USA
Satish K. Tripathi


Additional information

This research was performed while David Finkel and Satish Tripathi were visiting ISEM. Satish Tripathi's research was supported in part by grants from NSF (grant no. DCR-84-05235) and NASA (grant no. NAG5-235), and by Université de Paris-Sud.


About this article

Cite this article

Gelenbe, E., Finkel, D. & Tripathi, S.K. Availability of a distributed computer system with failures. Acta Informatica 23, 643–655 (1986). https://doi.org/10.1007/BF00264311


Received: 10 April 1986
Issue Date: November 1986
DOI: https://doi.org/10.1007/BF00264311
