Availability of a distributed computer system with failures

Acta Informatica

Summary

A model for distributed systems with failing components is presented. Each node may fail and during its recovery the load is distributed to other nodes that are up. The model assumes periodic checkpointing for error recovery and testing of the status of other nodes for the distribution of load. We consider the availability of a node, which is the proportion of time a node is available for processing, as the performance measure. A methodology for optimizing the availability of a node with respect to the checkpointing and testing intervals is given. A decomposition approach that uses the steady-state flow balance condition to estimate the load at a node is proposed. Numerical examples are presented to demonstrate the usefulness of the technique. For the case in which all nodes are identical, closed form solutions are obtained.
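
As a rough illustration of what optimizing availability with respect to the checkpointing interval can look like, the sketch below uses a textbook first-order approximation (often attributed to Young), not the model developed in this article. It ignores load redistribution and the testing interval entirely. The function names and the parameters `checkpoint_cost`, `failure_rate`, and `recovery_time` are assumptions introduced here for illustration only.

```python
import math


def approx_availability(interval, checkpoint_cost, failure_rate, recovery_time):
    """First-order estimate of the fraction of time spent on useful work.

    Assumption (not from the article): failures are rare, each failure
    loses on average half an interval of work plus a fixed recovery time,
    and each interval pays one checkpoint cost.
    """
    checkpoint_overhead = checkpoint_cost / interval
    failure_overhead = failure_rate * (interval / 2.0 + recovery_time)
    return 1.0 - checkpoint_overhead - failure_overhead


def best_interval(checkpoint_cost, failure_rate, recovery_time, candidates):
    """Pick the candidate interval with the highest estimated availability."""
    return max(
        candidates,
        key=lambda t: approx_availability(
            t, checkpoint_cost, failure_rate, recovery_time
        ),
    )


def young_interval(checkpoint_cost, failure_rate):
    """Closed-form maximizer of the first-order estimate above."""
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to exercise the functions.
    cost, rate, recovery = 0.5, 0.01, 2.0
    grid = [i * 0.1 for i in range(1, 1001)]  # intervals from 0.1 to 100.0
    t_grid = best_interval(cost, rate, recovery, grid)
    t_closed = young_interval(cost, rate)
    print(t_grid, t_closed, approx_availability(t_closed, cost, rate, recovery))
```

Under this simplified model, a grid search over candidate intervals and the closed-form expression should agree to within the grid spacing. The recovery time shifts the achievable availability but not the location of the optimum.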

Additional information

This research was performed while David Finkel and Satish Tripathi were visiting ISEM. Satish Tripathi's research was supported in part by grants from NSF (grant no. DCR-84-05235) and NASA (grant no. NAG5-235), and by Université de Paris-Sud.

About this article

Cite this article

Gelenbe, E., Finkel, D. & Tripathi, S.K. Availability of a distributed computer system with failures. Acta Informatica 23, 643–655 (1986). https://doi.org/10.1007/BF00264311

  • DOI: https://doi.org/10.1007/BF00264311
