Skip to main content
Log in

Checkpointing Distributed Shared Memory

  • Published:
The Journal of Supercomputing Aims and scope Submit manuscript

Abstract

Distributed shared memory (DSM) is a very promising programming model for exploiting the parallelism of distributed memory systems, because it provides a higher level of abstraction than simple message passing. Although the nodes of standard distributed systems exhibit high crash rates only very few DSM environments have some kind of support for fault-tolerance.

In this article, we present a checkpointing mechanism for a DSM system that is efficient and portable. It offers some portability because it is built on top of MPI and uses only the services offered by MPI and a POSIX compliant local file system.

As far as we know, this is the first real implementation of such a scheme for DSM. Along with the description of the algorithm we present experimental results obtained in a cluster of workstations. We hope that our research shows that efficient, transparent and portable checkpointing is viable for DSM systems.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. [Brown94] L. Brown, J. Wu. “Dynamic Snooping in a Fault-Tolerant Distributed Shared Memory”, Proc. 14th Int. Conf. on Distributed Computing Systems, pp. 218–226, 1994.

  2. [Cabillic95] G. Cabillic, G. Muller, and I. Puaut. “The Performance of Consistent Checkpointing in Distributed Shared Memory Systems”, Proc. 14th Symposium on Reliable Distributed Systems, SRDS-14, 1995.

  3. [Cater91] J. Carter, J. Bennet, and W. Zwaenepoel. “Implementation and Performance of Munin”, Proc. 13th ACM Symposium on Operating Systems Principles, pp. 152–164, 1991.

  4. [Cater93] J. B. Carter, A. Cox, S. Dwarkadas, E. N. Elnozahi, D. Johnson, P. Keleher, S. Rodrigues, W. Yu, and W. Zwaenepoel. “Network Multicomputer Using Recoverable Distributed Shared Memory”, Proc. COMPCON'93, 1993.

  5. [Choy95] M. Choy, H. Leong, and M. H. Wong. “On Distributed Object Checkpointing and Recovery”, Proc. of the ACM Principles of Distributed Computing, PODC95, 1995.

  6. [Costa96] M. Costa, P. Guedes, M. Sequeira, N. Neves, and M. Castro “Lightweight Logging for Lazy Release Consistency Consistent Distributed Shared Memory”, Proc. 2nd Usenix Symposium on Operating Systems Design and Implementation, pp 59–73, Seattle, October 1996.

  7. [Elnozahi92] E. N. Elnozahi, D. B. Johnson, and W. Zwaenepoel. “The Performance of Consistent Checkpointing”, Proc. 11th Symp. on Reliable Distributed Systems, pp. 39–47, 1992.

  8. [Eskicioglu95] M. R. Eskicioglu. “A Comprehensive Bibliography of Distributed Shared Memory”, Technical Report TR-95-01, Univ. of New Orleans, May 1995.

  9. [Janakiraman94] G. Janakiramam and Y. Tamir. “Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers”, Proceedings of the 13th Symposium on Reliable Distributed Systems, SRDS-13, pp. 41-51, October 1994.

    Google Scholar 

  10. [Janssens93] B. Janssens and W. K. Fuchs. “Relaxing Consistency in Recoverable Distributed Shared Memory”, Proceedings 23rd Fault-Tolerant Computing Symposium, FTCS-23, pp. 155–163, June 1993.

    Google Scholar 

  11. [Johnson95] K. L. Johnson, M. F. Kaashoek, and D. Wallach. “CRL: High-Performance All-Software Distributed Shared Memory”, Proc. of the 15th Symposium on Operating Systems Principles, 1995.

  12. [Kaashoek92] M. F. Kaashoek, R. Michiels, H. Bal, and A. Tanenbaum. “Transparent Fault-Tolerance in Parallel Orca Programs”, Proc. Symposium on Experiences with Distributed and Multiprocessor Systems III, pp. 297–312, 1992.

  13. [Keleher94] P. Keleher, A. Cox, S. Dwarkadas, and W. Zwaenepoel. “TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems”, Proc. Winter 94 USENIX Conference, pp. 115–131, January 94.

  14. [Kermarrec95] A. M. Kermarrek, G. Cabillic, A. Gefflaut, and C. Morin, I. Puaut. “A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability”, Proc. 25th Fault-Tolerant Computing Symposium, FTCS-25, pp. 289–298, July 1995.

    Google Scholar 

  15. [Kranz93] D. Kranz, K. Johnson, and A. Agarwal. “Integrating Message-Passing and Shared-Memory: Early Experience”, Proc. 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 54–63, May 1993.

  16. [Lenoski90] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Henessy. “The Directory-based Cache Coherence Protocol for the DASH Multiprocessor”, Proc. 17th Annual International Symposium on Computer Architecture, pp. 148–159, 1990.

  17. [Li89] K. Li and P. Hudak. “Memory Coherence in Shared Virtual Memory Systems”, ACMTransactions on Computer Systems, Vol. 7, No. 4, pp. 321–359, November 1989.

    Google Scholar 

  18. [Long95] D. Long, A. Muir, and R. Golding. “A Longitudinal Survey of Internet Host Reliability”, Proc. 14th Symposium on Reliable Distributed Systems, pp. 2–9, September 1995.

  19. [MPI94] MPI Forum. MPI Forum. “A Message Passing Interface Standard”. May 1994, Available on netlib.

  20. [Neves94] N. Neves, M. Castro, and P. Guedes. “A Checkpoint Protocol for an Entry Consistent Shared Memory System”, Proc. of the 13th ACM Symposium on Principles of Distributed Computing, 1994.

  21. [Nitzberg91] B. Nitzberg and V. Lo. “Distributed Shared Memory: A Survey of Issues and Algorithms”, IEEE Computer, Vol. 24 (8), pp. 52–60, 1991.

    Google Scholar 

  22. [ORNL95] data available in: http://www.ccs.ornl.gov/

  23. [Plank94] J. Plank and K. Li. “Performance Results of ickp--A Consistent Checkpointer on the iPSC/ 860”, Proc. of the Scalable High-Performance Computing Conference, Knoxville USA, pp. 686–693, 1994.

  24. [Plank95] J. Plank, M. Beck, G. Kingsley, and K. Li. “Libckpt: Transparent Checkpointing Under Unix”, Usenix Winter 1995 Technical Conference, January 1995.

  25. [Raina92] S. Raina. “Virtual Shared Memory: A Survey of Techniques and Systems”, Technical Report CSTR-92-36, University of Bristol, December 1992.

  26. [Richard93] G. Richard III and M. Singhal. “Using Logging and Asynchronous Checkpointing to Implement Recoverable Distributed Shared Memory”, Proceedings 12th Symposium on Reliable Distributed Systems, SRDS-12, pp. 58–67, October 1993.

    Google Scholar 

  27. [Silva92] L. M. Silva and J. G. Silva. “Global Checkpoints for Distributed Programs”, Proc. 11th Symp. on Reliable Distributed Systems, pp. 155–162, Houston USA, 1992.

  28. [Silva97] L. M. Silva, S. Chapple, and J. G. Silva. “Implementation and Performance of DSMPI”, to appear in the Scientific Programming Journal, 1997, John Wiley & Sons.

  29. [Stumm90A] M. Stumm and S. Zhou. “Algorithms Implementing Distributed Shared Memory”, IEEE Computer, Vol. 23 (5), pp. 54–64, May 1990.

    Google Scholar 

  30. [Stumm90B] M. Stumm and S. Zhou. “Fault-Tolerant Distributed Shared Memory Algorithms”, Proc. 2nd IEEE Symp. on Parallel and Distributed Computing, pp. 719–724, Dec. 1990.

  31. [Suri95] G. Suri, B. Janssens, and W. K. Fuchs. “Reduced Overhead Logging for Rollback Recovery in Distributed Shared Memory”, Proceedings 25 th Fault-Tolerant Computing Symposium, FTCS-25, pp. 279–288, June 1995.

    Google Scholar 

  32. [Wilkinson93] T. J. Wilkinson. “Implementing Fault-Tolerance in a 64-bit Distributed Operating System”, PhD Thesis City University, July 1993.

  33. [Wu89] K. L. Wu and W. K. Fuchs. “Recoverable Distributed Shared Virtual Memory: Memory Coherence and Storage Structures”, Proceedings 19th Fault-Tolerant Computing Symposium, FTCS-19, pp 520–527, 1989.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Rights and permissions

Reprints and permissions

About this article

Cite this article

Silva, L.M., Silva, J.G. Checkpointing Distributed Shared Memory. The Journal of Supercomputing 11, 137–158 (1997). https://doi.org/10.1023/A:1007959906858

Download citation

  • Issue Date:

  • DOI: https://doi.org/10.1023/A:1007959906858

Navigation