Checkpointing Distributed Shared Memory

Silva, Luis M.; Silva, João Gabriel

doi:10.1023/A:1007959906858

Published: October 1997

Volume 11, pages 137–158, (1997)
The Journal of Supercomputing

Abstract

Distributed shared memory (DSM) is a promising programming model for exploiting the parallelism of distributed memory systems because it provides a higher level of abstraction than simple message passing. Although the nodes of standard distributed systems exhibit high crash rates, very few DSM environments offer any support for fault tolerance.

In this article, we present a checkpointing mechanism for a DSM system that is efficient and portable. It achieves a degree of portability because it is built on top of MPI and uses only the services offered by MPI and a POSIX-compliant local file system.

As far as we know, this is the first real implementation of such a scheme for DSM. Along with a description of the algorithm, we present experimental results obtained on a cluster of workstations. We hope that our research shows that efficient, transparent, and portable checkpointing is viable for DSM systems.

Author information

Authors and Affiliations

Departamento Engenharia Informatica, Universidade de Coimbra, POLO II‐Vila Franca, P-3030, Coimbra, Portugal
Luis M. Silva & João Gabriel Silva

About this article

Cite this article

Silva, L.M., Silva, J.G. Checkpointing Distributed Shared Memory. The Journal of Supercomputing 11, 137–158 (1997). https://doi.org/10.1023/A:1007959906858

Issue Date: October 1997
DOI: https://doi.org/10.1023/A:1007959906858
