Abstract
Distributed shared memory (DSM) is a very promising programming model for exploiting the parallelism of distributed memory systems, because it provides a higher level of abstraction than simple message passing. Although the nodes of standard distributed systems exhibit high crash rates only very few DSM environments have some kind of support for fault-tolerance.
In this article, we present a checkpointing mechanism for a DSM system that is efficient and portable. It offers some portability because it is built on top of MPI and uses only the services offered by MPI and a POSIX compliant local file system.
As far as we know, this is the first real implementation of such a scheme for DSM. Along with the description of the algorithm we present experimental results obtained in a cluster of workstations. We hope that our research shows that efficient, transparent and portable checkpointing is viable for DSM systems.
Similar content being viewed by others
References
[Brown94] L. Brown, J. Wu. “Dynamic Snooping in a Fault-Tolerant Distributed Shared Memory”, Proc. 14th Int. Conf. on Distributed Computing Systems, pp. 218–226, 1994.
[Cabillic95] G. Cabillic, G. Muller, and I. Puaut. “The Performance of Consistent Checkpointing in Distributed Shared Memory Systems”, Proc. 14th Symposium on Reliable Distributed Systems, SRDS-14, 1995.
[Cater91] J. Carter, J. Bennet, and W. Zwaenepoel. “Implementation and Performance of Munin”, Proc. 13th ACM Symposium on Operating Systems Principles, pp. 152–164, 1991.
[Cater93] J. B. Carter, A. Cox, S. Dwarkadas, E. N. Elnozahi, D. Johnson, P. Keleher, S. Rodrigues, W. Yu, and W. Zwaenepoel. “Network Multicomputer Using Recoverable Distributed Shared Memory”, Proc. COMPCON'93, 1993.
[Choy95] M. Choy, H. Leong, and M. H. Wong. “On Distributed Object Checkpointing and Recovery”, Proc. of the ACM Principles of Distributed Computing, PODC95, 1995.
[Costa96] M. Costa, P. Guedes, M. Sequeira, N. Neves, and M. Castro “Lightweight Logging for Lazy Release Consistency Consistent Distributed Shared Memory”, Proc. 2nd Usenix Symposium on Operating Systems Design and Implementation, pp 59–73, Seattle, October 1996.
[Elnozahi92] E. N. Elnozahi, D. B. Johnson, and W. Zwaenepoel. “The Performance of Consistent Checkpointing”, Proc. 11th Symp. on Reliable Distributed Systems, pp. 39–47, 1992.
[Eskicioglu95] M. R. Eskicioglu. “A Comprehensive Bibliography of Distributed Shared Memory”, Technical Report TR-95-01, Univ. of New Orleans, May 1995.
[Janakiraman94] G. Janakiramam and Y. Tamir. “Coordinated Checkpointing-Rollback Error Recovery for Distributed Shared Memory Multicomputers”, Proceedings of the 13th Symposium on Reliable Distributed Systems, SRDS-13, pp. 41-51, October 1994.
[Janssens93] B. Janssens and W. K. Fuchs. “Relaxing Consistency in Recoverable Distributed Shared Memory”, Proceedings 23rd Fault-Tolerant Computing Symposium, FTCS-23, pp. 155–163, June 1993.
[Johnson95] K. L. Johnson, M. F. Kaashoek, and D. Wallach. “CRL: High-Performance All-Software Distributed Shared Memory”, Proc. of the 15th Symposium on Operating Systems Principles, 1995.
[Kaashoek92] M. F. Kaashoek, R. Michiels, H. Bal, and A. Tanenbaum. “Transparent Fault-Tolerance in Parallel Orca Programs”, Proc. Symposium on Experiences with Distributed and Multiprocessor Systems III, pp. 297–312, 1992.
[Keleher94] P. Keleher, A. Cox, S. Dwarkadas, and W. Zwaenepoel. “TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems”, Proc. Winter 94 USENIX Conference, pp. 115–131, January 94.
[Kermarrec95] A. M. Kermarrek, G. Cabillic, A. Gefflaut, and C. Morin, I. Puaut. “A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability”, Proc. 25th Fault-Tolerant Computing Symposium, FTCS-25, pp. 289–298, July 1995.
[Kranz93] D. Kranz, K. Johnson, and A. Agarwal. “Integrating Message-Passing and Shared-Memory: Early Experience”, Proc. 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 54–63, May 1993.
[Lenoski90] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Henessy. “The Directory-based Cache Coherence Protocol for the DASH Multiprocessor”, Proc. 17th Annual International Symposium on Computer Architecture, pp. 148–159, 1990.
[Li89] K. Li and P. Hudak. “Memory Coherence in Shared Virtual Memory Systems”, ACMTransactions on Computer Systems, Vol. 7, No. 4, pp. 321–359, November 1989.
[Long95] D. Long, A. Muir, and R. Golding. “A Longitudinal Survey of Internet Host Reliability”, Proc. 14th Symposium on Reliable Distributed Systems, pp. 2–9, September 1995.
[MPI94] MPI Forum. MPI Forum. “A Message Passing Interface Standard”. May 1994, Available on netlib.
[Neves94] N. Neves, M. Castro, and P. Guedes. “A Checkpoint Protocol for an Entry Consistent Shared Memory System”, Proc. of the 13th ACM Symposium on Principles of Distributed Computing, 1994.
[Nitzberg91] B. Nitzberg and V. Lo. “Distributed Shared Memory: A Survey of Issues and Algorithms”, IEEE Computer, Vol. 24 (8), pp. 52–60, 1991.
[ORNL95] data available in: http://www.ccs.ornl.gov/
[Plank94] J. Plank and K. Li. “Performance Results of ickp--A Consistent Checkpointer on the iPSC/ 860”, Proc. of the Scalable High-Performance Computing Conference, Knoxville USA, pp. 686–693, 1994.
[Plank95] J. Plank, M. Beck, G. Kingsley, and K. Li. “Libckpt: Transparent Checkpointing Under Unix”, Usenix Winter 1995 Technical Conference, January 1995.
[Raina92] S. Raina. “Virtual Shared Memory: A Survey of Techniques and Systems”, Technical Report CSTR-92-36, University of Bristol, December 1992.
[Richard93] G. Richard III and M. Singhal. “Using Logging and Asynchronous Checkpointing to Implement Recoverable Distributed Shared Memory”, Proceedings 12th Symposium on Reliable Distributed Systems, SRDS-12, pp. 58–67, October 1993.
[Silva92] L. M. Silva and J. G. Silva. “Global Checkpoints for Distributed Programs”, Proc. 11th Symp. on Reliable Distributed Systems, pp. 155–162, Houston USA, 1992.
[Silva97] L. M. Silva, S. Chapple, and J. G. Silva. “Implementation and Performance of DSMPI”, to appear in the Scientific Programming Journal, 1997, John Wiley & Sons.
[Stumm90A] M. Stumm and S. Zhou. “Algorithms Implementing Distributed Shared Memory”, IEEE Computer, Vol. 23 (5), pp. 54–64, May 1990.
[Stumm90B] M. Stumm and S. Zhou. “Fault-Tolerant Distributed Shared Memory Algorithms”, Proc. 2nd IEEE Symp. on Parallel and Distributed Computing, pp. 719–724, Dec. 1990.
[Suri95] G. Suri, B. Janssens, and W. K. Fuchs. “Reduced Overhead Logging for Rollback Recovery in Distributed Shared Memory”, Proceedings 25 th Fault-Tolerant Computing Symposium, FTCS-25, pp. 279–288, June 1995.
[Wilkinson93] T. J. Wilkinson. “Implementing Fault-Tolerance in a 64-bit Distributed Operating System”, PhD Thesis City University, July 1993.
[Wu89] K. L. Wu and W. K. Fuchs. “Recoverable Distributed Shared Virtual Memory: Memory Coherence and Storage Structures”, Proceedings 19th Fault-Tolerant Computing Symposium, FTCS-19, pp 520–527, 1989.
Author information
Authors and Affiliations
Rights and permissions
About this article
Cite this article
Silva, L.M., Silva, J.G. Checkpointing Distributed Shared Memory. The Journal of Supercomputing 11, 137–158 (1997). https://doi.org/10.1023/A:1007959906858
Issue Date:
DOI: https://doi.org/10.1023/A:1007959906858