Virtual-machine-based heterogeneous checkpointing



Checkpointing an application means saving its state to stable storage during execution, so that if the application fails it can be restarted from the last saved state rather than losing the work already done. A heterogeneous checkpoint/restart mechanism allows an application to be restarted on a hardware architecture and/or operating system that may differ from the one on which it was checkpointed. This paper explores how to construct such a mechanism at the virtual machine level. That is, rather than dumping the entire state of the application process, the mechanism reported here dumps the state of the application as maintained by a virtual machine. During restart, the saved state is loaded into a new copy of the virtual machine, which resumes execution from that point. The mechanism was developed for the OCaml variant of ML. The paper reports on the main issues encountered in building such a mechanism and the design choices made, presents performance evaluations, and discusses some lessons and ideas for extending the work to native-code OCaml and Java. Copyright © 2002 John Wiley & Sons, Ltd.
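
The idea of checkpointing at the virtual machine level can be illustrated with a minimal sketch. It is not the mechanism described in the paper, which targets the OCaml bytecode virtual machine; it is a toy stack machine written in Python purely for illustration. The class `ToyVM`, its two instructions, and the use of JSON as a stand-in for an architecture-neutral checkpoint format are all assumptions made for this example. The point it shows is that only the state the virtual machine itself maintains (program counter, stack, program) is saved, so a fresh virtual machine instance on another platform could load it and continue.

```python
import json


class ToyVM:
    """Hypothetical toy stack machine; its entire state is (program, pc, stack)."""

    def __init__(self, program, pc=0, stack=None):
        self.program = program          # list of [opcode, argument] pairs
        self.pc = pc                    # index of the next instruction
        self.stack = list(stack) if stack is not None else []

    def done(self):
        return self.pc >= len(self.program)

    def step(self):
        """Execute exactly one instruction."""
        op, arg = self.program[self.pc]
        if op == "push":
            self.stack.append(arg)
        elif op == "add":
            b = self.stack.pop()
            a = self.stack.pop()
            self.stack.append(a + b)
        else:
            raise ValueError(f"unknown opcode: {op!r}")
        self.pc += 1

    def run(self, max_steps=None):
        """Run to completion, or for at most max_steps instructions."""
        steps = 0
        while not self.done() and (max_steps is None or steps < max_steps):
            self.step()
            steps += 1

    def checkpoint(self):
        """Serialize the VM-level state to text (assumed portable format: JSON)."""
        return json.dumps(
            {"program": self.program, "pc": self.pc, "stack": self.stack}
        )

    @classmethod
    def restart(cls, saved):
        """Build a new VM instance from a previously saved checkpoint."""
        state = json.loads(saved)
        return cls(state["program"], state["pc"], state["stack"])


# Usage: run part of a program, checkpoint, then resume in a fresh VM.
program = [["push", 2], ["push", 3], ["add", None], ["push", 10], ["add", None]]
vm = ToyVM(program)
vm.run(max_steps=3)            # executes the first three instructions
saved = vm.checkpoint()        # text that could be written to stable storage
resumed = ToyVM.restart(saved)
resumed.run()                  # continues from the saved program counter
print(resumed.stack)
```

Because the checkpoint holds only interpreter-level values in a textual encoding, nothing in it depends on the word size, byte order, or memory layout of the machine that produced it. A real virtual machine has far more state to handle (heap objects, open files, native pointers), which is where the design issues discussed in the paper arise.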