Active optimistic and distributed message logging for message-passing applications


Thomas Ropars, INRIA, Centre Rennes-Bretagne Atlantique, Campus universitaire de Beaulieu, 35402 Rennes, France.



Message logging is an attractive way to provide fault tolerance for message-passing applications because it is more scalable than coordinated checkpointing. Sender-based message logging is a well-known optimization that saves message payloads in the sender's memory, so that only message reception events have to be logged reliably, using an event logger. This paper proposes solutions to further improve the scalability of message logging protocols. In existing work on message logging, the event logger has always been considered a centralized process. We propose a distributed event logger that takes advantage of multi-core processors to run in parallel with application processes and uses the volatile memory of the nodes to save events reliably. We also propose combining our distributed event logger with O2P, an active optimistic message logging protocol using a gossip-based protocol to disseminate information on new stable events. Our distributed event logger and O2P are implemented in the Open MPI library. Our results show that (i) distributed event logging improves the scalability of message logging protocols and (ii) using O2P with a distributed event logger provides an efficient and scalable fault-tolerant solution for message-passing applications. Copyright © 2011 John Wiley & Sons, Ltd.