Standard Article

Failure Detectors for Asynchronous Distributed Systems: An Introduction

  1. Michel Rennes

Published Online: 15 SEP 2008

DOI: 10.1002/9780470050118.ecse148

Wiley Encyclopedia of Computer Science and Engineering

How to Cite

Rennes, M. 2008. Failure Detectors for Asynchronous Distributed Systems: An Introduction. Wiley Encyclopedia of Computer Science and Engineering. 1–11.

Author Information

  1. IRISA, Université de Rennes, Rennes Cedex, France

Publication History

  1. Published Online: 15 SEP 2008

Abstract

Since the first version of Chandra and Toueg's seminal paper “Unreliable failure detectors for reliable distributed systems” appeared in 1991, the failure detector concept has been studied extensively. This is not surprising, as failure detection is pervasive in the design, analysis, and implementation of many fault-tolerant distributed algorithms that constitute the core of distributed system middleware.

The literature on this topic is mostly technical and appears mainly in theoretically inclined journals and conferences. The aim of this article is to offer an introductory survey of the failure detector concept for readers who are not familiar with it and want to quickly understand its aim, its basic principles, its power, and its limitations. To attain this goal, the article first describes the motivations that underlie the concept, and then surveys several distributed computing problems, showing how each can be solved with the help of an appropriate failure detector. This short article thus presents motivations, concepts, problems, definitions, and algorithms; it does not contain proofs. It is aimed at people who want to understand the basics of failure detectors.

Keywords:

  • agreement problem;
  • asynchronous distributed systems;
  • consensus;
  • failure detector;
  • fault tolerance;
  • oracle;
  • message passing;
  • process crash;
  • reliability