FaBSR: a method for cluster failure prediction based on Bayesian serial revision and an application to LANL cluster

Authors

  • Qiang Liu

    Corresponding author
    1. College of Information System and Management, National University of Defense Technology, Changsha 410073, People's Republic of China
    2. School of Computer Science, McGill University, Montreal, Canada
  • Jinglun Zhou

    1. College of Information System and Management, National University of Defense Technology, Changsha 410073, People's Republic of China
  • Guang Jin

    1. College of Information System and Management, National University of Defense Technology, Changsha 410073, People's Republic of China
  • Quan Sun

    1. College of Information System and Management, National University of Defense Technology, Changsha 410073, People's Republic of China
    2. School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, U.S.A.
  • Min Xi

    1. School of Computer Science, McGill University, Montreal, Canada
    2. Department of Computer Science, Xi'an Jiaotong University, Xi'an, People's Republic of China

Abstract

Accurately predicting the number of failures in Repairable Large-scale Long-running Computing (RLLC) cluster systems is challenging because such systems are repairable and large in scale. Furthermore, system maintenance causes the failure rate to vary over time, which yields a small-sample problem: the failure numbers observed in different time phases do not belong to the same population. To address this challenge, a general Bayesian serial revision prediction method (FaBSR) is proposed on the basis of the Time Series and Bootstrap approaches. The method can determine the distribution of the failure number, analyze the variation trend of the failure rate, and accurately predict the failure number. To demonstrate the performance gains of the method, data from the Los Alamos National Laboratory (LANL) cluster system, taken as a typical RLLC system, are used in extensive experiments. The experimental results show that the prediction accuracy of FaBSR is 80.4%, an improvement of more than 4% over other existing methods. Copyright © 2010 John Wiley & Sons, Ltd.
