Computational Sciences and Mathematics Research Department


Scalable Fault Tolerance for Distributed Computing

Description

Large-scale ASCI codes such as ALEGRA, Xyce, and the SIERRA applications are expected to run for days or even weeks on thousands of processors in order to solve the problems of interest to the ASCI program. Because of the large number of processors, the complexity of the computing platforms, and the long execution times, it is reasonable to expect at least one failure during any single run. More specifically, existing statistics indicate that failures can be as frequent as one every two days on 2000 processors, and the failure rate rises as the number of processors increases. Traditionally, the applications of interest have not been equipped to handle failures efficiently, so when a failure occurs, the application crashes. While many applications write restart files to allow for warm starting after a failure, this approach is cumbersome and incurs an unacceptable amount of overhead. As a result, there has been movement in both the research and applications communities toward more efficient and more sophisticated strategies for fault tolerance.
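
To make the scaling argument concrete, the following is a rough sketch of how the quoted figure (about one failure every two days on 2000 processors) might extrapolate to larger machines and week-long runs. It assumes that processors fail independently at a constant rate, so that the system failure rate grows linearly with the processor count. That model, the function names, and the example processor counts are illustrative assumptions and are not taken from the workshop material.

```python
import math

# Figure quoted in the text: roughly one failure every two days on 2000 processors.
OBSERVED_SYSTEM_MTBF_DAYS = 2.0
OBSERVED_PROCESSORS = 2000


def system_mtbf_days(processors: int) -> float:
    """Mean time between failures for the whole machine, in days.

    Assumption (not from the source): processors fail independently at a
    constant rate, so the system failure rate is proportional to their number.
    """
    per_processor_mtbf = OBSERVED_SYSTEM_MTBF_DAYS * OBSERVED_PROCESSORS
    return per_processor_mtbf / processors


def prob_at_least_one_failure(processors: int, run_days: float) -> float:
    """Probability of one or more failures during a run, under an exponential model."""
    return 1.0 - math.exp(-run_days / system_mtbf_days(processors))


if __name__ == "__main__":
    # Hypothetical machine sizes and a one-week run, chosen only for illustration.
    for n in (2000, 4000, 8000):
        mtbf = system_mtbf_days(n)
        p = prob_at_least_one_failure(n, run_days=7.0)
        print(f"{n} processors: MTBF {mtbf:.2f} days, P(failure in 7 days) = {p:.4f}")
```

Under these assumptions, doubling the processor count halves the expected time between failures, which is consistent with the expectation that a multi-day run on thousands of processors will see at least one failure.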

In response to the growing interest in scalable techniques for fault tolerance, the Computer Science Research Institute (CSRI) at Sandia National Laboratories hosted a workshop that brought together Sandians interested in fault tolerance and external experts in the field. The primary goals of the workshop were to:

  • educate external experts about Sandia problems,
  • raise Sandia awareness of state-of-the-art fault tolerance technology, and
  • identify gaps between Sandia's needs and current research trends.
To accomplish these goals, the agenda included presentations on ASCI applications and platforms, failure detection, communication infrastructures, logging/checkpointing, and compiler technology, as well as one formal discussion session. Presentations and other useful links can be found below.

Workshop Presentations

  • Fault Tolerance in Electrical Circuit Simulation at Sandia (PDF file)
    Robert J. Hoekstra, Sandia National Laboratories
     
  • The ALEGRA FEM Code: Overview and Thoughts on Fault Tolerance (PDF file)
    Richard R. Drake, Sandia National Laboratories
     
  • The SIERRA Framework for Massively Parallel Multiphysics Applications (PDF file)
    H. Carter Edwards, Sandia National Laboratories
     
  • Reliability Issues on ASCI Bluemountain and ASCI Q (PDF file)
    Laura A. Davey and Georgia A. Pedicini, Los Alamos National Laboratory
     
  • A Gossip-Based Service for Failure Detection and Resource Management in Heterogeneous Distributed Systems (PDF file)
    Alan D. George, University of Florida
     
  • Reliable Application Execution in LAM/MPI (PDF file)
    Andrew Lumsdaine, Indiana University
     
  • MPI/FT: A Model-Based Approach for Low-Overhead Fault Tolerance (PDF file)
    Anthony Skjellum, MPI Software Technology, Inc.
     
  • Fault Tolerance in MPI Programs (PDF file)
    Bill Gropp and Rusty Lusk, Argonne National Laboratory
    (presented by Ronald B. Brightwell, Sandia National Laboratories)
     
  • Fault Tolerance in the Network Storage Stack (PDF file)
    James S. Plank, University of Tennessee
     
  • Language-Based Fault Tolerance for Parallel Applications (PDF file)
    Daniel Marques, Cornell University
     
  • Fault Tolerance? No Problemo (PDF file)
    Jeffrey Napper, The University of Texas at Austin
     
  • Discussion Summary
    coming soon...  

Related Links

Contact

For more information, please contact Patty Hough (pdhough@ca.sandia.gov).

 

Copyright © 2001, Sandia Corp. All rights reserved.