Computational Sciences and Mathematics Research Department


Scalable Fault Tolerance for Distributed Computing

Description

As tera-scale distributed computing platforms grow in size and complexity, so does the probability that a failure will occur during a long-running simulation. Applications running on these systems are not currently equipped to handle failures efficiently, nor do vendors provide support for fault tolerance. Thus, when a failure occurs, the application crashes. Many programmers use checkpoints so that their applications can be restarted, but checkpointing is cumbersome and incurs substantial overhead, especially as the number of processors and the size of the problem grow.
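To give a rough sense of how failure probability scales with machine size, the sketch below estimates the chance that at least one node fails during a run. It is an illustration only, not a model taken from the workshop: it assumes nodes fail independently with exponentially distributed times to failure, and the function name, the per-node mean time between failures (MTBF), and the run length are all hypothetical values chosen for the example.

```python
import math


def failure_probability(num_nodes: int, node_mtbf_hours: float, run_hours: float) -> float:
    """Probability that at least one node fails during the run.

    Assumption (not from the source): nodes fail independently, each with an
    exponentially distributed time to failure and mean node_mtbf_hours.
    """
    # Under that assumption the failure rates add, so the whole system
    # fails at a rate of num_nodes / node_mtbf_hours per hour.
    system_rate = num_nodes / node_mtbf_hours
    # -expm1(-x) evaluates 1 - exp(-x) without losing precision for small x.
    return -math.expm1(-system_rate * run_hours)


if __name__ == "__main__":
    # Hypothetical figures: a 24-hour simulation, 50,000-hour MTBF per node.
    for n in (64, 1024, 16384):
        p = failure_probability(n, node_mtbf_hours=50_000.0, run_hours=24.0)
        print(f"{n:6d} nodes: P(at least one failure) = {p:.3f}")
```

Under these assumptions the probability rises steadily with the node count for a fixed run length, which is the trend the paragraph above describes.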

To explore new methods for scalable fault tolerance in large-scale distributed applications, Sandia's Computer Science Research Institute (CSRI) sponsored a workshop on fault tolerance, held at the California site. The workshop featured a series of technical presentations by leading researchers in the field and by Sandia scientists, as well as numerous discussions of new and exciting research directions.

Workshop Attendees

Workshop Presentations

  • Experimental Analysis of Flat and Layered Gossip Services for Scalable Distributed Failure Detection and Consensus (PDF file)
    Alan George, The University of Florida
     
  • Lightweight Fault-Tolerance (PDF file)
    Lorenzo Alvisi, The University of Texas at Austin
     
  • Transparent Fault-Tolerance: Mechanisms and Design Principles (PDF file)
    Thomas Bressoud, Bell Laboratories
     
  • Cplant Overview (PDF file)
    Lee Ward, Sandia National Laboratories
     
  • Fault Tolerance for Quantum Chemistry Methods (PDF file)
    Curtis Janssen, Sandia National Laboratories
     
  • Fault Tolerance in APPSPACK: Asynchronous Parallel Pattern Search for Derivative-Free Optimization (PDF file)
    Tamara Kolda, Sandia National Laboratories

Related Links

Contact

For more information, please contact Patty Hough (pdhough@ca.sandia.gov).

 

CSMR Department Projects at Sandia National Labs in California.
Copyright © 2001, Sandia Corp. All rights reserved.
Comments: tgkolda@sandia.gov.
Acknowledgments and Disclaimer.