Scalable Fault Tolerance for Distributed Computing

Description

As the size and complexity of tera-scale distributed computing platforms continue to grow, so does the probability that a failure will occur during the execution of a long-running simulation. Applications running on these systems are not currently equipped to deal with failures efficiently, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While many programmers make use of checkpoints to allow for restarting their applications, this is cumbersome and incurs substantial overhead, especially as the number of processors and the size of the problem become large.

In order to explore new methods for scalable fault tolerance in large-scale distributed applications, Sandia's Computer Science Research Institute (CSRI) sponsored a workshop on fault tolerance, held at the California site. The workshop featured a series of technical presentations by leading researchers in the field and by Sandia scientists, as well as numerous discussions on new and exciting research directions.

Workshop Attendees
Workshop Presentations
Related Links

Contact

For more information, please contact Patty Hough (pdhough@ca.sandia.gov).

CSMR Department Projects at Sandia National Labs in California.
Copyright © 2001, Sandia Corp. All rights reserved.
Comments: tgkolda@sandia.gov.