Scalable Fault Tolerance for Distributed Computing

Description

Large-scale ASCI codes such as ALEGRA, Xyce, and the SIERRA applications are expected to run for days or even weeks on thousands of processors in order to solve the problems of interest to the ASCI program. Because of the large number of processors, the complexity of the computing platforms, and the long execution times, it is reasonable to expect at least one failure to occur during any single run. More specifically, existing statistics indicate that failures can be as frequent as one every two days on 2000 processors. This number becomes progressively worse as the number of processors increases.

Traditionally, the applications of interest have not been equipped to efficiently deal with failures, so when one occurs, the application crashes. While many applications write restart files to allow for warm starting after a failure, this is cumbersome and incurs an unacceptable amount of overhead. As a result, there has been movement in both the research and applications communities toward more efficient and more sophisticated strategies for fault tolerance.

In response to the growing interest in scalable techniques for fault tolerance, the Computer Science Research Institute (CSRI) at Sandia National Laboratories hosted a workshop that brought together Sandians interested in fault tolerance and external experts on fault tolerance. The primary goals of the workshop were to:
Workshop Presentations
Related Links
Contact
For more information, please contact Patty Hough (pdhough@ca.sandia.gov).
CSMR Department Projects at Sandia National Labs in California.
Copyright © 2001, Sandia Corp. All rights reserved.
Comments: tgkolda@sandia.gov.