[mpiwg-ft] Madrid Report

Christian Engelmann engelmannc at computer.org
Wed Sep 25 13:30:28 CDT 2013

Bronis, the problem is that the paragraph you pointed out (and quoted below for reference) has been part of MPI since version 1.1. MPI was simply not conceived with fault tolerance or fault awareness in mind. Moreover, MPI was designed with a simplistic fault model, i.e., the state after any error is undefined. Rectifying this almost 20 years later is a difficult task.

I agree that the proposed solution is not easy to comprehend from an application or library developer's point of view. I do think that fault tolerant applications that demonstrate the capabilities and usefulness of the proposed enhancements would help a lot.

I also think that there is a general misunderstanding on how this interface is supposed to be used. Looping around every MPI call to catch and recover from potential errors is certainly a quite nonsensical approach. Instead, the more practical approach is transaction- or exception-based programming. This requires programming templates atop the proposed MPI enhancements.

Once again, demonstrations using fault tolerant applications would really help. Martin already pointed to some. I know of other work by UT (Jack's group), UCR (Zizhong Chen), and UoH (Edgar Gabriel), all based on UT's FT-MPI from 2003. Moving those to the proposed MPI enhancements would help greatly.

The only failure recovery function proposed is MPI_Comm_shrink(). Removing this, simply means that only point-to-point communication will work after a process fault, but not collectives. A point-to-point only MPI is pretty useless in my opinion.


On Sep 25, 2013, at 12:51 PM, Bronis R. de Supinski <bronis at llnl.gov> wrote:

>  This document does not specify the state of a computation
>  after an erroneous MPI call has occurred. The desired
>  behavior is that a relevant error code be returned, and
>  the effect of the error be localized to the greatest
>  possible extent.


Christian Engelmann, Ph.D.

System Software Team Task Lead / R&D Staff Scientist
Computer Science Research Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory

Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
e-Mail: engelmannc at ornl.gov / Home: www.christian-engelmann.info

More information about the mpiwg-ft mailing list