[mpiwg-ft] Madrid Report

Wed Sep 25 13:43:32 CDT 2013

Christian:

My suggestion was not to remove a function from the existing
proposal in order to provide a query capability. My proposal
was to re-evaluate the decision to jump to a complex solution
prior to addressing the primary issue for most users.

I am aware of the lineage of the paragraph that I cited. I
do not agree that its age implies it cannot be fixed. In
fact, no FT proposal has any value if it does not address
that paragraph. My suggestion was to take a simpler approach
that allows a user to query whether an error will persist
and, if so, whether it implies other failures for various
classes of operations. Such an interface would be best
effort but would generally allow a simple "I can keep
trying (possibly in a slightly modified form) until I think
the implementation does not know it is hosed" approach.
More complex approaches could follow, but they would be
improved by having such a query interface available.
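
A minimal sketch of the retry pattern described above, purely
for illustration. Nothing here is MPI or part of any proposal:
mock_call() stands in for an arbitrary MPI operation, and
mock_query(), struct error_info, and enum op_class are
hypothetical names for a best-effort query that reports
whether an error will persist and which classes of operations
it is expected to affect.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classes of operations a failure might affect. */
enum op_class {
    OP_POINT_TO_POINT = 1 << 0,
    OP_COLLECTIVE     = 1 << 1,
    OP_ONE_SIDED      = 1 << 2
};

/* Hypothetical best-effort answer from the implementation. */
struct error_info {
    bool     persistent; /* would retrying the same call fail again? */
    unsigned affected;   /* bitmask of op_class values expected to fail */
};

/* Stand-in for an MPI operation (an assumption, not a real API). */
struct mock_op {
    int  transient_failures; /* attempts that fail before success */
    bool hard_fault;         /* true: the call can never succeed */
    int  calls;              /* number of attempts made so far */
};

/* Returns 0 on success, nonzero on error. */
static int mock_call(struct mock_op *op)
{
    op->calls++;
    if (op->hard_fault)
        return 1;
    if (op->transient_failures > 0) {
        op->transient_failures--;
        return 1;
    }
    return 0;
}

/* Hypothetical query: a hard fault is reported as persistent and,
 * in this mock, as also breaking collective operations. */
static struct error_info mock_query(const struct mock_op *op)
{
    struct error_info info;
    info.persistent = op->hard_fault;
    info.affected   = op->hard_fault ? (unsigned)OP_COLLECTIVE : 0u;
    return info;
}

/* Keep trying while the query says the error is not persistent,
 * up to max_attempts. Returns true if the operation succeeded. */
static bool try_with_query(struct mock_op *op, int max_attempts,
                           struct error_info *last)
{
    assert(max_attempts > 0);
    for (int i = 0; i < max_attempts; i++) {
        if (mock_call(op) == 0)
            return true;
        *last = mock_query(op);
        if (last->persistent)
            return false; /* retrying is pointless; let the caller decide */
    }
    return false; /* attempt budget exhausted */
}
```

The caller gives up early only when the query reports a
persistent error; the attempt limit covers the case where the
best-effort answer is wrong.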

Bronis

On Wed, 25 Sep 2013, Christian Engelmann wrote:

>
> Bronis, the problem is that the paragraph you pointed out (and quoted below for reference) has been part of MPI since version 1.1. MPI was simply not conceived with fault tolerance or fault awareness in mind. Moreover, MPI was designed with a simplistic fault model, i.e., the state after any error is undefined. Rectifying this almost 20 years later is a difficult task.
>
> I agree that the proposed solution is not easy to comprehend from an application or library developer's point of view. I do think that fault tolerant applications that demonstrate the capabilities and usefulness of the proposed enhancements would help a lot.
>
> I also think that there is a general misunderstanding of how this interface is supposed to be used. Wrapping every MPI call in a loop to catch and recover from potential errors is certainly a nonsensical approach. Instead, the more practical approach is transaction- or exception-based programming. This requires programming templates atop the proposed MPI enhancements.
>
> Once again, demonstrations using fault tolerant applications would really help. Martin already pointed to some. I know of other work by UT (Jack's group), UCR (Zizhong Chen), and UoH (Edgar Gabriel), all based on UT's FT-MPI from 2003. Moving those to the proposed MPI enhancements would help greatly.
>
> The only failure recovery function proposed is MPI_Comm_shrink(). Removing it simply means that only point-to-point communication, but not collectives, will work after a process fault. A point-to-point-only MPI is pretty useless in my opinion.
>
> Christian
>
> On Sep 25, 2013, at 12:51 PM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>
>>  This document does not specify the state of a computation
>>  after an erroneous MPI call has occurred. The desired
>>  behavior is that a relevant error code be returned, and
>>  the effect of the error be localized to the greatest
>>  possible extent.
>
> --
>
> Christian Engelmann, Ph.D.
>
> System Software Team Task Lead / R&D Staff Scientist
> Computer Science Research Group
> Computer Science and Mathematics Division
> Oak Ridge National Laboratory
>
> Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
> Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
> e-Mail: engelmannc at ornl.gov / Home: www.christian-engelmann.info
>
>