[mpiwg-ft] Madrid Report

Wed Sep 25 17:10:10 CDT 2013

Wesley:

An approach similar to Init_Thread seems essential.

As to what I was suggesting, it would not require any
polling or (additional) remote communication. It would
ideally be something along the lines of "MPI returned
this error code to me. If I make the same call again
will I get the same error code." The interface would
require some way to identify the call as well as the
error code. The answer would be easy for many types
of errors. For example, if the error corresponds to
bad arguments then the implementation can say that
"Yes, you will always get that error." However, the
implementation should be able to say that calls
with correct arguments will work. As the standard
currently is specified, one call with bad arguments
means all calls are hosed, which is clearly unnecessary.
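
As a minimal sketch of what such a query might look like
(every name below is hypothetical; nothing here is part of
MPI or of the current proposal), the caller identifies the
call and the error class, and gets back a best-effort answer.
The specific answers chosen for each class are assumptions
made for illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names are hypothetical stand-ins, not MPI identifiers. */
typedef enum { ERR_BAD_ARG, ERR_PROC_FAILED, ERR_OTHER } err_class_t;
typedef enum { CALL_SEND, CALL_RECV, CALL_COLLECTIVE } call_kind_t;

typedef enum {
    PERSIST_YES,     /* repeating the identical call fails the same way */
    PERSIST_NO,      /* repeating the identical call is expected to work */
    PERSIST_UNKNOWN  /* best effort: the implementation cannot tell */
} persist_t;

typedef struct {
    persist_t same_call;      /* outlook for the identical call */
    bool other_calls_usable;  /* are calls with correct arguments expected to work? */
} error_outlook_t;

/* Purely local lookup: no polling and no remote communication. */
void query_error_outlook(call_kind_t call, err_class_t err,
                         error_outlook_t *out)
{
    assert(out != NULL);
    (void)call;  /* a fuller version could refine the answer per call kind */

    switch (err) {
    case ERR_BAD_ARG:
        /* Bad arguments: the same call always fails, others are fine. */
        out->same_call = PERSIST_YES;
        out->other_calls_usable = true;
        break;
    case ERR_PROC_FAILED:
        /* Assumed: no detection machinery, so stay optimistic. */
        out->same_call = PERSIST_UNKNOWN;
        out->other_calls_usable = true;
        break;
    default:
        /* Assumed: anything else keeps today's "state undefined" answer. */
        out->same_call = PERSIST_UNKNOWN;
        out->other_calls_usable = false;
        break;
    }
}
```

A caller would consult this only after receiving an error
code, so the failure-free path pays nothing.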

Other error codes could provide different results. However,
the standard does not need to mandate them. For example,
if you have a process fail then you could indicate any
communication involving that process would fail. You seem
to be claiming that is basically the current proposed
interface. However, under a best effort query interface,
the user must actively request information for each error
code. Thus, the implementation could validly choose any of:

1. No machinery to detect remote process failures so
    the best guess is that it will continue to work;
2. A heuristic to detect remote process failure (e.g.,
    the last N messages to that process failed);
3. A more complicated solution, likely corresponding
    to what the current proposal requires.

Absent an inexpensive way to detect that the failure has
occurred, the second option is probably the best choice.
It might be pessimistic at times but most users
would be OK with that. Most users would strongly
prefer that occasional pessimism in exchange for
higher performance.
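
The "last N messages failed" heuristic could be sketched
roughly as follows. This is an illustration under stated
assumptions, not how any implementation does it: the type and
function names are hypothetical, peers are plain integer ranks
below a fixed MAX_PEERS, and "failed" means N consecutive
failures with any success resetting the count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS 64  /* hypothetical fixed upper bound on ranks */

typedef struct {
    int threshold;                        /* the N in "last N messages" */
    int consecutive_failures[MAX_PEERS];  /* per-peer failure streak */
} failure_heuristic_t;

void heuristic_init(failure_heuristic_t *h, int threshold)
{
    assert(h != NULL && threshold > 0);
    h->threshold = threshold;
    for (int i = 0; i < MAX_PEERS; i++)
        h->consecutive_failures[i] = 0;
}

/* Record the local outcome of one message sent to `peer`. */
void heuristic_record(failure_heuristic_t *h, int peer, bool ok)
{
    assert(h != NULL && peer >= 0 && peer < MAX_PEERS);
    if (ok)
        h->consecutive_failures[peer] = 0;  /* any success clears suspicion */
    else if (h->consecutive_failures[peer] < h->threshold)
        h->consecutive_failures[peer]++;    /* saturate at the threshold */
}

/* Best-effort guess: true once the last N messages to `peer` all failed. */
bool heuristic_suspects_failed(const failure_heuristic_t *h, int peer)
{
    assert(h != NULL && peer >= 0 && peer < MAX_PEERS);
    return h->consecutive_failures[peer] >= h->threshold;
}
```

The bookkeeping is one counter update per message and needs
no extra communication. The pessimism shows up when a slow
but live peer is suspected after N failures in a row.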

Bronis

On Wed, 25 Sep 2013, Wesley Bland wrote:

> Hi Bronis,
>
> Thanks for the feedback. As Christian mentioned, most of what this proposal does is failure notification (as opposed to recovery). The biggest change (in terms of implementation) is found in the text in section 17.2.2:
>
> "In all other cases, the operation raises an exception of class MPI_ERR_PROC_FAILED to indicate that the failure prevents the operation from following its failure-free specification. If there is a request identifying the point-to-point communication, it is completed. Future point-to-point communication with the same process on this communicator must also raise MPI_ERR_PROC_FAILED."
>
> This (along with some other similar passages for things like collectives, startup/finalize, dynamic processes, etc. found in 17.2) specifies how MPI should behave after a failure. There is other text sprinkled around the other chapters to clean up the sections you mentioned that obviously need to be updated to mention that we will now handle failure, but the other new additions are relatively minor from an implementation perspective. Also, it's our intent to combine ticket 336 (https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/336) into this proposal to allow the user to query whether the implementation supports FT (this might show up in the form of an INIT_THREADS style request/supported system, but we still have to discuss it further). So if the user doesn't want/need FT, they're welcome to ignore it.
>
> As far as querying whether an error will persist, that requires an entirely different set of semantics. That would necessitate standardizing transient errors and specifying how an implementation should handle them. We're trying to make things as simple as possible. Either a process has failed and will always be failed, or it is still correct. As Christian mentioned, this can and should probably be handled with an exception system like the MPI Errhandler callbacks.
>
> I may be misunderstanding your meaning, but I think periodically polling the system to check if some process or set of processes is still alive would be rather cumbersome. Instead, the library can alert you when the failure occurs and do the hard work for you. Everything else we provide is a follow on from that, but it's designed to be minimally intrusive.
>
> To your last point, I'm certainly not rooting for a narrow passage, and after the most recent voting rules update, I don't think such a thing is possible anymore. We're not trying to ram this down anyone's throat at all. That's why we've been socializing it so much to get as much feedback as possible. We want to take into consideration all of the feedback that we're getting and provide rationale whenever we can. The reason we have been targeting December is that the feedback that I've been aware of as far back as the last vote in Japan was not that the proposal was technically deficient, but that it was too young and unstable. We've given it another year and a half to acquire more users, implementations, feedback, etc. and time to stabilize. We believe we've done that and are ready to give it another shot. The text is largely unchanged from the previous reading other than a few clarifications and some additions to the RMA and I/O sections. More feedback is encouraged and welcomed and if there's something that we can fix, we'll do so.
>
> Thanks,
> Wesley
>
> On Sep 25, 2013, at 1:43 PM, "Bronis R. de Supinski" <bronis at llnl.gov> wrote:
>
>>
>> Christian:
>>
>> My suggestion was not to remove a function from the existing
>> proposal in order to provide a query capability. My proposal
>> was to re-evaluate the decision to jump to a complex solution
>> prior to addressing the primary issue for most users.
>>
>> I am aware of the lineage of the paragraph that I cited. I
>> do not agree that its age implies it cannot be fixed. In
>> fact, no FT proposal has any value if it does not address
>> that paragraph. My suggestion was to take a simpler approach
>> that allows a user to query whether an error will persist
>> and, if so, whether it implies other failures for various
>> classes of operations. Such an interface would be best
>> effort but would generally allow a simple "I can keep
>> trying (possibly in a slightly modified form) until I think
>> the implementation does not know it is hosed" approach.
>> More complex approaches could follow but they would be
>> improved by having such a query interface available.
>>
>> Bronis
>>
>> On Wed, 25 Sep 2013, Christian Engelmann wrote:
>>
>>>
>>> Bronis, the problem is that the paragraph you pointed out (and quoted below for reference) has been part of MPI since version 1.1. MPI was simply not conceived with fault tolerance or fault awareness in mind. Moreover, MPI was designed with a simplistic fault model, i.e., the state after any error is undefined. Rectifying this almost 20 years later is a difficult task.
>>>
>>> I agree that the proposed solution is not easy to comprehend from an application or library developer's point of view. I do think that fault tolerant applications that demonstrate the capabilities and usefulness of the proposed enhancements would help a lot.
>>>
>>> I also think that there is a general misunderstanding on how this interface is supposed to be used. Looping around every MPI call to catch and recover from potential errors is certainly a quite nonsensical approach. Instead, the more practical approach is transaction- or exception-based programming. This requires programming templates atop the proposed MPI enhancements.
>>>
>>> Once again, demonstrations using fault tolerant applications would really help. Martin already pointed to some. I know of other work by UT (Jack's group), UCR (Zizhong Chen), and UoH (Edgar Gabriel), all based on UT's FT-MPI from 2003. Moving those to the proposed MPI enhancements would help greatly.
>>>
>>> The only failure recovery function proposed is MPI_Comm_shrink(). Removing it simply means that only point-to-point communication will work after a process fault, but not collectives. A point-to-point only MPI is pretty useless in my opinion.
>>>
>>> Christian
>>>
>>> On Sep 25, 2013, at 12:51 PM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>>>
>>>> This document does not specify the state of a computation
>>>> after an erroneous MPI call has occurred. The desired
>>>> behavior is that a relevant error code be returned, and
>>>> the effect of the error be localized to the greatest
>>>> possible extent.
>>>
>>> --
>>>
>>> Christian Engelmann, Ph.D.
>>>
>>> System Software Team Task Lead / R&D Staff Scientist
>>> Computer Science Research Group
>>> Computer Science and Mathematics Division
>>> Oak Ridge National Laboratory
>>>
>>> Mail: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
>>> Phone: +1 (865) 574-3132 / Fax: +1 (865) 576-5491
>>> e-Mail: engelmannc at ornl.gov / Home: www.christian-engelmann.info
>>>
>>>
>> _______________________________________________
>> mpiwg-ft mailing list
>> mpiwg-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft
>
>