[mpiwg-ft] MPI_Comm_revoke behavior

Jim Dinan james.dinan at gmail.com
Mon Dec 9 10:20:01 CST 2013


Assuming no more travel glitches, I should make it for our 4pm WG meeting.
*cross-fingers*
On Dec 9, 2013 6:59 AM, "Jim Dinan" <james.dinan at gmail.com> wrote:

> It would be helpful to me if we could have a longer discussion about this.
>  If we have to make changes to matching, that will have significant impact,
> especially on offload networks (e.g. limited to 64 bits of matching).  I
> would like to better understand how this works when the incarnation is not
> a part of the match, and how you can ensure that messages from a previous
> incarnation don't match on a reused context ID.
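
To make the 64-bit matching constraint above concrete, here is a minimal sketch of how an incarnation field might be squeezed into a single 64-bit match word. The field widths (16-bit context ID, 8-bit incarnation, 16-bit source rank, 24-bit tag) and the name pack_match_bits are assumptions made for illustration only; they are not taken from any MPI implementation or offload network.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths; they must sum to 64 match bits. */
enum { CTX_BITS = 16, INC_BITS = 8, SRC_BITS = 16, TAG_BITS = 24 };

/* Pack the four fields into one 64-bit match word (illustrative layout). */
uint64_t pack_match_bits(uint32_t ctx, uint32_t inc, uint32_t src, uint32_t tag)
{
    /* Every bit given to the incarnation is taken from another field. */
    assert(ctx < (1u << CTX_BITS));
    assert(inc < (1u << INC_BITS));
    assert(src < (1u << SRC_BITS));
    assert(tag < (1u << TAG_BITS));

    return ((uint64_t)ctx << (INC_BITS + SRC_BITS + TAG_BITS)) |
           ((uint64_t)inc << (SRC_BITS + TAG_BITS)) |
           ((uint64_t)src << TAG_BITS) |
           (uint64_t)tag;
}
```

With this layout, a message sent on a previous incarnation of a reused context ID carries a different match word and therefore cannot match a receive posted on the new communicator. The cost is visible in the widths: the 8 incarnation bits come out of the context, rank, or tag space.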
>
> Unfortunately, United canceled my flight from BOS->ORD this morning, and
> I'm not sure I'll make it for the WG.  It would be great if this topic can
> still be discussed in the WG.  We should make sure we have a complete story
> around how the proposal impacts matching and how an implementation that
> does not include the incarnation number in the match can function reliably.
>
>  ~Jim.
>
>
> On Mon, Dec 9, 2013 at 12:01 AM, George Bosilca <bosilca at icl.utk.edu> wrote:
>
>> Jim,
>> The incarnation number is just one way, a mere optimization not needed for
>> correctness purposes. In fact, any MPI library currently supporting
>> non-collective MPI_COMM_FREE faces a similar issue today.
>>
>> Regarding the usage of an incarnation number, you are right: there is one
>> per communicator, it is part of the matching, and it must be monotonic. Its
>> size should be large enough to prevent the creation of communicators with
>> the same ID, while keeping in mind that creating a communicator is a
>> synchronous operation among participants. Thus, it can roll over without
>> any issues in case you reach the upper bound.
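
The rollover argument above can be illustrated with a small sketch of a wrapping incarnation counter. The 16-bit width and the names incarnation_t, next_incarnation and incarnation_newer are hypothetical, and the "newer than" test uses ordinary serial-number arithmetic, which is an assumption here rather than what the UTK ULFM implementation is known to do. The comparison is only meaningful while the incarnations still in flight for one context ID span less than half the counter range.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t incarnation_t;  /* hypothetical counter width */

/* Advance the counter; it wraps to 0 after reaching the upper bound. */
incarnation_t next_incarnation(incarnation_t cur)
{
    return (incarnation_t)(cur + 1u);
}

/* True if a is newer than b, even across a wraparound. */
bool incarnation_newer(incarnation_t a, incarnation_t b)
{
    uint16_t d = (uint16_t)(a - b);  /* distance modulo 2^16 */
    return d != 0 && d < 0x8000u;
}
```

Under these assumptions, an incarnation of 0 created right after 65535 still compares as newer, so a plain equality test on the match field keeps stale messages out.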
>>
>> George.
>>  On Dec 9, 2013 2:14 AM, "Jim Dinan" <james.dinan at gmail.com> wrote:
>>
>>> Hi Guys,
>>>
>>> Thanks for the detailed responses (and sorry for going off-topic).
>>>
>>> Does the incarnation number have to be included in matching?
>>>
>>> I assume that we would need to track an incarnation number for each
>>> context ID.  When creating a communicator, we can't guarantee that you'll
>>> get the same incarnation number everywhere for a given Context ID.  Do you
>>> select the highest incarnation number from among the processes involved?  I
>>> assume the incarnation counter is meant to be monotonically increasing -- what
>>> happens when it reaches its maximum value?
>>>
>>>  ~Jim.
>>>
>>>
>>> On Thu, Dec 5, 2013 at 1:55 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>>>
>>>> Yes, you can free the revoked communicator, so you can also reuse the
>>>> context id. There are a few ways you can handle this. One is to add an
>>>> incarnation number to the communicator or process (which is the route
>>>> chosen by the UTK ULFM implementation). Another is that once you know a
>>>> process has failed, you ignore all future messages from that process (this
>>>> causes issues if you want to deal with transient failures, but we’ve
>>>> defined the scope of the work to exclude those). I’m sure there are more,
>>>> possibly cleverer ways of solving this but those came to mind first.
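
The second approach mentioned above (ignore all future messages from a process once it is known to have failed) can be sketched as a simple filter on the receive path. The type failed_set, the MAX_RANKS limit and the function names are hypothetical and chosen only for illustration; ranks are assumed to lie in the range 0 to MAX_RANKS - 1.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_RANKS 1024  /* hypothetical upper bound on the number of ranks */

/* One bit per rank: set once the rank is known to have failed. */
typedef struct {
    uint8_t failed[MAX_RANKS / 8];
} failed_set;

void failed_set_init(failed_set *s)
{
    memset(s->failed, 0, sizeof s->failed);
}

void failed_set_mark(failed_set *s, int rank)
{
    s->failed[rank / 8] |= (uint8_t)(1u << (rank % 8));
}

bool failed_set_contains(const failed_set *s, int rank)
{
    return ((s->failed[rank / 8] >> (rank % 8)) & 1u) != 0;
}

/* Receive-path filter: drop anything arriving from a known-failed rank. */
bool should_deliver(const failed_set *s, int src_rank)
{
    return !failed_set_contains(s, src_rank);
}
```

Because the mark is never cleared, a process that comes back after a transient failure stays filtered out, which is the limitation noted above.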
>>>>
>>>> Wesley
>>>>
>>>>
>>>> On Dec 5, 2013, at 12:39 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>>>>
>>>> This is a little off-topic from the current discussion -- could revoking
>>>> communicators lead to leaking context IDs?  After revoking a
>>>> communicator, can you safely free it and then reuse the same context ID
>>>> without concerns about erroneous messages arriving with that context ID?
>>>>  Or is that context ID dead for the remainder of the execution?
>>>>
>>>>  ~Jim.
>>>>
>>>>
>>>> On Thu, Dec 5, 2013 at 10:49 AM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>>>>
>>>>> I think there’s still confusion here between the current FT proposal
>>>>> and the roadmap for previous proposals. I think the previous proposal was
>>>>> designed to be a stopgap solution that would allow MPI to stabilize, but
>>>>> not necessarily fully recover without a lot of work. The current proposal
>>>>> provides all the tools necessary for both stabilization and recovery. Thus
>>>>> we haven’t been discussing the “next step”, because I think most of us see
>>>>> the next step not as a standardization effort for failure recovery, but
>>>>> something outside of the standard such as the libraries that will make FT
>>>>> easier to use.
>>>>>
>>>>> It’s true that MPI_COMM_REVOKE will make a current communicator
>>>>> unusable for further communication. Period. However, you can still query
>>>>> all of the local data about the communicator (rank, size, topology, info
>>>>> keys, etc.) to allow you to reconstruct a new communicator. That
>>>>> reconstruction is very much possible with the current proposal. It’s
>>>>> obviously not cheap, but I think we all know that FT recovery isn’t
>>>>> necessarily cheap. I think somewhere along the way we even had an example
>>>>> of how you could reconstruct a communicator with processes retaining their
>>>>> original ranks (though if you’re not going to replace the failed processes,
>>>>> I’m not sure what use it is to retain your rank since your algorithm will
>>>>> need to be able to handle such things anyway).
>>>>>
>>>>> MPI_COMM_REVOKE is the mechanism by which a process notifies other
>>>>> ranks of errors. No user-level protocol is necessary, though there are
>>>>> specific cases where a user-level solution might be better,
>>>>> depending on the communication pattern. When you call MPI_COMM_REVOKE, you
>>>>> give up on any remaining (incoming) traffic that’s outstanding on the
>>>>> communicator. Remote processes, too, will not receive any messages
>>>>> after they return the error code MPI_ERR_REVOKED. This doesn’t preclude
>>>>> them from delaying reporting MPI_ERR_REVOKED while they flush any messages
>>>>> that have already been received; however, once the error code is returned,
>>>>> the communicator is dead. Users have to be able to reason about this and I
>>>>> don’t think it’s entirely unclear. Once revoke is called, that communicator
>>>>> is hosed. You can create a new communicator and use it to decide what’s
>>>>> necessary to recover your application (algorithm state, data repair, etc.).
>>>>> That’s the guarantee that users get: once the communicator is revoked, all
>>>>> messages/requests that have not been received are cancelled and will never
>>>>> be completed. Any other guarantee invites lots of trouble as you have to
>>>>> deal with race conditions involving who receives the revoke notification
>>>>> before any other messages and whether or not all of the links necessary to
>>>>> complete a message transfer are still available. The entire point of the
>>>>> REVOKE call is to be a last-resort operation to prevent deadlock. It’s entirely
>>>>> possible that an application won’t need to use the revoke if there aren’t
>>>>> complex communication dependencies. It may be possible to take care of
>>>>> things at the user level in a way that’s cheaper than using REVOKE.
>>>>>
>>>>> The entire basis of this proposal is that we can’t provide a fault
>>>>> tolerance solution that will cheaply stabilize MPI, repair communicators,
>>>>> and notify all processes of failures. It’s possible to do all of these
>>>>> things in a way that’s not outrageously expensive, but they require
>>>>> cooperation from the application (or from libraries that act on
>>>>> the application’s behalf). It’s possible to do many of those things inside
>>>>> MPI itself, but they are so costly that they would never be approved. It’s
>>>>> much more effective to do them on top of MPI. We’re just providing the
>>>>> low-level tools necessary to make it happen.
>>>>>
>>>>> Wesley
>>>>>
>>>>> On Dec 5, 2013, at 8:14 AM, Richard Graham <richardg at mellanox.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> *From:* mpiwg-ft [mailto:mpiwg-ft-bounces at lists.mpi-forum.org] *On Behalf Of* George Bosilca
>>>>> *Sent:* Wednesday, November 27, 2013 3:35 PM
>>>>> *To:* MPI WG Fault Tolerance and Dynamic Process Control working Group
>>>>> *Subject:* Re: [mpiwg-ft] MPI_Comm_revoke behavior
>>>>>
>>>>>
>>>>>
>>>>> On Nov 27, 2013, at 20:54 , Richard Graham <richardg at mellanox.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>  On Nov 27, 2013, at 20:33 , Richard Graham <richardg at mellanox.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>>  I am thinking about the next step, and have some questions on the
>>>>> semantics of MPI_Comm_revoke()
>>>>>
>>>>> What next step are you referring to?
>>>>>
>>>>> [rich] To the full recovery stage.  Post what we are talking about now.
>>>>>
>>>>>
>>>>> Full recovery stage? Can you give a little more detail here, please?
>>>>>
>>>>> [rich] the original intent was to allow for full restoration of
>>>>> communicators after failure, with minimal impact on those ranks that did
>>>>> not fail (don’t want to get into what that means now …).  Those goals were
>>>>> reduced for pragmatic reasons.  I want to make sure that when/if there is
>>>>> work continued in this direction, the current proposal does not preclude
>>>>> this.  One of the issues raised to me recently is that after a revoke one
>>>>> will not be able to accomplish such a goal on the remaining ranks – e.g.,
>>>>> ranks will be reassigned.  I am following up very specifically on this
>>>>> question.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  - When the routine returns, can the communicator ever be
>>>>> used again?  If I remember correctly, the communicator is available for
>>>>> point-to-point traffic, but not collective traffic – is this correct?
>>>>>
>>>>> A revoked communicator is unable to support any communication
>>>>> (point-to-point or collective) with the exception of agree and shrink. If
>>>>> this is not clear enough in the current version of the proposal we should
>>>>> definitely address it.
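
The rule stated above can be written down as a toy model, together with the point made elsewhere in this thread that local queries (rank, size, info keys) remain available after a revoke. This is not MPI code: toy_comm, toy_op, toy_check and the TOY_ status codes are hypothetical names invented for the sketch, and the model only encodes which classes of operation are described in this thread as still permitted.

```c
#include <stdbool.h>

/* Classes of operation discussed in the thread (toy model only). */
typedef enum {
    OP_SEND, OP_RECV, OP_COLLECTIVE,  /* ordinary communication */
    OP_AGREE, OP_SHRINK,              /* the two stated exceptions */
    OP_QUERY_LOCAL                    /* rank, size, info keys, ... */
} toy_op;

typedef enum { TOY_SUCCESS = 0, TOY_ERR_REVOKED = 1 } toy_status;

typedef struct { bool revoked; } toy_comm;

/* Revocation is permanent in this model: there is no way to undo it. */
void toy_revoke(toy_comm *c) { c->revoked = true; }

/* Would an operation of this class be accepted on the communicator? */
toy_status toy_check(const toy_comm *c, toy_op op)
{
    if (!c->revoked)
        return TOY_SUCCESS;
    switch (op) {
    case OP_AGREE:
    case OP_SHRINK:
    case OP_QUERY_LOCAL:
        return TOY_SUCCESS;
    default:
        return TOY_ERR_REVOKED;  /* all other communication is refused */
    }
}
```

The model deliberately says nothing about messages already in flight at the time of the revoke, since the thread leaves that to the implementation.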
>>>>>
>>>>> [rich] does this mean all current state (aside from who is alive)
>>>>> associated with the communicator is gone ?
>>>>>
>>>>>
>>>>> All deterministic information is still available (info and
>>>>> attributes). You can look up the group of processes associated with the
>>>>> communicator, as well as the group of failed processes. If what you are
>>>>> looking for is the possible unexpected messages, this is up to the
>>>>> implementation (see below).
>>>>> [rich] don’t understand
>>>>>
>>>>>
>>>>> Can’t one rely on continuing to send pending messages?
>>>>>
>>>>>
>>>>> Not on a revoked communicator. If continuing to exchange messages is a
>>>>> requirement, the communicator should not be revoked.
>>>>> [rich]  How does one then notify other ranks of the errors – does this
>>>>> have to be a user-level protocol ?
>>>>>
>>>>>
>>>>>
>>>>>  Looking forward, if one wants to restart the failed ranks
>>>>> (let’s assume we add support for this), what can be assumed about the
>>>>> “repaired” communicator?  What can’t I assume about this communicator?
>>>>>
>>>>> What you can assume depends on what “repaired” means.
>>>>> Already today one can spawn new processes and reconstruct a communicator
>>>>> identical to the original communicator before any fault. This can be done
>>>>> using MPI dynamics together with the agreement available in the ULFM
>>>>> proposal.
>>>>>
>>>>> [rich] This implies that all outstanding traffic is flushed – is this
>>>>> correct ?
>>>>>
>>>>>
>>>>> This is up to the MPI implementation. This is specified in the first
>>>>> “Advice to implementors” on the second page.
>>>>> [rich]  does not seem like a good idea – users should have guarantees
>>>>> on what they get if they use MPI.
>>>>>
>>>>>   George.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> mpiwg-ft mailing list
>>>>> mpiwg-ft at lists.mpi-forum.org
>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft
>>>>>
>>>>
>>>
>>
>
>