[mpiwg-ft] MPI_Comm_revoke behavior

Jim Dinan james.dinan at gmail.com
Sun Dec 8 19:14:23 CST 2013


Hi Guys,

Thanks for the detailed responses (and sorry for going off-topic).

Does the incarnation number have to be included in matching?

I assume that we would need to track an incarnation number for each context
ID.  When creating a communicator, we can't guarantee that you'll get the
same incarnation number everywhere for a given context ID.  Do you select
the highest incarnation number from among the processes involved?  I assume
the incarnation counter is meant to be monotonically increasing -- what happens
when it reaches its maximum value?
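
For concreteness, here is a minimal sketch of what folding an incarnation
number into the match key might look like.  This is purely hypothetical and
not taken from any implementation; the field widths and the wrap-around
argument are assumptions:

#include <stdbool.h>
#include <stdint.h>
#include <mpi.h>

/* Hypothetical match key: the incarnation field is bumped (mod 2^16) each
 * time a context ID is recycled, so traffic sent on an old incarnation of
 * a reused context ID can never match receives posted on the new one.
 * Because matching only tests equality, wrap-around at the maximum value
 * is harmless as long as traffic from 2^16 incarnations ago can no longer
 * be in flight. */
typedef struct {
    uint16_t context_id;
    uint16_t incarnation;
    int32_t  tag;
    int32_t  source;
} match_key_t;

static bool keys_match(const match_key_t *posted, const match_key_t *incoming)
{
    return posted->context_id  == incoming->context_id  &&
           posted->incarnation == incoming->incarnation &&
           (posted->tag == MPI_ANY_TAG || posted->tag == incoming->tag) &&
           (posted->source == MPI_ANY_SOURCE ||
            posted->source == incoming->source);
}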

 ~Jim.


On Thu, Dec 5, 2013 at 1:55 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:

> Yes, you can free the revoked communicator, so you can also reuse the
> context ID. There are a few ways you can handle this. One is to add an
> incarnation number to the communicator or process (which is the route
> chosen by the UTK ULFM implementation). Another is that once you know a
> process has failed, you ignore all future messages from that process (this
> causes issues if you want to deal with transient failures, but we’ve
> defined the scope of the work to exclude those). I’m sure there are more,
> possibly cleverer ways of solving this but those came to mind first.
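>
> A minimal sketch of that second approach (all names below are invented for
> illustration, not taken from any real implementation): the progress engine
> keeps a set of ranks known to have failed and silently drops anything that
> still arrives from them, so a recycled context ID can never match their
> stale traffic.
>
> #include <stdbool.h>
>
> #define MAX_PROCS 4096               /* illustrative upper bound */
>
> /* Incoming fragment header, reduced to the fields that matter here. */
> typedef struct { int source_rank; int context_id; int tag; } fragment_t;
>
> /* Marked true by the failure detector when a rank is declared dead. */
> static bool failed_ranks[MAX_PROCS];
>
> /* Called before matching: stale traffic from dead ranks is discarded. */
> static bool accept_fragment(const fragment_t *frag)
> {
>     return !failed_ranks[frag->source_rank];
> }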
>
> Wesley
>
>
> On Dec 5, 2013, at 12:39 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>
> This is a little off-topic from the current discussion -- could revoking
> communicators lead to leaking context IDs?  After revoking a
> communicator, can you safely free it and then reuse the same context ID
> without concerns about erroneous messages arriving with that context ID?
>  Or is that context ID dead for the remainder of the execution?
>
>  ~Jim.
>
>
> On Thu, Dec 5, 2013 at 10:49 AM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>
>> I think there’s still confusion here between the current FT proposal and
>> the roadmap for previous proposals. I think the previous proposal was
>> designed to be a stopgap solution that would allow MPI to stabilize, but
>> not necessarily fully recover without a lot of work. The current proposal
>> provides all the tools necessary for both stabilization and recovery. Thus
>> we haven’t been discussing the “next step”, because I think most of us see
>> the next step not as a standardization effort for failure recovery, but as
>> something outside of the standard, such as libraries that will make FT
>> easier to use.
>>
>> It’s true that MPI_COMM_REVOKE will make a current communicator unusable
>> for further communication. Period. However, you can still query all of the
>> local data about the communicator (rank, size, topology, info keys, etc.)
>> to allow you to reconstruct a new communicator. That reconstruction is very
>> much possible with the current proposal. It’s obviously not cheap, but I
>> think we all know that FT recovery isn’t necessarily cheap. I think
>> somewhere along the way we even had an example of how you could reconstruct
>> a communicator with processes retaining their original ranks (though if
>> you’re not going to replace the failed processes, I’m not sure what use it
>> is to retain your rank since your algorithm will need to be able to handle
>> such things anyway).
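>>
>> As a concrete illustration of that reconstruction, here is a minimal
>> sketch, assuming the MPIX_-prefixed entry points of the ULFM prototype and
>> omitting all error handling: the survivors shrink the revoked communicator
>> and then re-order themselves by their original rank.
>>
>> #include <mpi.h>
>> #include <mpi-ext.h>   /* MPIX_Comm_shrink in the ULFM prototype */
>>
>> /* Rebuild a usable communicator from a revoked one.  The survivors keep
>>  * their relative order from the original communicator; their absolute
>>  * ranks stay identical only if no lower-ranked process failed.  The
>>  * caller remains responsible for freeing the revoked communicator. */
>> MPI_Comm rebuild_after_revoke(MPI_Comm revoked, int old_rank)
>> {
>>     MPI_Comm shrunk, rebuilt;
>>
>>     /* Shrink still works on a revoked communicator: it returns a new
>>      * communicator containing only the surviving processes. */
>>     MPIX_Comm_shrink(revoked, &shrunk);
>>
>>     /* Order the survivors by their rank in the original communicator so
>>      * that algorithmic state keyed by rank stays meaningful. */
>>     MPI_Comm_split(shrunk, 0, old_rank, &rebuilt);
>>
>>     MPI_Comm_free(&shrunk);
>>     return rebuilt;
>> }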
>>
>> MPI_COMM_REVOKE is the mechanism by which a process notifies other ranks
>> of errors. No user-level protocol is necessary, though there are specific
>> cases where a user-level approach might be a better solution depending on
>> the communication pattern. When you call MPI_COMM_REVOKE, you give up on any
>> remaining (incoming) traffic that’s outstanding on the communicator. For
>> remote processes, they too will not receive any messages after they return
>> the error code MPI_ERR_REVOKED. This doesn’t preclude them from delaying
>> reporting MPI_ERR_REVOKED while they flush any messages that have already
>> been received; however, once the error code is returned, the communicator
>> is dead. Users have to be able to reason about this, and I don’t think the
>> semantics are unclear. Once revoke is called, that communicator is hosed. You
>> can create a new communicator and use it to decide what’s necessary to
>> recover your application (algorithm state, data repair, etc.). That’s the
>> guarantee that users get: once the communicator is revoked, all
>> messages/requests that have not been received are cancelled and will never
>> be completed. Any other guarantee invites lots of trouble as you have to
>> deal with race conditions involving who receives the revoke notification
>> before any other messages and whether or not all of the links necessary to
>> complete a message transfer are still available. The entire point of the
>> REVOKE call is to provide a last-resort operation to prevent deadlock. It’s entirely
>> possible that an application won’t need to use the revoke if there aren’t
>> complex communication dependencies. It may be possible to take care of
>> things at the user level in a way that’s cheaper than using REVOKE.
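>>
>> A minimal sketch of that last-resort pattern, assuming MPI_ERRORS_RETURN
>> has been set on the communicator and using the MPIX_-prefixed names and
>> error classes of the ULFM prototype (do_recovery() is a placeholder for
>> application-specific repair, e.g. the shrink-and-rebuild step above):
>>
>> #include <mpi.h>
>> #include <mpi-ext.h>   /* MPIX_Comm_revoke, MPIX_ERR_* in the ULFM prototype */
>>
>> extern void do_recovery(MPI_Comm comm);   /* hypothetical application hook */
>>
>> void exchange_or_revoke(MPI_Comm comm, double *buf, int n, int partner)
>> {
>>     MPI_Status st;
>>     int ec = MPI_SUCCESS;
>>     int rc = MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, partner, 0,
>>                                   partner, 0, comm, &st);
>>     if (rc != MPI_SUCCESS)
>>         MPI_Error_class(rc, &ec);
>>
>>     if (ec == MPIX_ERR_PROC_FAILED)
>>         MPIX_Comm_revoke(comm);   /* kick everyone out of blocking calls */
>>
>>     if (ec == MPIX_ERR_PROC_FAILED || ec == MPIX_ERR_REVOKED)
>>         do_recovery(comm);        /* rebuild a communicator and repair state */
>> }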
>>
>> The entire basis of this proposal is that MPI alone can’t provide a fault
>> tolerance solution that will cheaply stabilize MPI, repair communicators,
>> and notify all processes of failures. It’s possible to do all of these
>> things in a way that’s not outrageously expensive, but they require
>> cooperation from the application (or from libraries that act on the
>> application’s behalf). It’s possible to do many of those things entirely
>> inside MPI, but that would be so costly that it would never be approved.
>> It’s much more effective to do them on top of MPI. We’re just providing the
>> low-level tools necessary to make that happen.
>>
>> Wesley
>>
>> On Dec 5, 2013, at 8:14 AM, Richard Graham <richardg at mellanox.com> wrote:
>>
>>
>>
>> *From:* mpiwg-ft [mailto:mpiwg-ft-bounces at lists.mpi-forum.org] *On Behalf Of* George Bosilca
>> *Sent:* Wednesday, November 27, 2013 3:35 PM
>> *To:* MPI WG Fault Tolerance and Dynamic Process Control working Group
>> *Subject:* Re: [mpiwg-ft] MPI_Comm_revoke behavior
>>
>>
>>
>> On Nov 27, 2013, at 20:54 , Richard Graham <richardg at mellanox.com> wrote:
>>
>>
>>  On Nov 27, 2013, at 20:33 , Richard Graham <richardg at mellanox.com>
>> wrote:
>>
>>
>>
>>  I am thinking about the next step, and have some questions on the
>> semantics of MPI_Comm_revoke()
>>
>> What next step are you referring to?
>>
>> [rich] To the full recovery stage.  Post what we are talking about now.
>>
>>
>> Full recovery stage? Can you give a little more detail here, please?
>>
>> [rich] the original intent was to allow for full restoration of
>> communicators after failure, with minimal impact on those ranks that did
>> not fail (don’t want to get into what that means now …).  Those goals were
>> reduced for pragmatic reasons.  I want to make sure that if/when work
>> continues in this direction, the current proposal does not preclude it.
>> One of the issues raised to me recently is that after a revoke one
>> will not be able to accomplish such a goal on the remaining ranks – e.g.,
>> ranks will be reassigned.  I am following up very specifically on this
>> question.
>>
>>
>>
>>
>>
>>  -  When the routine returns, can the communicator ever be used
>> again?  If I remember correctly, the communicator is available for
>> point-to-point traffic, but not collective traffic – is this correct?
>>
>> A revoked communicator is unable to support any communication
>> (point-to-point or collective) with the exception of agree and shrink. If
>> this is not clear enough in the current version of the proposal we should
>> definitively address it.
>>
>> [rich] does this mean all current state (aside from who is alive)
>> associated with the communicator is gone?
>>
>>
>> All deterministic information is still available (info and attributes).
>> You can retrieve the group of processes associated with the communicator,
>> as well as the group of failed processes. If what you are worried about is
>> possible unexpected messages, that is up to the implementation (see below).
>> [rich] don’t understand
>>
>>
>> Can’t we rely on continuing to send pending messages?
>>
>>
>> Not on a revoked communicator. If continuing to exchange messages is a
>> requirement, the communicator should not be revoked.
>> [rich]  How does one then notify other ranks of the errors – does this
>> have to be a user-level protocol?
>>
>>
>>
>>  -  Looking forward, if one wants to restart the failed ranks
>> (let’s assume we add support for this), what can be assumed about the
>> “repaired” communicator?  What can’t I assume about this communicator?
>>
>> What you can assume depends on the meaning of “repaired”. Already
>> today one can spawn new processes and reconstruct a communicator identical
>> to the original communicator before any fault. This can be done using MPI
>> dynamics together with the agreement available in the ULFM proposal.
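>>
>> A rough sketch of that route (survivor side only; the agreement step,
>> transferring state to the replacements, and all error handling are
>> omitted, and n_failed/my_old_rank are placeholders the application must
>> supply, assuming each replacement is told which failed rank it takes over):
>>
>> #include <mpi.h>
>>
>> /* Spawn replacements for the failed processes, merge them into the
>>  * survivors' communicator, and put every process back at its
>>  * pre-failure rank. */
>> MPI_Comm respawn_and_merge(MPI_Comm survivors, int n_failed,
>>                            int my_old_rank, char *cmd, char **argv)
>> {
>>     MPI_Comm intercomm, merged, repaired;
>>
>>     MPI_Comm_spawn(cmd, argv, n_failed, MPI_INFO_NULL, 0,
>>                    survivors, &intercomm, MPI_ERRCODES_IGNORE);
>>     MPI_Intercomm_merge(intercomm, 0, &merged);
>>     MPI_Comm_split(merged, 0, my_old_rank, &repaired);
>>
>>     MPI_Comm_free(&intercomm);
>>     MPI_Comm_free(&merged);
>>     return repaired;
>> }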
>>
>> [rich] This implies that all outstanding traffic is flushed – is this
>> correct ?
>>
>>
>> This is up to the MPI implementation. This is specified in the first
>> “Advice to implementors” on the second page.
>> [rich]  does not seem like a good idea – users should have guarantees on
>> what they get if they use MPI.
>>
>>   George.
>>
>>