[mpiwg-ft] MPI_Comm_revoke behavior
richardg at mellanox.com
Wed Nov 27 13:54:55 CST 2013
From: mpiwg-ft [mailto:mpiwg-ft-bounces at lists.mpi-forum.org] On Behalf Of George Bosilca
Sent: Wednesday, November 27, 2013 2:48 PM
To: MPI WG Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [mpiwg-ft] MPI_Comm_revoke behavior
On Nov 27, 2013, at 20:33, Richard Graham <richardg at mellanox.com> wrote:
I am thinking about the next step, and have some questions about the semantics of MPI_Comm_revoke().
What next step are you referring to?
[rich] The full recovery stage, which comes after what we are talking about now.
- When the routine returns, can the communicator ever be used again? If I remember correctly, the communicator remains available for point-to-point traffic but not for collective traffic. Is this correct?
A revoked communicator is unable to support any communication (point-to-point or collective), with the exception of agree and shrink. If this is not clear enough in the current version of the proposal, we should definitely address it.
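The rule stated above can be illustrated with a small toy model. This is only a sketch, not an MPI binding: the class `ToyComm`, the exception `RevokedError`, and all method names are hypothetical and exist only for illustration. It assumes that member identities are plain integers, that failures are recorded by hand, and that agreement reduces to a logical AND over the survivors' contributions.

```python
class RevokedError(Exception):
    """Raised when a revoked toy communicator is used for ordinary traffic."""


class ToyComm:
    """Hypothetical single-process stand-in for a communicator."""

    def __init__(self, ranks):
        self.ranks = list(ranks)   # member identities (assumption: plain ints)
        self.failed = set()        # members recorded as failed
        self.revoked = False

    def mark_failed(self, rank):
        self.failed.add(rank)

    def revoke(self):
        # After this call, ordinary communication is refused.
        self.revoked = True

    def _check_usable(self):
        if self.revoked:
            raise RevokedError("communicator has been revoked")

    def send(self, dest, payload):
        # Point-to-point: refused once revoked.
        self._check_usable()
        return ("sent", dest, payload)

    def barrier(self):
        # Collective: refused once revoked.
        self._check_usable()
        return "barrier done"

    def agree(self, flags):
        # Still allowed after revoke: logical AND over surviving members.
        return all(flags[r] for r in self.ranks if r not in self.failed)

    def shrink(self):
        # Still allowed after revoke: new, usable communicator of survivors.
        return ToyComm(r for r in self.ranks if r not in self.failed)
```

In this model, calling `revoke()` on a communicator makes `send` and `barrier` raise `RevokedError`, while `agree` and `shrink` keep working, and the communicator returned by `shrink` is usable again.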
[rich] Does this mean that all current state associated with the communicator (aside from who is alive) is gone? Can one no longer rely on pending messages continuing to be sent?
Looking forward, if one wants to restart the failed ranks (let's assume we add support for this), what can be assumed about the "repaired" communicator? What can't I assume about this communicator?
What you can assume depends on what "repaired" means. Already today, one can spawn new processes and reconstruct a communicator identical to the original communicator as it was before any fault. This can be done using MPI dynamic process management (MPI dynamics) together with the agreement operation available in the ULFM proposal.
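One way to picture the reconstruction described above is as a rank-assignment problem: survivors keep their original ranks and each spawned replacement takes one of the vacated ranks. The sketch below is an assumption about how that bookkeeping might look, not the method defined by the proposal. The function `assign_split_keys` and the `("survivor", r)` / `("spawned", i)` labels are hypothetical. In a real program the resulting values could serve as the `key` argument to `MPI_Comm_split` after the spawned processes have been merged with the survivors.

```python
def assign_split_keys(original_size, survivor_ranks, num_spawned):
    """Map each process to the rank it should hold in the rebuilt communicator.

    Hypothetical helper: survivors are identified by their original rank,
    spawned replacements by their index among the newly spawned processes.
    """
    survivors = sorted(set(survivor_ranks))
    # Ranks left vacant by failed processes, in ascending order.
    missing = [r for r in range(original_size) if r not in survivors]
    if num_spawned != len(missing):
        raise ValueError("need exactly one replacement per failed rank")

    # Survivors keep their original rank as the ordering key.
    keys = {("survivor", r): r for r in survivors}
    # The i-th spawned process fills the i-th vacated rank.
    for i, r in enumerate(missing):
        keys[("spawned", i)] = r
    return keys
```

For example, with an original size of 4 and survivors 0, 1 and 3, a single spawned process is assigned rank 2, so the rebuilt communicator has the same size and rank layout as the original.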
[rich] This implies that all outstanding traffic is flushed. Is this correct?