[Mpi3-ft] Defining the state of MPI after an error
treumann at us.ibm.com
Mon Sep 20 12:03:09 CDT 2010
Sorry Bronis -
I did not intend to ignore your use case.
I did mention that I have no worries about asking MPI implementations to
refrain from blocking future MPI calls after an error is detected. That
was an implicit recognition of your use case.
The MPI standard already forbids an MPI call on one thread from blocking
progress on other threads. I would interpret that to include the case where
a thread is blocked in a collective communication or an MPI_Recv that will
never be satisfied. That is, the blocked MPI call cannot prevent other
threads from using libmpi. Requiring libmpi to release any lock it took
even when doing an error return would be logical, but may not be implied by
what is currently written.
Communicators provide a sort of isolation that keeps stray crap from
failed operations from spilling over (such as an eagerly sent message for
which the matching MPI_Recv failed). If the tool uses its own threads and private
communicators, I agree it is reasonable to ask any libmpi to avoid
sabotaging that communication.
Where I get concerned is when we start talking about affirmative
requirements for distributed MPI state after an error.
Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363
"Bronis R. de Supinski" <bronis at llnl.gov>
"MPI 3.0 Fault Tolerance and Dynamic Process Control working Group"
<mpi3-ft at lists.mpi-forum.org>
09/20/2010 12:46 PM
Re: [Mpi3-ft] Defining the state of MPI after an error
mpi3-ft-bounces at lists.mpi-forum.org
You seem to be ignoring my use case. Specifically, I
have tool threads that use MPI. Their use of MPI should
be unaffected by all of the scenarios that you are raising.
However, the standard provides no way for me to tell if
they work correctly in these situations. I just have to
cross my fingers and hope.
FYI: Your implementation has long met this requirement
(my hopes are not dashed with it). Others have recently
begun to. In any event, I would like some way to tell...
Further, it is useful in many other scenarios to know
that the implementation intends to remain usable. I am not
looking for a promise of correct execution; I am looking
for a promise of best effort and accurate return codes.
On Mon, 20 Sep 2010, Richard Treumann wrote:
> If there is any question about whether these calls are still valid after
> an error with an error handler that returns (MPI_ERRORS_RETURN or a user
> handler that returns), I assume it should be corrected as a trivial
> oversight in the original.
> I would regard the real issue as being the difficulty with assuring the
> state of remote processes.
> There is huge difficulty in making any promise about how an interaction
> between a process that has not taken an error and one that has will
> behave.
> For example, if there were a loop of 100 MPI_Bcast calls and on
> iteration 5, rank 3 uses a bad communicator, what is the proper state?
> Either a sequence number is mandated so the other ranks hang quickly, or a
> sequence number is prohibited so everybody keeps going until the "end",
> when the missing MPI_Bcast becomes critical. Of course, with no sequence
> number, some tasks are stupidly using the iteration n-1 data for their
> iteration n computation.
> Dick Treumann - MPI Team
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846 Fax (845) 433-8363