[Mpi3-ft] Defining the state of MPI after an error

Mon Sep 20 11:44:59 CDT 2010

Dick:

You seem to be ignoring my use case. Specifically, I
have tool threads that use MPI. Their use of MPI should
be unaffected by all of the scenarios that you are raising.
However, the standard provides no way for me to tell if
they work correctly in these situations. I just have to
cross my fingers and hope.

FYI: Your implementation has long met this requirement
(my hopes are not dashed with it). Others have recently
begun to as well. In any event, I would like some way to tell...

Further, it is useful in many other scenarios to know
that the implementation intends to remain usable. I am not
looking for a promise of correct execution; I am looking
for a promise of best effort and accurate return codes.

Bronis

On Mon, 20 Sep 2010, Richard Treumann wrote:

>
> If there is any question about whether these calls are still valid after an error with an error handler that returns (MPI_ERRORS_RETURN or a user handler):
>
> MPI_Abort
> MPI_Error_string
> MPI_Error_class
>
> I assume it should be corrected as a trivial oversight in the original text.
>
> I would regard the real issue as being the difficulty with assuring the state of remote processes.
>
> There is huge difficulty in making any promise about how an interaction between a process that has not taken an error and one that has will behave.
>
> For example, if there were a loop of 100 MPI_Bcast calls and on iteration 5, rank 3 uses a bad communicator, what is the proper state?  Either a sequence number is mandated so the other ranks hang quickly or a sequence number is prohibited so everybody keeps going until the "end" when the missing MPI_Bcast becomes critical.  Of course, with no sequence number, some tasks are stupidly using the iteration n-1 data for their iteration n computation.
>
> Dick Treumann  -  MPI Team
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846         Fax (845) 433-8363
>
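
The MPI_Bcast scenario in the quoted message can be illustrated with a toy model. The sketch below is plain Python, not MPI. It assumes 4 ranks with rank 0 as root, zero-based iteration numbers, and that each successful MPI_Bcast on a rank simply matches the next unconsumed broadcast from the root (no sequence numbers). The function name, the FIFO queues, and the job size are all hypothetical choices made for illustration.

```python
from collections import deque


def simulate_bcast_loop(iterations=100, bad_rank=3, bad_iteration=5, nranks=4):
    """Toy model of a broadcast loop without sequence numbers.

    Assumption: a rank's k-th successful broadcast call consumes the
    root's k-th broadcast, regardless of which loop iteration it is in.
    """
    # One FIFO per non-root rank holding payloads the root has sent.
    pending = {r: deque() for r in range(1, nranks)}
    # For each rank: loop iteration -> payload actually consumed.
    received = {r: {} for r in range(1, nranks)}

    for it in range(iterations):
        # Root (rank 0) broadcasts; the payload is the iteration number.
        for r in pending:
            pending[r].append(it)
        for r in pending:
            if r == bad_rank and it == bad_iteration:
                # Bad communicator: the call returns an error code and
                # consumes nothing; the rank moves on to the next iteration.
                continue
            received[r][it] = pending[r].popleft()

    # Iterations in which a rank computed with data from another iteration.
    stale = {r: [it for it, data in got.items() if data != it]
             for r, got in received.items()}
    # Broadcasts that were sent but never consumed by each rank.
    unmatched = {r: len(q) for r, q in pending.items()}
    return stale, unmatched


stale, unmatched = simulate_bcast_loop()
print(stale[3][:3], len(stale[3]), unmatched)
```

Under these assumptions, rank 3 silently uses iteration n-1 data from iteration 6 onward, and a single broadcast is left unmatched at the end of the loop. That is the point at which the missing call finally becomes visible.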