<br><font size=2 face="sans-serif">Sorry Bronis - </font>

<br>

<br><font size=2 face="sans-serif">I did not intend to ignore your use

case.</font>

<br>

<br><font size=2 face="sans-serif">I did mention that I have no worries

about asking MPI implementations to refrain from blocking future MPI calls

after an error is detected.  That was an implicit recognition of your

use case.</font>

<br>

<br><font size=2 face="sans-serif">The MPI standard already forbids having

an MPI call on one thread block progress on other threads.  I would

interpret that to include a case where a thread is blocked in a collective

communication or an MPI_Recv that will never be satisfied. That is, the

blocked MPI call cannot prevent other threads from using libmpi.  Requiring

libmpi to release any lock it took even when doing an error return would

be logical but may not be implied by what is currently written.</font>

<br>

<br><font size=2 face="sans-serif">Communicators provide a sort of isolation

that keeps stray crap from failed operations from spilling over (such as

an eagerly sent message for which the MPI_Recv failed).  If the tool uses

its own threads and private communicators, I agree it is reasonable to

ask any libmpi to avoid sabotaging that communication.</font>

<br>

<br><font size=2 face="sans-serif">Where I get concerned is when we start

talking about affirmative requirements for distributed  MPI state

after an error.</font>

<br>

<br><font size=2 face="sans-serif">           

       Dick </font>

<br>

<br><font size=2 face="sans-serif">Dick Treumann  -  MPI Team

          <br>

IBM Systems &amp; Technology Group<br>

Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601<br>

Tele (845) 433-7846         Fax (845) 433-8363<br>

</font>

<br>

<br>

<br>

<table width=100%>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">From:</font>

<td><font size=1 face="sans-serif">"Bronis R. de Supinski" &lt;bronis@llnl.gov&gt;</font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">To:</font>

<td><font size=1 face="sans-serif">"MPI 3.0 Fault Tolerance and Dynamic

Process Control working Group" &lt;mpi3-ft@lists.mpi-forum.org&gt;</font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">Date:</font>

<td><font size=1 face="sans-serif">09/20/2010 12:46 PM</font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">Subject:</font>

<td><font size=1 face="sans-serif">Re: [Mpi3-ft] Defining the state of

MPI after an error</font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">Sent by:</font>

<td><font size=1 face="sans-serif">mpi3-ft-bounces@lists.mpi-forum.org</font></table>

<br>

<hr noshade>

<br>

<br>

<br><tt><font size=2><br>

Dick:<br>

<br>

You seem to be ignoring my use case. Specifically, I<br>

have tool threads that use MPI. Their use of MPI should<br>

be unaffected by all of the scenarios that you are raising.<br>

However, the standard provides no way for me to tell if<br>

they work correctly in these situations. I just have to<br>

cross my fingers and hope.<br>

<br>

FYI: Your implementation has long met this requirement<br>

(my hopes are not dashed with it). Others have begun to<br>

recently. In any event, I would like some way to tell...<br>

<br>

Further, it is useful in many other scenarios to know<br>

that the implementation intends to remain usable. I am not<br>

looking for a promise of correct execution; I am looking<br>

for a promise of best effort and accurate return codes.<br>

<br>

Bronis<br>

<br>

<br>

<br>

On Mon, 20 Sep 2010, Richard Treumann wrote:<br>

<br>

><br>

> If there is any question about whether these calls are still valid

after an error with an error handler that returns (MPI_ERRORS_RETURN or

user handler)<br>

><br>

> MPI_Abort,<br>

> MPI_Error_string<br>

> MPI_Error_class<br>

><br>

> I assume it should be corrected as a trivial oversight in the original

text.<br>

><br>

> I would regard the real issue as being the difficulty with assuring

the state of remote processes.<br>

><br>

> There is huge difficulty in making any promise about how an interaction

between a process that has not taken an error and one that has will behave.<br>

><br>

> For example, if there were a loop of 100 MPI_Bcast calls and on iteration

5, rank 3 uses a bad communicator, what is the proper state?  Either

a sequence number is mandated so the other ranks hang quickly or a sequence

number is prohibited so everybody keeps going until the "end"

when the missing MPI_Bcast becomes critical.  Of course, with no sequence

number, some tasks are stupidly using the iteration n-1 data for their

iteration n computation.<br>

><br>


_______________________________________________<br>

mpi3-ft mailing list<br>

mpi3-ft@lists.mpi-forum.org<br>

</font></tt><a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft"><tt><font size=2>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft</font></tt></a><tt><font size=2><br>

</font></tt>

<br>

<br>