[Mpi3-ft] Defining the state of MPI after an error

Thu Sep 23 09:40:02 CDT 2010

A few quick observations:

0) 
The constant is MPI_ERRORS_ARE_FATAL, not MPI_ERRORS_ABORT

1) 
The MPI standard mandates only one return code, MPI_SUCCESS. All other 
return codes are implementation specific and non-portable.  For 
portability, MPI defines error classes and a query function that is 
passed an implementation-defined return code and returns its class.

Assume I allow tags between 0 and 2**15. As an MPI implementor, I am free 
to use return code 215 for a negative tag and 399 for one that is above 
2**15.  The error message I print for 215 and the error message I print 
for 399 can be different. If the user calls MPI_ERROR_CLASS() with either 
215 or 399, I give back the class MPI_ERR_TAG.  The user who checks the RC 
of a call to see if it is == MPI_ERR_TAG has written non-portable code.

If I decide return codes 215 and 399 must be in class MPI_CANNOT_CONTINUE, 
they can no longer be in class MPI_ERR_TAG.
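
To make the return code vs. error class distinction concrete, here is a 
minimal sketch in plain C. It does not use a real MPI library: every toy_ 
and TOY_ name, and the numeric class values, are made up for illustration. 
Only the codes 215 and 399 and the tag limit of 2**15 come from the example 
above. toy_error_class() stands in for MPI_ERROR_CLASS.

```c
/* Hypothetical error classes; a real MPI takes these values from mpi.h. */
enum toy_class { TOY_SUCCESS = 0, TOY_ERR_TAG = 4, TOY_ERR_OTHER = 15 };

/* Implementation-specific return codes from the example in the text. */
enum toy_rc { TOY_RC_TAG_NEGATIVE = 215, TOY_RC_TAG_TOO_LARGE = 399 };

/* Largest tag this toy implementation accepts (2**15). */
#define TOY_TAG_UB (1 << 15)

/* Stand-in for the tag check an implementation might do inside a send. */
int toy_check_tag(int tag)
{
    if (tag < 0)
        return TOY_RC_TAG_NEGATIVE;
    if (tag > TOY_TAG_UB)
        return TOY_RC_TAG_TOO_LARGE;
    return TOY_SUCCESS;
}

/* Stand-in for MPI_ERROR_CLASS: maps a return code to its class. */
int toy_error_class(int rc)
{
    switch (rc) {
    case TOY_SUCCESS:
        return TOY_SUCCESS;
    case TOY_RC_TAG_NEGATIVE:
    case TOY_RC_TAG_TOO_LARGE:
        return TOY_ERR_TAG;
    default:
        return TOY_ERR_OTHER;
    }
}

/* Non-portable: compares the raw return code with an error class. */
int is_tag_error_nonportable(int rc)
{
    return rc == TOY_ERR_TAG;
}

/* Portable: converts the return code to its class before comparing. */
int is_tag_error_portable(int rc)
{
    return toy_error_class(rc) == TOY_ERR_TAG;
}
```

With these definitions, is_tag_error_portable(toy_check_tag(-1)) is true 
while is_tag_error_nonportable(toy_check_tag(-1)) is false, because the raw 
code 215 is not the class value. Moving 215 and 399 into a different class 
would be a one-line change in toy_error_class(), and the portable check 
would then stop reporting a tag error.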

2)
The MPI standard avoids mandating specific error checks.  It identifies a 
lot of errors and in many cases, says what error class that error is in. 
It does not say an implementation MUST detect the error. I would not 
violate the standard by skipping the check of whether MPI is initialized. 
My customers may demand it but the standard does not.  You are introducing 
a mandate for one specific sort of error.

3)
I am convinced that the intent of the standard is to require 
MPI_ERROR_CLASS, MPI_ERROR_STRING and MPI_ABORT to work after an error is 
returned under MPI_ERRORS_RETURN. If this is insufficiently clear, it 
should probably be addressed in a standalone ticket.  (It is certainly 
possible for an error, detected or not, to trash internal state and for 
that to make one of these three unusable, but that applies to every MPI 
call. The standard does not say MPI_Send must work even if state was 
scrambled by a wild store.) I do not know if anybody assumed 
MPI_INITIALIZED and MPI_FINALIZED must work after an error. I see no harm 
in requiring it.

4) 
Finally - I do not see that the ticket does anything useful.  In 
particular, it does not provide any portability improvements I can see.

The MPI implementation could offer a TIMID vs ADVENTUROUS switch 
(environment variable)

TIMID - MPI query functions like MPI_COMM_SIZE and MPI_ALLOC_MEM do not 
trigger CANNOT_CONTINUE but every other error does.

ADVENTUROUS - no error triggers CANNOT_CONTINUE. 

The default would probably need to be TIMID because if the default were 
ADVENTUROUS, it would open the implementor to an accusation of failing to 
protect the customer. There can be no such accusation now because the 
standard does not imply the implementation should protect the customer. 
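
As a rough sketch of what such a switch could look like inside an 
implementation, the plain C below reads a mode from an environment 
variable and decides whether an error escalates to CANNOT_CONTINUE. 
Nothing here is from the ticket or from any real MPI: the variable name 
TOY_MPI_ERROR_MODE and all toy_ functions are hypothetical, and the split 
into "query" and "non-query" calls is left to the caller because the 
ticket does not define it.

```c
#include <stdlib.h>
#include <string.h>

enum toy_mode { TOY_TIMID, TOY_ADVENTUROUS };

/* Hypothetical environment variable name, chosen for this sketch only. */
#define TOY_MODE_ENV "TOY_MPI_ERROR_MODE"

/* Parse a mode string; anything other than "ADVENTUROUS" means TIMID. */
enum toy_mode toy_mode_from_string(const char *value)
{
    if (value != NULL && strcmp(value, "ADVENTUROUS") == 0)
        return TOY_ADVENTUROUS;
    return TOY_TIMID; /* default, as argued in the text */
}

/* Read the mode from the environment (unset means TIMID). */
enum toy_mode toy_mode_from_env(void)
{
    return toy_mode_from_string(getenv(TOY_MODE_ENV));
}

/*
 * Returns 1 if an error raised in a call should trigger CANNOT_CONTINUE.
 * call_is_query is nonzero for calls the implementor classifies as
 * query functions; that classification is an assumption of the sketch.
 */
int toy_error_triggers_cannot_continue(enum toy_mode mode, int call_is_query)
{
    if (mode == TOY_ADVENTUROUS)
        return 0;          /* no error triggers CANNOT_CONTINUE */
    return !call_is_query; /* TIMID: everything except query calls */
}
```

The sketch only covers the two extremes. Any middle ground would need 
more modes or a per-error table, and that is exactly the part each 
implementor would have to invent independently.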

I have no clue from the ticket what would be a reasonable or portable 
middle ground.  I see the proposal as harmful because any attempt to use 
it will produce an illusion of portability when implementors try to find a 
middle ground without guidance from the standard.

Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363

From:
Joshua Hursey <jjhursey at open-mpi.org>
To:
"MPI 3.0 Fault Tolerance and Dynamic Process Control working Group" 
<mpi3-ft at lists.mpi-forum.org>
Date:
09/23/2010 08:57 AM
Subject:
Re: [Mpi3-ft] Defining the state of MPI after an error
Sent by:
mpi3-ft-bounces at lists.mpi-forum.org

(Bringing a lot of points together in a single response)

The ticket that we are discussing is linked below (also part of the very 
first email in this thread):
  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/err_cannot_continue

< snip >

I deleted the discussion because only the ticket counts now.
