[Mpi3-ft] Defining the state of MPI after an error
treumann at us.ibm.com
Thu Sep 23 09:40:02 CDT 2010
A few quick observations:
The constant is MPI_ERRORS_ARE_FATAL, not MPI_ERRORS_ABORT.
The MPI standard only mandates one return code, MPI_SUCCESS. All other
return codes are implementation specific and non-portable. For
portability, MPI documents error classes and a query function that is
passed an implementation defined return code and returns the class.
Assume I allow tags between 0 and 2**15. As an MPI implementor, I am free
to use return code 215 for a negative tag and 399 for one that is above
2**15. The error message I print for 215 and the error message I print
for 399 can be different. If the user calls MPI_ERROR_CLASS() with either
215 or 399 I give back the class MPI_ERR_TAG. The user who checks the RC
of a call to see if it is == MPI_ERR_TAG has written non-portable code.
If I decide return codes 215 and 399 must be in class MPI_CANNOT_CONTINUE
they can no longer be in class MPI_ERR_TAG.
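
The distinction between return codes and error classes can be sketched in plain C. This is only a toy model under stated assumptions, not real MPI: every toy_ name and every numeric class value is hypothetical, and toy_error_class merely plays the role that MPI_ERROR_CLASS plays in the standard. The codes 215 and 399 are the ones from the example above.

```c
#include <assert.h>

/* Hypothetical error classes; the numeric values are made up. */
enum toy_error_class {
    TOY_SUCCESS = 0,    /* stands in for MPI_SUCCESS */
    TOY_ERR_TAG = 4,    /* stands in for MPI_ERR_TAG */
    TOY_ERR_OTHER = 15  /* stands in for any other class */
};

/* Implementation-specific return codes from the example in the text. */
enum { TOY_RC_TAG_NEGATIVE = 215, TOY_RC_TAG_TOO_LARGE = 399 };

/* Assumed upper bound on tags: 2**15. */
#define TOY_TAG_MAX (1 << 15)

/* Returns an implementation-specific code, as a send call might. */
int toy_check_tag(int tag)
{
    if (tag < 0)
        return TOY_RC_TAG_NEGATIVE;
    if (tag > TOY_TAG_MAX)
        return TOY_RC_TAG_TOO_LARGE;
    return TOY_SUCCESS;
}

/* Plays the role of MPI_ERROR_CLASS: maps a return code to its class. */
int toy_error_class(int rc)
{
    switch (rc) {
    case TOY_SUCCESS:
        return TOY_SUCCESS;
    case TOY_RC_TAG_NEGATIVE:
    case TOY_RC_TAG_TOO_LARGE:
        return TOY_ERR_TAG;
    default:
        return TOY_ERR_OTHER;
    }
}

/* Portable style of check: compare the class, never the raw code. */
int toy_is_tag_error(int rc)
{
    return toy_error_class(rc) == TOY_ERR_TAG;
}

/* Small self-check contrasting the two styles of test. */
void toy_self_check(void)
{
    int rc = toy_check_tag(-1);
    assert(rc != TOY_ERR_TAG);     /* raw comparison misses the error */
    assert(toy_is_tag_error(rc));  /* class comparison catches it */
}
```

In this model, comparing the raw return code against the class constant fails for both 215 and 399, while the class lookup reports a tag error for either one. Each code also maps to exactly one class, which is why moving 215 and 399 into a new class would take them out of the tag class.
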
The MPI standard avoids mandating specific error checks. It identifies a
lot of errors and in many cases, says what error class that error is in.
It does not say an implementation MUST detect the error. I would not
violate the standard by skipping the check of whether MPI is initialized.
My customers may demand it but the standard does not. You are introducing
a mandate for one specific sort of error.
I am convinced that the intent of the standard is to require
MPI_ERROR_CLASS, MPI_ERROR_STRING and MPI_ABORT to work after an
ERRORS_RETURN. If this is insufficiently clear, it should probably be
addressed in a standalone ticket. (It is certainly possible for an error,
detected or not, to trash internal state and for that to make one of
these three unusable, but that applies to every MPI call. The standard does
not say MPI_Send must work even if state was scrambled by a wild store.) I
do not know if anybody assumed MPI_INITIALIZED and MPI_FINALIZED must work
after an error. I see no harm in requiring it.
Finally - I do not see that the ticket does anything useful. In
particular, it does not provide any portability improvements I can see.
The MPI implementation could offer a TIMID vs ADVENTUROUS switch:
TIMID - MPI query functions like MPI_COMM_SIZE and MPI_ALLOC_MEM do not
trigger CANNOT_CONTINUE but every other error does.
ADVENTUROUS - no error triggers CANNOT_CONTINUE.
The default would probably need to be TIMID because if the default were
ADVENTUROUS, it would open the implementor to an accusation of failing to
protect the customer. There can be no such accusation now because the
standard does not imply the implementation should protect the customer.
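
The TIMID vs ADVENTUROUS switch described above can be sketched as a single decision function. This is a hypothetical illustration only: the toy_ names, the two-way split of calls into query and other, and the function itself are assumptions made for the sketch, not anything the ticket or the standard defines.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical implementation-level switch. */
enum toy_mode { TOY_TIMID, TOY_ADVENTUROUS };

/* Hypothetical split of MPI calls into two kinds. */
enum toy_call_kind {
    TOY_CALL_QUERY, /* e.g. a call like MPI_COMM_SIZE */
    TOY_CALL_OTHER  /* every other MPI call */
};

/* Returns true if an error in this kind of call would be reported
 * as CANNOT_CONTINUE under the given mode. */
bool toy_triggers_cannot_continue(enum toy_mode mode,
                                  enum toy_call_kind kind)
{
    if (mode == TOY_ADVENTUROUS)
        return false;              /* no error triggers it */
    return kind != TOY_CALL_QUERY; /* TIMID: all but query calls */
}

/* Small self-check of the two extremes. */
void toy_mode_self_check(void)
{
    assert(toy_triggers_cannot_continue(TOY_TIMID, TOY_CALL_OTHER));
    assert(!toy_triggers_cannot_continue(TOY_ADVENTUROUS, TOY_CALL_OTHER));
}
```

The sketch only has these two extremes. Any middle ground would need a finer classification of calls and errors, and each implementor would have to invent that classification.
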
I have no clue from the ticket what would be a reasonable or portable
middle ground. I see the proposal as harmful because any attempt to use
it will produce an illusion of portability when implementors try to find a
middle ground without guidance from the standard.
Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363
Joshua Hursey <jjhursey at open-mpi.org>
"MPI 3.0 Fault Tolerance and Dynamic Process Control working Group"
<mpi3-ft at lists.mpi-forum.org>
09/23/2010 08:57 AM
Re: [Mpi3-ft] Defining the state of MPI after an error
mpi3-ft-bounces at lists.mpi-forum.org
(Bringing a lot of points together in a single response)
The ticket that we are discussing is linked below (also part of the very
first email in this thread):
< snip >
I deleted the discussion because only the ticket counts now.