[Mpi3-ft] Fault Tolerance Query Interface

Josh Hursey jjhursey at open-mpi.org
Wed Mar 14 09:19:25 CDT 2012

During the MPI Forum meeting it was requested that the FT WG add the
ability to query the implementation to determine if it supports the
functionality described in the new proposal. Such a query interface
would allow the user to determine whether they should use a code path
including fault recovery techniques, or use an alternative path that
does not include such error checking.

Note that both execution paths must be supported by the MPI
implementation, but if the implementation will never return the error
codes defined in that chapter (and makes 'MPI_Comm_shrink' a
'MPI_Comm_dup' and 'MPI_Comm_invalidate' a noop) then the user is
doing extra work that is not necessary for that implementation.
Further, if the implementation does not support the functionality,
this would provide the user with an early sanity check and allow them
to bail out before wasting time on the machine.

Below are a few suggestions for how to provide this functionality.
This would be a ticket targeted at 3.1. It is not necessary
functionality for the current set of tickets (e.g., 323), just a user
convenience interface.

The MPI Predefined Attribute option sounds the best to me, though the
MPI_T interface extension is interesting as well. What do others think?

-- Josh

MPI Predefined Attribute:
Section 8.1.2 of the MPI 2.2 standard defines a small set of
attributes that are attached to MPI_COMM_WORLD to "describe the
execution environment." We could add a new attribute:
 - MPI_SUPPORT_PROC_FAILURE : Boolean variable that indicates whether
the implementation is able to provide support for the behavior
specified in Chapter 17.

The MPI implementation would have to define this 'key' so users can
portably query it. The 'value' should be set to 'true' if the
functionality is supported; all other values indicate that the
implementation does not support the functionality.
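
As a rough sketch of the attribute option (an illustration, not an agreed
design): a caller would pass MPI_COMM_WORLD and the proposed key to
MPI_Comm_get_attr, then interpret the returned flag and value. Since
MPI_SUPPORT_PROC_FAILURE is not defined in any current mpi.h, the MPI call
is only shown in a comment and the code covers just the interpretation
step. The helper name ft_attr_says_supported and the encoding of 'true' as
the integer 1 are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: interpret the result of an attribute lookup like
 *
 *   int *val, flag;
 *   MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_SUPPORT_PROC_FAILURE, &val, &flag);
 *
 * MPI_SUPPORT_PROC_FAILURE is the key proposed in this mail; it does not
 * exist yet. 'flag' is nonzero when the attribute is set on the
 * communicator, and 'val' then points to its value.
 *
 * Returns 1 only when the attribute is set and its value is 1 (assumed
 * encoding of 'true'). Every other outcome means "not supported".
 */
int ft_attr_says_supported(int flag, const int *val)
{
    /* A value pointer is only expected to be valid when flag is nonzero. */
    assert(!flag || val != NULL);

    if (!flag) {
        return 0;   /* key not set: treat as unsupported */
    }
    return (*val == 1) ? 1 : 0;
}
```

Treating an unset key the same as a 'false' value means an application
running on an implementation that predates the attribute would simply take
the non-fault-tolerant path.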

Explicit MPI Function(s):
MPI_FT_QUERY(bool &supported);
A general query interface to determine if the functionality in Chapter
17 is supported.

We could also explore a per-communication object interface to allow
for future implementation flexibility (though it would be more
difficult for the user to program against).
MPI_COMM_FT_QUERY(MPI_Comm comm, bool &supported);

We could also have an initialization function, similar to threads:
MPI_INIT_FT(required, provided);
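
To illustrate the "similar to threads" idea: MPI_Init_thread reports a
'provided' level that the application compares against what it asked for.
The sketch below applies the same comparison to fault tolerance and picks
a code path, including the early bail-out mentioned above. The level
constants, the path names, and ft_choose_path are all hypothetical; no
MPI_INIT_FT signature has been agreed on.

```c
#include <assert.h>

/* Hypothetical support levels, ordered like the MPI_THREAD_* constants so
 * that a plain integer comparison works. Not part of any MPI standard. */
enum { FT_LEVEL_NONE = 0, FT_LEVEL_PROC_FAILURE = 1 };

/* Code paths an application might choose between. */
enum ft_path { FT_PATH_PLAIN, FT_PATH_RECOVERY, FT_PATH_ABORT };

/*
 * Decide which path to run given what was requested and what the
 * implementation reports. 'recovery_is_mandatory' is nonzero when the
 * application would rather stop early than run without fault recovery.
 */
enum ft_path ft_choose_path(int required, int provided,
                            int recovery_is_mandatory)
{
    assert(required == FT_LEVEL_NONE || required == FT_LEVEL_PROC_FAILURE);

    if (required == FT_LEVEL_NONE) {
        return FT_PATH_PLAIN;        /* nothing was requested */
    }
    if (provided >= required) {
        return FT_PATH_RECOVERY;     /* implementation can deliver */
    }
    /* Requested but not provided: bail out early or fall back. */
    return recovery_is_mandatory ? FT_PATH_ABORT : FT_PATH_PLAIN;
}
```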

MPI_T interface:
Extend the interface, as appropriate and in coordination with the
tools group, to allow the user to query the implementation to
determine support for various error codes. This would possibly allow
us to extend beyond the error codes defined in Chapter 17.

Users could then ask 'how well supported is MPI_ERR_X' or 'what is the
state of the implementation after returning MPI_ERR_X'. For example,
returning MPI_ERR_ARG is not critical in most implementation
configurations (but in some it may be). Such a query would let the user
ask the MPI implementation whether it can continue using MPI after an
error, or what it can do after the error is returned.
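
A very rough sketch of what a per-error-code answer might look like from
the user's side. This is not the MPI_T API; the state names, the table,
and ft_query_error are hypothetical, and the single table entry only
mirrors the MPI_ERR_ARG example above. A real implementation would report
its own values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical answers to "what can I do after this error is returned?" */
enum ft_after_error {
    FT_AFTER_UNKNOWN = 0,     /* implementation makes no statement */
    FT_AFTER_CAN_CONTINUE,    /* MPI remains usable */
    FT_AFTER_MUST_STOP        /* MPI should not be used any further */
};

struct ft_error_info {
    const char *name;             /* error class name, e.g. "MPI_ERR_ARG" */
    enum ft_after_error after;    /* reported post-error state */
};

/* Illustrative data only: one entry matching the example in the text. */
static const struct ft_error_info ft_error_table[] = {
    { "MPI_ERR_ARG", FT_AFTER_CAN_CONTINUE },
};

/* Look up an error class by name; unlisted classes report UNKNOWN. */
enum ft_after_error ft_query_error(const char *name)
{
    size_t i;
    size_t n = sizeof ft_error_table / sizeof ft_error_table[0];

    assert(name != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(ft_error_table[i].name, name) == 0) {
            return ft_error_table[i].after;
        }
    }
    return FT_AFTER_UNKNOWN;
}
```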

This thread might also be interesting to consider for this point:

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
