[Mpi3-ft] Proposal Update: MPI_ANY_SOURCE, MPI_Abort, MPI_Kill

Joshua Hursey jjhursey at open-mpi.org
Tue Nov 2 13:37:10 CDT 2010

I updated the run-through stabilization proposal with some of the issues that were brought up during the past two meetings (MPI Forum meeting in Chicago, and teleconf last week).

Below is a high-level list of the changes, followed by a few topics to discuss:
 * Added definitions of 'fail-stop' and 'transient' process failures
 * Cleaned up the failure detection text to make it more precise.
 * Started working on the text clarification for MPI_ANY_SOURCE (see first paragraph of section on 'Ch.3')
 * Touched up the MPI_Gather example
 * Added note on MPI_Abort for further discussion
 * Added start of an interface for MPI_Kill

Proposal is here:

I made a first attempt at some language for the MPI_ANY_SOURCE specification:

After the teleconf last week it seemed that we probably cannot explicitly say much about the order of delivery of pending messages versus an error class when a process fails on the communicator.

What do people think about this?

If it helps for comparison, FT-MPI specifies a couple of message modes (my summaries from [1]):
 FTMPI_MSG_MODE_RESET: All messages from any process are dropped after a failure is detected.
 FTMPI_MSG_MODE_CONT : Only messages from the failed process are dropped after a failure is detected.
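
To make the difference between the two modes concrete, here is a minimal sketch in plain C of how a queue of pending messages might be filtered once a failure is detected. It follows only my summaries above, not FT-MPI's implementation; the names msg_mode, pending_msg, and drop_after_failure are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two FT-MPI message modes summarized above. */
enum msg_mode { MODE_RESET, MODE_CONT };

/* Toy pending message: only the fields needed for the illustration. */
struct pending_msg {
    int source;  /* rank of the sending process */
    int tag;
};

/*
 * Filter the pending queue in place after failed_rank is detected as failed.
 * Returns the number of messages that remain at the front of the queue.
 *   MODE_RESET: every pending message is dropped.
 *   MODE_CONT : only messages sent by failed_rank are dropped.
 */
size_t drop_after_failure(struct pending_msg *queue, size_t count,
                          int failed_rank, enum msg_mode mode)
{
    assert(queue != NULL || count == 0);

    if (mode == MODE_RESET) {
        return 0;
    }

    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (queue[i].source != failed_rank) {
            queue[kept++] = queue[i];  /* keep, preserving relative order */
        }
    }
    return kept;
}
```

With a queue holding messages from ranks 0, 2, 1, 2 and rank 2 failing, the CONT-style filter leaves the messages from ranks 0 and 1 in their original order, while the RESET-style filter leaves nothing.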


It was mentioned during the last call that the intention of MPI_Abort is unclear in the case where you specify MPI_COMM_SELF as the argument. I added a couple of notes to the wiki about this.

It is possible that MPI_Abort should be more carefully defined beyond just this case to, for example, clarify the intention of the communicator argument and its impact on all derived {inter|intra}communicators.


If MPI_Abort is meant to be a distributed 'exit' option that lets the MPI application terminate the whole job, we probably also want the ability to remove a single process from the computation. MPI_Kill lets us terminate a specific process, identified by communicator and rank. That process is then excluded from all communicators it belongs to, as if it had failed.
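
As a rough illustration of the "excluded from all communicators" semantics, here is a toy model in plain C. It does not use MPI and does not reflect the interface text on the wiki; toy_comm and toy_kill are hypothetical names, and a communicator is modeled simply as a table mapping ranks to global process ids.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MEMBERS 8

/* Toy communicator (hypothetical): rank -> global process id, plus a failure flag. */
struct toy_comm {
    int  members[MAX_MEMBERS];  /* members[rank] is a global process id */
    bool failed[MAX_MEMBERS];   /* failed[rank] is true once the process is excluded */
    int  size;                  /* number of ranks in this communicator */
};

/*
 * Model of the proposed semantics: the process named by (comm, rank) is marked
 * as failed in every communicator in all[] that contains it.
 * Returns the number of communicator entries newly marked, or -1 if rank is
 * out of range for comm.
 */
int toy_kill(const struct toy_comm *comm, int rank,
             struct toy_comm *all, size_t ncomms)
{
    assert(comm != NULL && all != NULL);

    if (rank < 0 || rank >= comm->size) {
        return -1;
    }

    /* Resolve the target once; its rank may differ in other communicators. */
    int victim = comm->members[rank];
    int marked = 0;

    for (size_t c = 0; c < ncomms; c++) {
        for (int r = 0; r < all[c].size; r++) {
            if (all[c].members[r] == victim && !all[c].failed[r]) {
                all[c].failed[r] = true;
                marked++;
            }
        }
    }
    return marked;
}
```

The point of the sketch is that the (communicator, rank) pair only names the target; the effect is applied wherever that process appears, which is the part of the semantics I would most like feedback on.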

Do people like this option? Have suggestions on further semantics?


-- Josh

[1] http://icl.cs.utk.edu/projectsfiles/ftmpi/pubs/isc2004-FT-MPI.pdf

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
