[Mpi3-ft] Proposal Update: MPI_ANY_SOURCE, MPI_Abort, MPI_Kill

Joshua Hursey jjhursey at open-mpi.org
Tue Nov 2 13:37:10 CDT 2010


I updated the run-through stabilization proposal to address some of the issues raised during the past two meetings (the MPI Forum meeting in Chicago and the teleconf last week).

Below is a high-level list of the changes, followed by a few topics to discuss:
 * Added definitions of 'fail-stop' and 'transient' process failure
 * Cleaned up the failure detection text to make it more precise.
 * Started working on the text clarification for MPI_ANY_SOURCE (see first paragraph of section on 'Ch.3')
 * Touched up the MPI_Gather example
 * Added note on MPI_Abort for further discussion
 * Added start of an interface for MPI_Kill

Proposal is here:
  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization


MPI_ANY_SOURCE
--------------
I made a first attempt at some language for the MPI_ANY_SOURCE specification:
  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization#Ch.3:Point-to-PointCommunication

After the teleconf last week it seemed that we probably cannot say much explicitly about the relative order in which pending messages are delivered and an error class is returned when a process in the communicator fails.

What do people think about this?

If it helps for comparison, FT-MPI specifies a couple of message modes (my summaries from [1]):
 FTMPI_MSG_MODE_RESET: All messages from any process are dropped after a failure is detected.
 FTMPI_MSG_MODE_CONT : Only messages from the failed process are dropped after a failure is detected.
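
To make the difference between the two modes concrete, here is a rough sketch in plain C of how a pending-message queue might be filtered under each one. It follows only my summaries above, not the FT-MPI implementation; the names msg_mode, pending_msg, and drop_after_failure are made up for this example and are not part of FT-MPI or MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch only; not FT-MPI's actual API. */
typedef enum { MSG_MODE_RESET, MSG_MODE_CONT } msg_mode;

typedef struct {
    int source;   /* rank of the sender */
    int payload;  /* stand-in for the message contents */
} pending_msg;

/*
 * Compact the pending queue in place once rank `failed` has been
 * detected as failed. Returns the number of messages that remain.
 */
size_t drop_after_failure(pending_msg *queue, size_t count,
                          int failed, msg_mode mode)
{
    size_t kept = 0;

    assert(queue != NULL || count == 0);

    if (mode == MSG_MODE_RESET) {
        return 0;  /* RESET: messages from every process are dropped */
    }

    /* CONT: keep messages from survivors, drop those from the failed rank */
    for (size_t i = 0; i < count; i++) {
        if (queue[i].source != failed) {
            queue[kept++] = queue[i];
        }
    }
    return kept;
}
```

Under CONT the surviving messages keep their relative order, which is an assumption of this sketch rather than something the summaries above state.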


MPI_Abort
---------
https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization#MPI_ABORT

It was mentioned during the last call that the intention of MPI_Abort is unclear in the case where you specify MPI_COMM_SELF as the argument. I added a couple notes to the wiki about this.

It is possible that MPI_Abort should be defined more carefully beyond just this case, for example to clarify the intention of the communicator argument and its impact on all derived {inter|intra}communicators.


MPI_Kill
--------
https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization#MPI_KILL

If MPI_Abort is meant to be a distributed 'exit' option for the MPI application to terminate the whole job, we probably also want the ability to remove a specific process from the computation. MPI_Kill would let us terminate a specific process, identified by communicator and rank. That process would then be excluded from all communicators that it is a part of, as if it had failed.
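
As a rough illustration of the "excluded from all communicators" part, here is a toy model in plain C. It does not use MPI at all and is not the proposed interface: toy_comm and toy_kill are hypothetical names, and a communicator is modeled as nothing more than a table mapping ranks to global process ids.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_RANKS 8

/* Toy model of a communicator: rank i maps to a global process id. */
typedef struct {
    int  size;
    int  proc[MAX_RANKS];    /* global process id of each rank */
    bool failed[MAX_RANKS];  /* true once that rank is treated as failed */
} toy_comm;

/*
 * Hypothetical stand-in for a kill-by-(communicator, rank) call.
 * Resolves (comm, rank) to a process, then marks that process as failed
 * in every communicator in `all` that contains it.
 * Returns 0 on success, -1 if rank is out of range for comm.
 */
int toy_kill(toy_comm *comm, int rank, toy_comm *all[], size_t ncomms)
{
    if (rank < 0 || rank >= comm->size) {
        return -1;
    }

    int victim = comm->proc[rank];

    for (size_t c = 0; c < ncomms; c++) {
        for (int r = 0; r < all[c]->size; r++) {
            if (all[c]->proc[r] == victim) {
                all[c]->failed[r] = true;
            }
        }
    }
    return 0;
}
```

The point of the sketch is that the target is named relative to one communicator, but the effect is visible through every communicator containing that process, including ones where it has a different rank.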

Do people like this option? Are there suggestions on further semantics?



Thoughts?

-- Josh


[1] http://icl.cs.utk.edu/projectsfiles/ftmpi/pubs/isc2004-FT-MPI.pdf


------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey




