[Mpi3-ft] simplified FT proposal

Sur, Sayantan sayantan.sur at intel.com
Sun Jan 15 20:00:01 CST 2012

Hi Bill,

I agree with your suggestion to have a collective over a communicator that returns a new communicator containing ranks "alive at some point during construction". It provides cleaner semantics. The example was merely trying to use the new MPI_Comm_create_group API that the Forum is considering.

MPI_Comm_check provides a way to form global consensus, in the sense that all ranks in comm did call it. It does not imply anything about the current status of comm, or even the status "just before" the call returns. Many ranks may fail in the interval before the next call to MPI_Comm_check. However, the app/lib using MPI knows a point at which everyone was alive.
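To make the "known point in the past" semantics concrete, here is a rough sketch in plain C. It does not call MPI at all, since MPI_Comm_check is only a proposed routine. Every name in it (sim_comm, sim_group, sim_comm_check, sim_stale_members) is hypothetical and exists only for illustration. It models ranks as liveness flags and shows that the result of a check is a frozen snapshot, which later failures do not update.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_MAX_RANKS 64

/* Hypothetical stand-in for a communicator: one liveness flag per rank. */
typedef struct {
    int  size;
    bool alive[SIM_MAX_RANKS];
} sim_comm;

/* Hypothetical stand-in for the group a check hands back:
 * liveness as observed at the agreement point, frozen from then on. */
typedef struct {
    int  size;
    bool was_alive[SIM_MAX_RANKS];
} sim_group;

/* Start with n ranks, all alive. */
void sim_comm_init(sim_comm *c, int n)
{
    assert(n > 0 && n <= SIM_MAX_RANKS);
    c->size = n;
    for (int r = 0; r < n; r++)
        c->alive[r] = true;
}

/* Models the check: copy the liveness flags into the snapshot and
 * report whether every rank was alive at that moment. */
bool sim_comm_check(const sim_comm *c, sim_group *snapshot)
{
    bool all_ok = true;
    snapshot->size = c->size;
    for (int r = 0; r < c->size; r++) {
        snapshot->was_alive[r] = c->alive[r];
        if (!c->alive[r])
            all_ok = false;
    }
    return all_ok;
}

/* Count ranks the snapshot lists as alive that have failed since. */
int sim_stale_members(const sim_comm *c, const sim_group *snapshot)
{
    int stale = 0;
    for (int r = 0; r < snapshot->size; r++)
        if (snapshot->was_alive[r] && !c->alive[r])
            stale++;
    return stale;
}
```

If a rank's flag is cleared after sim_comm_check has returned true, the snapshot still lists that rank as alive until the next check. That is the gap between "was alive at the agreement point" and "is alive now".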


Sayantan Sur, Ph.D.
Intel Corp.

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of William Gropp
Sent: Sunday, January 15, 2012 2:41 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] simplified FT proposal

One concern that I have with fault-tolerance proposals has to do with races in the specification. This is an area where users often "just want it to work" but getting it right is tricky. In the example here, the "alive_group" is really only that at some moment shortly before "MPI_Comm_check" returns (and possibly not even that). After that, it is really the "group_of_processes_that_was_alive_at_some_point_in_the_past". Since failures are sometimes correlated, this could happen even if the initial failure is rare. An alternate form might be to have a routine, collective over a communicator, that returns a new communicator meeting some definition of "members were alive at some point during construction". It wouldn't guarantee you could use it, but it would have cleaner semantics.


On Jan 13, 2012, at 3:41 PM, Sur, Sayantan wrote:

I would like to argue for a simplified version of the proposal that covers a large percentage of use-cases and resists adding new "features" for the full range of ABFT techniques. It would be better to take a more pragmatic view and not sacrifice the entire FT proposal for the 1% of fringe cases. Most apps just want to do something like this:

for (... really long time ...) {
    MPI_Comm_check(work_comm, &is_ok, &alive_group);
    if (!is_ok) {
        MPI_Comm_create_group(alive_group, ..., &new_comm);
        // re-balance workload and use new_comm in rest of computation
        MPI_Comm_free(&work_comm); // get rid of old comm
        work_comm = new_comm;
    } else {
        // continue computation using work_comm
        // if some proc failed in this iteration, roll back work done in this iteration, go back to loop
    }
}
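As a rough illustration of the control flow above, the following plain C sketch simulates the loop without calling MPI. It is only a model under stated assumptions: work_set, shrink, share, run_loop and the fail_in_iter schedule are all hypothetical names made up for this example. Failures are injected from a fixed schedule, and a simple scan of a failed[] array stands in for the proposed check.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANKS 64

/* Hypothetical model of the working communicator: original rank ids of its members. */
typedef struct {
    int size;
    int ranks[MAX_RANKS];
} work_set;

/* Build the survivor set, analogous to creating new_comm from alive_group. */
int shrink(const work_set *old, const bool *failed, work_set *out)
{
    out->size = 0;
    for (int i = 0; i < old->size; i++)
        if (!failed[old->ranks[i]])
            out->ranks[out->size++] = old->ranks[i];
    return out->size;
}

/* Re-balancing by block distribution: items owned by member pos of size members. */
int share(int total, int size, int pos)
{
    assert(size > 0 && pos >= 0 && pos < size);
    return total / size + (pos < total % size ? 1 : 0);
}

/* Simulate the loop. fail_in_iter[it] is the original rank that fails while
 * iteration it is being computed, or -1 for none. Returns the number of
 * checks performed; the surviving members are written to final_set. */
int run_loop(int n_ranks, int n_iters, const int *fail_in_iter, work_set *final_set)
{
    bool failed[MAX_RANKS] = { false };
    work_set work;
    int checks = 0;
    int it = 0;

    assert(n_ranks > 0 && n_ranks <= MAX_RANKS);
    work.size = n_ranks;
    for (int r = 0; r < n_ranks; r++)
        work.ranks[r] = r;

    while (it < n_iters && work.size > 0) {
        /* Stand-in for the check: is every member of work still alive? */
        bool is_ok = true;
        checks++;
        for (int i = 0; i < work.size; i++)
            if (failed[work.ranks[i]])
                is_ok = false;

        if (!is_ok) {
            /* Replace the working set with the survivors, then loop again. */
            work_set survivors;
            shrink(&work, failed, &survivors);
            work = survivors;
            continue;
        }

        /* Compute iteration it; a rank may fail while it runs. */
        int victim = fail_in_iter[it];
        assert(victim < n_ranks);
        if (victim >= 0 && !failed[victim]) {
            failed[victim] = true;
            continue; /* roll back: do not advance it, go back to the loop */
        }
        it++;
    }
    *final_set = work;
    return checks;
}
```

With 4 ranks, 3 iterations and rank 2 failing during the second iteration, this model performs 5 checks and ends with ranks 0, 1 and 3. The failed iteration is retried once on the smaller set, and share() then spreads the same total work over 3 members instead of 4.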

William Gropp
Director, Parallel Computing Institute
Deputy Director for Research
Institute for Advanced Computing Applications and Technologies
Paul and Cynthia Saylor Professor of Computer Science
University of Illinois Urbana-Champaign
