<div>Bill,</div><div><br></div><div>(A bit of an aside) I completely agree that creating a fault-tolerant application is a tricky endeavor even for the most heroic of developers. Developing a 'self-stabilizing' application is difficult, and will require extensive experimentation to derive appropriate algorithms, applications, and libraries. Some work has happened in this space, but there is a great need for more research. I worry about specifying something that is too restrictive for developers (overreaching our responsibilities, in a sense), and thus stifling what researchers can experiment with. The way I have approached the task of defining a fault-tolerant MPI standard is by asking 'how should this specific interface behave when a process fails?' When there are multiple options, I have tried to gather application preferences and further comments. I then try to weave those specific solutions into a pattern that can be applied throughout the standard for consistency. I believe that the majority of the RTS proposal is correct in this regard, and most of the contention seems to be over specific choices when multiple solutions are on the table (ANY_SOURCE is a great example). To that end, we have to better articulate the decision process. There are other features that seem to be reaching, and we need to assess whether those are necessary or supplementary.</div>

<div><br></div><div>You are correct that the information you receive from an MPI_Comm_check/validate call is only representative of the failures known at the time of the call. So it may well be that additional processes fail just as the validate operation finishes, making the data stale.</div>

<div><br></div><div>The RTS proposal modified/clarified the semantics of MPI_Comm_split() so that, if the communicator is created, it contains only the live processes that called the operation. Of course, additional processes may fail just after creation, but there is nothing we can do about that. Is such an operation what you are looking for?</div>

<div><br></div><div>-- Josh</div><div><br></div><br><div class="gmail_quote">On Sun, Jan 15, 2012 at 5:40 PM, William Gropp <span dir="ltr">&lt;<a href="mailto:wgropp@illinois.edu">wgropp@illinois.edu</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word">One concern that I have with fault tolerant proposals has to do with races in the specification.  This is an area where users often "just want it to work" but getting it right is tricky.  In the example here, the "alive_group" is really only that at some moment shortly before "MPI_Comm_check" returns (and possibly not even that).  After that, it is really the "group_of_processes_that_was_alive_at_some_point_in_the_past".  Since there are sometimes correlations in failures, this could happen even if the initial failure is rare.  An alternate form might be to have a routine, collective over a communicator, that returns a new communicator meeting some definition of "members were alive at some point during construction".  It wouldn't guarantee you could use it, but it would have cleaner semantics.<div>

<div><br></div><div>Bill</div><div><div class="im"><br><div><div>On Jan 13, 2012, at 3:41 PM, Sur, Sayantan wrote:</div><br><blockquote type="cite"><span style="border-collapse:separate;font-family:Helvetica;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:-webkit-auto;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;font-size:medium"><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">

I would like to argue for a simplified version of the proposal that covers a large percentage of use-cases and resists adding new “features” for the full-range of ABFT techniques. It is good if we have a more pragmatic view and not sacrifice the entire FT proposal for the 1% fringe cases. Most apps just want to do something like this:<u></u><u></u></div>

<div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif"><u></u> <u></u></div><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">

for(… really long time …) {<u></u><u></u></div><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">   MPI_Comm_check(work_comm, &is_ok, &alive_group);<u></u><u></u></div>

<div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">   if(!is_ok) {<u></u><u></u></div><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">

       MPI_Comm_create_group(alive_group, …, &new_comm);<u></u><u></u></div><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">      // re-balance workload and use new_comm in rest of computation<u></u><u></u></div>

<div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">       MPI_Comm_free(work_comm); // get rid of old comm<u></u><u></u></div><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">

       work_comm = new_comm;<u></u><u></u></div><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">   } else {<u></u><u></u></div><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">

     // continue computation using work_comm<u></u><u></u></div><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">     // if some proc failed in this iteration, roll back work done in this iteration, go back to loop<u></u><u></u></div>

<div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">   }<u></u><u></u></div><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif">

}<u></u><u></u></div><div style="margin-top:0in;margin-right:0in;margin-left:0in;margin-bottom:0.0001pt;font-size:11pt;font-family:Calibri,sans-serif"><u></u> <u></u></div></span></blockquote></div><br></div><span class="HOEnZb"><font color="#888888"><div>


<span style="text-indent:0px;letter-spacing:normal;font-variant:normal;text-align:auto;font-style:normal;font-weight:normal;line-height:normal;border-collapse:separate;text-transform:none;font-size:medium;white-space:normal;font-family:Helvetica;word-spacing:0px"><div>

<div style="font-size:12px">William Gropp</div><div style="font-size:12px">Director, Parallel Computing Institute</div><div style="font-size:12px">Deputy Director for Research</div><div style="font-size:12px">Institute for Advanced Computing Applications and Technologies</div>

<div style="font-size:12px">Paul and Cynthia Saylor Professor of Computer Science</div><div style="font-size:12px">University of Illinois Urbana-Champaign</div></div><div><br></div></span><br>

</div>

<br></font></span></div></div></div><br>_______________________________________________<br>

mpi3-ft mailing list<br>

<a href="mailto:mpi3-ft@lists.mpi-forum.org">mpi3-ft@lists.mpi-forum.org</a><br>

<a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft" target="_blank">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft</a><br></blockquote></div><br><br clear="all"><div><br></div>-- <br>Joshua Hursey<br>

Postdoctoral Research Associate<br>Oak Ridge National Laboratory<br><a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><br>