[Mpi3-ft] simplified FT proposal

Anthony Skjellum tony at runtimecomputing.com
Sun Jan 15 21:05:42 CST 2012


Everyone, I think we need to start from scratch.

We should look for minimal fault-tolerant models that are achievable and
useful.  They may allow for a combination of faults (process and network),
but in the end, as discussed in San Jose, the core construct is:

FTBLOCK
-------
Start_Block(comm)

op   [normal MPI operation on the communicator specified by Start_Block,
      or a subset thereof]
op
op

Test_Block(comm)

Test_Block either succeeds or fails on the whole list of operations.
Combined with ways to reconstruct communicators and add back processes
(easily), this provides a three-level fault-tolerance model (see the
sketch after this list):

a) Simply retry, if the kind of error reported at Test_Block is retryable;
b) Reconstruct the communicator, use algorithmic fault tolerance to
recover lost data, and retry the block; or
c) Drop back to one or more levels of checkpoint-restart.
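
A minimal sketch of this pattern in C.  The names MPIX_Start_block,
MPIX_Test_block, MPIX_Comm_rebuild, and the MPIX_ERR_* codes are
placeholders for the proposed interface, not settled API; the recovery
helpers are application-supplied:

int run_ftblock(MPI_Comm *comm, double *local, double *global, int n)
{
    int rc;
    for (;;) {
        MPIX_Start_block(*comm);

        /* normal MPI operations on the communicator named in Start_block */
        MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, *comm);

        rc = MPIX_Test_block(*comm);    /* succeeds or fails as a unit */
        if (rc == MPI_SUCCESS)
            return MPI_SUCCESS;

        if (rc == MPIX_ERR_RETRYABLE) {
            continue;                         /* (a) simply retry the block */
        } else if (rc == MPIX_ERR_PROC_FAILED) {
            MPIX_Comm_rebuild(comm);          /* (b) rebuild comm, re-add procs */
            recover_data_algorithmically();   /* app-supplied ABFT recovery */
        } else {
            return restart_from_checkpoint(); /* (c) fall back to C/R */
        }
    }
}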

We can envision in this model an unrolling of work, in terms of a parameter
N, if there is a lot of vector work, to allow granularity control as a
function of the fault environment: a larger N amortizes the cost of the
block boundaries, while a smaller N bounds the work that must be redone
when a block fails.
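
For instance (a fragment reusing the hypothetical interface sketched
above; do_vector_work stands in for the application's per-iteration work):

int rc;
MPIX_Start_block(comm);
for (int i = 0; i < N; i++)
    do_vector_work(i);          /* one iteration of the unrolled vector work */
rc = MPIX_Test_block(comm);     /* one commit/retry point per N iterations */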

In some sense, a simpler model that provides for failure detection
a) by MPI itself, and
b) by allowing an application monitor to assert failure asynchronously to
this loop

provides a more general ability to have coverage of faults, including but
not limited to process faults and possible network faults.

It changes the API very little.

It motivates the use of buffers, not zero-copy, to support the fact that
you may have to roll back a series of operations, thereby revealing the
fault-free overhead directly.

Start_Block and Test_Block are collective and synchronizing, like
barriers.  Because we have uncertainty to within a message, multiple
barriers may be required (as George Bosilca mentioned to me in a sidebar
at the meeting).

We try to get this to work, COMPLETELY, and ratify it in MPI 3.x, if we
can.  Once we have this stable intermediate form, we can explore more
options.

I think it is important to recognize that the reconstruction step,
including re-adding processes and making new communicators, may mean
smarter Join operations.  It is clear we need to be able to treat failures
that occur during the recovery process itself, using a second-level loop,
and possibly bombing out to checkpoint if we cannot make net progress on
recovery because of unmodeled error issues.
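
A sketch of that second-level loop, again with the hypothetical
MPIX_Comm_rebuild from above and an application-chosen retry bound:

int recover_with_fallback(MPI_Comm *comm, int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (MPIX_Comm_rebuild(comm) == MPI_SUCCESS)
            return MPI_SUCCESS;        /* recovery completed */
        /* a failure occurred during recovery itself; try again */
    }
    return restart_from_checkpoint();  /* no net progress: fall back */
}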

The testing part leverages all the learning so far, but needn't be
restricted to modeled errors like process faults.  There can be modeled and
unmodeled faults.  Based on which fault comes up, the user application then
has to decide how hard a retry to attempt: whether just to add processes,
whether just to retry the loop, whether to go back to a checkpoint, or
whether to restart the app.  MPI could give advice, based on its
understanding of the fault model, in terms of sufficient conditions for
"working harder" vs. "trying the easiest thing first," for fault models it
understands reasonably well on a given system.

Now, the comments here are a synopsis of part of the sidebars and open
discussion we had in San Jose, distilled a bit.  I want to know why we
can't start with this, succeed with this, implement and test it, and,
having succeeded, do more in a future 3.y (y > x) release, given user
experience.

I am not speaking to the choice of "killing all communicators," as with
FT-MPI, versus "just remaking those you need to remake."  I think we need
to resolve that.  Honestly, groups own the fault property, not
communicators, and all groups held by communicators where the fault
happened should be rebuilt, not all communicators...  Let's argue about
that.

So, my suggestion is to REBOOT the proposal with something along the lines
above, unless you all see that this is no better.

With kind regards,
Tony

On Sun, Jan 15, 2012 at 8:00 PM, Sur, Sayantan <sayantan.sur at intel.com> wrote:

> Hi Bill,
>
> I am in agreement with your suggestion to have a collective over a
> communicator that returns a new communicator containing ranks “alive at
> some point during construction”. It provides cleaner semantics. The
> example was merely trying to utilize the new MPI_Comm_create_group API
> that the Forum is considering.
>
> MPI_Comm_check provides a method to form global consensus, in that all
> ranks in comm did call it. It does not imply anything about the current
> status of comm, or even the status “just before” the call returns. During
> the interval before the next call to MPI_Comm_check, it is possible that
> many ranks fail. However, the app/lib using MPI knows the point where
> everyone was alive.
>
> Thanks.
>
> ===
> Sayantan Sur, Ph.D.
> Intel Corp.
>
>
> From: mpi3-ft-bounces at lists.mpi-forum.org
> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of William Gropp
> Sent: Sunday, January 15, 2012 2:41 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] simplified FT proposal
>
>
> One concern that I have with fault-tolerance proposals has to do with
> races in the specification. This is an area where users often "just want
> it to work", but getting it right is tricky. In the example here, the
> "alive_group" is really only that at some moment shortly before
> MPI_Comm_check returns (and possibly not even that). After that, it is
> really the "group_of_processes_that_was_alive_at_some_point_in_the_past".
> Since there are sometimes correlations in failures, this could happen
> even if the initial failure is rare. An alternate form might be to have
> a routine, collective over a communicator, that returns a new
> communicator meeting some definition of "members were alive at some
> point during construction". It wouldn't guarantee you could use it, but
> it would have cleaner semantics.
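>
> For concreteness, such a routine might look like this (name and
> signature purely illustrative, not a worked-out proposal):
>
>     /* Collective over comm; on success, *newcomm contains exactly those
>      * processes that were all alive at some point during the call. */
>     int MPIX_Comm_validate(MPI_Comm comm, MPI_Comm *newcomm);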
>
> Bill
>
> On Jan 13, 2012, at 3:41 PM, Sur, Sayantan wrote:
>
> I would like to argue for a simplified version of the proposal that
> covers a large percentage of use cases and resists adding new “features”
> for the full range of ABFT techniques. It is better if we take a more
> pragmatic view and do not sacrifice the entire FT proposal for the 1% of
> fringe cases. Most apps just want to do something like this:
>
> for (… really long time …) {
>     MPI_Comm_check(work_comm, &is_ok, &alive_group);
>     if (!is_ok) {
>         MPI_Comm_create_group(alive_group, …, &new_comm);
>         // re-balance the workload and use new_comm in the rest of the
>         // computation
>         MPI_Comm_free(&work_comm);  // get rid of the old comm
>         work_comm = new_comm;
>     } else {
>         // continue the computation using work_comm;
>         // if some proc failed in this iteration, roll back the work done
>         // in this iteration and go back to the top of the loop
>     }
> }
>
> William Gropp
> Director, Parallel Computing Institute
> Deputy Director for Research
> Institute for Advanced Computing Applications and Technologies
> Paul and Cynthia Saylor Professor of Computer Science
> University of Illinois Urbana-Champaign
>



-- 
Tony Skjellum, PhD
RunTime Computing Solutions, LLC
tony at runtimecomputing.com
direct: +1-205-314-3595
cell: +1-205-807-4968