[Mpi3-ft] simplified FT proposal

Josh Hursey jjhursey at open-mpi.org
Tue Jan 17 15:15:13 CST 2012


Tony,

I have to admit that I am getting a bit lost in the multipart presentation
of your proposal, which makes it difficult for me to reason about it as a
whole. Can you synthesize your proposal (either as a separate email
thread or on the wiki) and maybe we can discuss it further on the teleconf
and mailing list? That will give us a single place to point to and refine
during the subsequent discussions.

Thanks,
Josh

On Mon, Jan 16, 2012 at 4:18 PM, Anthony Skjellum <tony at runtimecomputing.com
> wrote:

> More
>
> #5)
>
> A parallel thread, running on the same group as the communicator used for
> the user program in FTBLOCK, would be free to run a gossip protocol,
> instantiated
> a) by the user at his/her choice
> b) by MPI at its choice, by user demand
>
> This parallel inspector thread propagates error state, and accepts input
> to the error state by allowing user programs to assert error state within
> the FTBLOCK.
>
> You can think of this parallel thread as the progress engines cooperating
> with gossip, as a completely optional activity created by the user, or as
> a part of the FT proposal.  As long as there are local ways to add error
> state, we can write this as a layered library for the sake of making sure
> that local errors are propagated (remembering that MPI is returning local
> errors to the user inside his/her FTBLOCK).
>
> If we want the implementation to provide/support this kind of FT-friendly
> error propagation, that is a bigger step, which I am not advocating as
> required.
> I think it would be nice to have this be part of a fault tolerant
> communication space.  But, I am OK with users doing this for themselves,
> esp. because they
> can control granularity better.
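>
> A minimal sketch of what such a user-level inspector could look like,
> layered entirely above MPI (it uses a plain allreduce in place of a true
> gossip protocol, and assumes MPI_THREAD_MULTIPLE, a duplicated
> communicator, and hypothetical helper names ft_assert_error() /
> ft_error_seen(), which are illustrations, not proposed API):
>
>   #include <mpi.h>
>   #include <pthread.h>
>
>   static volatile int local_err  = 0;  /* set by the app or on MPI error returns */
>   static volatile int global_err = 0;  /* OR of error state across the group */
>
>   void ft_assert_error(void) { local_err = 1; }   /* ABFT hook for the app */
>   int  ft_error_seen(void)   { return global_err; }
>
>   /* Inspector thread: periodically ORs the local error flags across the
>    * group on its own duplicated communicator, so error knowledge spreads
>    * even while the main thread is inside the FTBLOCK.  Shutdown handling
>    * and backoff (the user's granularity control) are omitted for brevity. */
>   void *inspector(void *arg)
>   {
>       MPI_Comm gossip_comm = *(MPI_Comm *)arg;    /* from MPI_Comm_dup */
>       while (!global_err) {
>           int mine = local_err, all = 0;
>           MPI_Allreduce(&mine, &all, 1, MPI_INT, MPI_LOR, gossip_comm);
>           if (all) global_err = 1;
>       }
>       return NULL;
>   }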
>
> #6) I think MPI_COMM_WORLD is a bummer in the FT world.  We don't want it
> to hang around very long after we get going.  If we are really working on
> subgroups, and these subgroups form a hierarchical graph, rather than an
> all-to-all virtual topology, I don't want to reconstruct MPI_COMM_WORLD
> after
> the first process breaks.  So - as an example of the pain of MPI scalability -
> the build-down model of MPI-1 is less good for this than a build-up
> model, where we find a way for groups to rendezvous.  Obviously, for
> convenience, the all-to-all virtual topology at the onset of MPI_Init() is
> nice,
> but I am assuming that errors may happen quite quickly.
>
> MPI_Init() with no failures is only good for a limited window of time,
> given failure rates.  During this time, we would like to "turn off
> MPI_COMM_WORLD" unless we really need it, or at least never have to
> reconstruct it if we don't need it.
>
> So, we need to agree on a fault-friendly MPI_Init()...
>
> An FTBLOCK-like idea could surround MPI_Init(), where the external rules
> for spawning MPI, or the spawn command that creates the world, create an
> effective communicator.  We should generalize MPI_Init() to support
> outcomes other than pure success, such as partial success (e.g., a smaller
> world than stipulated).
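>
> As a purely hypothetical sketch of what such a generalized startup could
> look like (MPIX_Init_world and MPIX_PARTIAL_SUCCESS are placeholder names
> for illustration only, not proposed API):
>
>   /* Hypothetical: initialize, tolerating launch-time failures, and report
>    * the communicator of the ranks that actually came up. */
>   MPI_Comm world_that_made_it;
>   int rc = MPIX_Init_world(&argc, &argv, &world_that_made_it);
>   if (rc == MPIX_PARTIAL_SUCCESS) {
>       /* smaller world than stipulated: rebalance work before starting */
>   }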
>
> A really good approach would be to only create the actual communicators
> you really need to run your process, not the virtual-all-to-all world of
> MPI-1 and MPI-2.
> But, that is a heresy, so forgive me.
>
> Tony
>
> On Mon, Jan 16, 2012 at 2:46 PM, Anthony Skjellum <
> tony at runtimecomputing.com> wrote:
>
>> Sayantan,
>>
>> Here are my further thoughts, which hopefully include an answer to your
>> questions :-)
>>
>> 1) Everything is set to MPI_ERRORS_RETURN; the goal is to get local errors
>> back if they are available.
>>
>> 2) I would emphasize non-blocking operations, but blocking operations
>> implemented with an internal timeout could return a timeout-type error.
>>
>> 3) You don't have to return the same return code, or the same results, in
>> all processes in the communicator; you can get erroneous results or local
>> failures, and the functions are also allowed to produce incorrect results
>> [and we should then discuss what error reporting means here...  I am happy
>> with local errors returned where known, recognizing that those processes
>> may die before the bottom of the block.  However, I also expect the
>> implementation to do its best to propagate error-state knowledge within
>> this FTBLOCK, based organically on ongoing communication or on gossip, if
>> an implementation so chooses.]
>>
>>    Also, because we assume that there is also algorithmic fault tolerance
>> at work, local errors may be raised by the application because it is doing
>> checks for validity, etc.
>>
>>   So, either MPI or the application may raise local errors prior to the
>> bottom of the FTBLOCK, and the bottom of the block must be allowed to fail
>> based on ABFT inputs from the application to MPI, not just based on MPI's
>> opinion.
>>
>> 4) If you are willing to do everything nonblocking, then I can describe
>> the test at the bottom of the FTBLOCK as follows:
>>
>> The test operation at the bottom of the FTBLOCK is effectively a
>> generalized WAIT_ALL that completes, or fails to complete, all the
>> outstanding requests, returning errors related to the faults observed and
>> providing a unified 0/1 success/failure state consistently across the
>> group of comm [or the surviving members thereof].
>>
>> In my view, the application as well as MPI can contribute error state as
>> input to the FTBLOCK test.
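>>
>> A rough sketch of those semantics, assuming everything is nonblocking
>> (MPIX_Start_block, MPIX_Test_block, and abft_check_failed() are placeholder
>> names for the San Jose FTBLOCK idea and the application's own validity
>> checks; none of this is settled API):
>>
>>   #include <mpi.h>
>>
>>   /* Placeholder prototypes for the hypothetical FTBLOCK calls. */
>>   int MPIX_Start_block(MPI_Comm comm);
>>   int MPIX_Test_block(MPI_Comm comm, int count, MPI_Request reqs[],
>>                       int app_err, int *block_ok);
>>   int abft_check_failed(void);
>>
>>   void ftblock_step(MPI_Comm comm, double *sendbuf, double *recvbuf,
>>                     int n, int left, int right)
>>   {
>>       MPI_Request req[2];
>>       int block_ok;
>>
>>       MPIX_Start_block(comm);                  /* collective, barrier-like */
>>
>>       MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
>>       MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, comm, &req[1]);
>>
>>       /* Generalized WAIT_ALL: completes (or gives up on) every outstanding
>>        * request, folds in the app-asserted ABFT error, and returns one
>>        * agreed 0/1 success flag across the surviving group. */
>>       MPIX_Test_block(comm, 2, req, abft_check_failed(), &block_ok);
>>
>>       if (!block_ok) {
>>           /* retry the block, rebuild the communicator, or fall back to a
>>            * checkpoint, per the three-level model */
>>       }
>>   }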
>>
>> Also, an application that gets local errors inside the loop is already
>> ready to punt at that point and jump to the bottom of the loop.  Let's
>> assume, for now, that it is required to do all the operations before
>> getting to the bottom of the loop, and we just allow that some of these may
>> return further errors (I am trying to keep it simple), with MPI-like rules
>> of everyone attempting all the operations.  If we can get this to work, we
>> can weaken it later.
>>
>> There is no good way to describe a mix of some BLOCKING and some
>> nonblocking operations, because we have no descriptor to tell us if
>> something failed that previously returned... and did not give a local
>> error, so I am not going to pursue BLOCKING for now.  Let's assume we
>> cannot do BLOCKING, and weaken this later, if we can get a consistent
>> solution using all nonblocking operations.
>>
>> Please tell me what you think.
>>
>> Thanks for responding!
>>
>> Tony
>>
>>
>> On Mon, Jan 16, 2012 at 12:39 PM, Sur, Sayantan <sayantan.sur at intel.com> wrote:
>>
>>>  Hi Tony,
>>>
>>> In the example semantics you mentioned, are the “ops” required to return
>>> the same result on all processors? Although this doesn’t change the “op”
>>> API, it does change the completion semantics of almost all MPI ops. I hope
>>> I am correctly interpreting your message.
>>>
>>> Thanks.
>>>
>>> ===
>>> Sayantan Sur, Ph.D.
>>> Intel Corp.
>>>
>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:
>>> mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Anthony Skjellum
>>> Sent: Sunday, January 15, 2012 7:06 PM
>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>> Subject: Re: [Mpi3-ft] simplified FT proposal
>>>
>>> Everyone, I think we need to start from scratch.
>>>
>>> We should look for minimal fault-tolerant models that are achievable and
>>> useful.  They may allow for a combination of faults (process and network),
>>> but in the end, as discussed in San Jose:
>>>
>>> FTBLOCK
>>> --------------
>>> Start_Block(comm)
>>>
>>> op [normal MPI operation on the communicator specified by Start_Block, or
>>> a subset thereof]
>>> op
>>> op
>>>
>>> Test_Block(comm)
>>>
>>> This either succeeds or fails on the whole list of operations.  Followed
>>> by ways to reconstruct communicators and add back processes (easily), it
>>> provides for a 3-level fault-tolerant model:
>>>
>>> a) Simply retry, if the kind of error at the Test_Block is retryable
>>> b) Simply reconstruct the communicator, use algorithmic fault tolerance
>>> to get the lost data, and retry the block
>>> c) Drop back to 1 or more levels of checkpoint-restart.
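>>>
>>> As a sketch of how those three levels might nest (the error classes and
>>> helper functions here are placeholders for illustration, not proposed
>>> names):
>>>
>>>   /* Placeholder error classes and helpers, for illustration only. */
>>>   enum { FT_OK, FT_RETRYABLE, FT_PROC_FAILED, FT_UNMODELED };
>>>   int  try_block(MPI_Comm comm);           /* Start_Block; ops; Test_Block */
>>>   void rebuild_comm(MPI_Comm *comm);       /* re-add procs, remake comm    */
>>>   void abft_recover_lost_data(MPI_Comm comm);
>>>   void restore_from_checkpoint(MPI_Comm *comm);
>>>
>>>   for (;;) {
>>>       int rc = try_block(comm);
>>>       if (rc == FT_OK) break;                /* whole block succeeded    */
>>>       if (rc == FT_RETRYABLE) continue;      /* (a) just retry the block */
>>>       if (rc == FT_PROC_FAILED) {            /* (b) rebuild and recover  */
>>>           rebuild_comm(&comm);
>>>           abft_recover_lost_data(comm);
>>>           continue;
>>>       }
>>>       restore_from_checkpoint(&comm);        /* (c) checkpoint-restart   */
>>>   }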
>>>
>>> We can envision in this model an unrolling of work, in terms of a
>>> parameter N, if there is a lot of vector work, to allow granularity
>>> control as a function of the fault environment.
>>>
>>> In some sense, a simpler model that provides for detection
>>>
>>> a) by MPI
>>> b) by allowing an application monitor to assert failure asynchronously to
>>> this loop
>>>
>>> provides a more general ability to have coverage of faults, including
>>> but not limited to process faults and possible network faults.
>>>
>>> It changes the API very little.
>>>
>>> It motivates the use of buffers, not zero copy, to support the fact that
>>> you may need to roll back a series of operations, thereby revealing the
>>> fault-free overhead directly.
>>>
>>> Start_Block and Test_Block are collective and synchronizing, like
>>> barriers.  Because we have uncertainty to within a message, multiple
>>> barriers may be needed (as mentioned by George Bosilca to me in a sidebar
>>> at the meeting).
>>>
>>> We try to get this to work, COMPLETELY, and ratify this in MPI 3.x, if
>>> we can.  Once we have this stable intermediate form, we explore more
>>> options.
>>>
>>> I think it is important to recognize that the reconstruction step,
>>> including re-adding processes and making new communicators, may mean
>>> smarter Join operations.  It is clear we need to be able to treat failures
>>> during the recovery process, and use a second-level loop, possibly bombing
>>> out to checkpoint, if we cannot make net progress on recovery because of
>>> unmodeled error issues.
>>>
>>> The testing part leverages all the learning so far, but needn't be
>>> restricted to modeled errors like process faults.  There can be modeled
>>> and unmodeled faults.  Based on what fault comes up, the user application
>>> then has to decide how hard a retry to do: whether just to add processes,
>>> whether just to retry the loop, whether to go to a checkpoint, or whether
>>> to restart the app.  MPI could give advice, based on its understanding of
>>> the fault model, in terms of sufficient conditions for "working harder"
>>> vs. "trying the easiest" for fault models it understands somewhat for a
>>> given system.
>>>
>>> Now, the comments here are a synopsis of part of the sidebars and open
>>> discussion we had in San Jose, distilled a bit.  I want to know why
>>> we can't start with this, succeed with this, implement and test it, and,
>>> having succeeded, do more in a future 3.y (y > x) release, given user
>>> experience.
>>>
>>> I am not speaking to the choice of "killing all communicators," as with
>>> FT-MPI, versus "just remaking those you need to remake."  I think we need
>>> to resolve that.  Honestly, groups own the fault property, not
>>> communicators, and all groups held by communicators where the fault
>>> happened should be rebuilt, not all communicators...  Let's argue on that.
>>>
>>> So, my suggestion is to REBOOT the proposal with something along the lines
>>> above, unless you all see this as no better.
>>>
>>> With kind regards,
>>>
>>> Tony
>>>
>>> On Sun, Jan 15, 2012 at 8:00 PM, Sur, Sayantan <sayantan.sur at intel.com>
>>> wrote:
>>>
>>> Hi Bill,
>>>
>>> I am in agreement with your suggestion to have a collective over a
>>> communicator that returns a new communicator containing ranks “alive at
>>> some point during construction”. It provides cleaner semantics. The
>>> example was merely trying to utilize the new MPI_Comm_create_group API
>>> that the Forum is considering.
>>>
>>> MPI_Comm_check provides a method to form global consensus that all
>>> ranks in comm did call it. It does not imply anything about the current
>>> status of comm, or even the status “just before” the call returns. During
>>> the interval before the next call to MPI_Comm_check, it is possible that
>>> many ranks fail. However, the app/lib using MPI knows the point where
>>> everyone was alive.
>>>
>>> Thanks.
>>>
>>> ===
>>> Sayantan Sur, Ph.D.
>>> Intel Corp.
>>>
>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:
>>> mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of William Gropp
>>> Sent: Sunday, January 15, 2012 2:41 PM
>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>> Subject: Re: [Mpi3-ft] simplified FT proposal
>>>
>>> One concern that I have with fault tolerant proposals has to do with
>>> races in the specification.  This is an area where users often "just want
>>> it to work" but getting it right is tricky.  In the example here, the
>>> "alive_group" is really only that at some moment shortly before
>>> "MPI_Comm_check" returns (and possibly not even that).  After that, it is
>>> really the "group_of_processes_that_was_alive_at_some_point_in_the_past".
>>>  Since there are sometimes correlations in failures, this could happen even
>>> if the initial failure is rare.  An alternate form might be to have a
>>> routine, collective over a communicator, that returns a new communicator
>>> meeting some definition of "members were alive at some point during
>>> construction".  It wouldn't guarantee you could use it, but it would have
>>> cleaner semantics.
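>>>
>>> A sketch of the shape of such a call, with a placeholder name
>>> (MPIX_Comm_rebuild is purely illustrative, not a proposal):
>>>
>>>   /* Collective over work_comm; returns a new communicator whose members
>>>    * were all alive at some point during its construction.  No guarantee
>>>    * they still are by the time it is used. */
>>>   MPI_Comm survivors;
>>>   MPIX_Comm_rebuild(work_comm, &survivors);
>>>   MPI_Comm_free(&work_comm);
>>>   work_comm = survivors;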
>>>
>>> Bill
>>>
>>> On Jan 13, 2012, at 3:41 PM, Sur, Sayantan wrote:
>>>
>>> I would like to argue for a simplified version of the proposal that
>>> covers a large percentage of use cases and resists adding new “features”
>>> for the full range of ABFT techniques. It is good if we take a more
>>> pragmatic view and do not sacrifice the entire FT proposal for the 1%
>>> fringe cases. Most apps just want to do something like this:
>>>
>>> for(… really long time …) {
>>>     MPI_Comm_check(work_comm, &is_ok, &alive_group);
>>>     if(!is_ok) {
>>>         MPI_Comm_create_group(alive_group, …, &new_comm);
>>>         // re-balance workload and use new_comm in rest of computation
>>>         MPI_Comm_free(&work_comm); // get rid of old comm
>>>         work_comm = new_comm;
>>>     } else {
>>>         // continue computation using work_comm
>>>         // if some proc failed in this iteration, roll back work done in
>>>         // this iteration and go back to the top of the loop
>>>     }
>>> }
>>>
>>> William Gropp
>>> Director, Parallel Computing Institute
>>> Deputy Director for Research
>>> Institute for Advanced Computing Applications and Technologies
>>> Paul and Cynthia Saylor Professor of Computer Science
>>> University of Illinois Urbana-Champaign
>>>
>>>
>>>
>>> --
>>> Tony Skjellum, PhD
>>> RunTime Computing Solutions, LLC
>>> tony at runtimecomputing.com
>>> direct: +1-205-314-3595
>>> cell: +1-205-807-4968
>>>
>>>
>>
>>
>>
>> --
>> Tony Skjellum, PhD
>> RunTime Computing Solutions, LLC
>> tony at runtimecomputing.com
>> direct: +1-205-314-3595
>> cell: +1-205-807-4968
>>
>>
>
>
> --
> Tony Skjellum, PhD
> RunTime Computing Solutions, LLC
> tony at runtimecomputing.com
> direct: +1-205-314-3595
> cell: +1-205-807-4968
>
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey