[Mpi3-ft] simplified FT proposal

Josh Hursey jjhursey at open-mpi.org
Tue Jan 17 15:35:06 CST 2012


The next teleconf is tomorrow at noon. I'm sending out the announcement
shortly.

-- Josh

On Tue, Jan 17, 2012 at 4:21 PM, Anthony Skjellum <tony at cis.uab.edu> wrote:

> Josh, I just sent you a note, agreeing to do a writeup.  I need a few
> days.  When is the next telecon?
>
> Tony
>
>
> ----- Original Message -----
> From: "Josh Hursey" <jjhursey at open-mpi.org>
> To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working Group" <
> mpi3-ft at lists.mpi-forum.org>
> Sent: Tuesday, January 17, 2012 3:15:13 PM
> Subject: Re: [Mpi3-ft] simplified FT proposal
>
>
> Tony,
>
>
> I have to admit that I am getting a bit lost in the multipart presentation
> of your proposal, which makes it difficult for me to reason about it as a
> whole. Can you synthesize your proposal (either as a
> separate email thread or on the wiki) and maybe we can discuss it further
> on the teleconf and mailing list? That will give us a single place to point
> to and refine during the subsequent discussions.
>
>
> Thanks,
> Josh
>
>
> On Mon, Jan 16, 2012 at 4:18 PM, Anthony Skjellum <
> tony at runtimecomputing.com > wrote:
>
>
>
> More
>
>
> #5)
>
>
> A parallel thread, running on the same group as the communicator used for
> the user program in FTBLOCK, would be free to run a gossip protocol,
> instantiated
> a) by the user at his/her choice
> b) by MPI at its choice, by user demand
>
>
> This parallel inspector thread propagates error state, and accepts input
> to the error state by allowing user programs to assert error state within
> the FTBLOCK.
>
>
> You can think of this parallel thread as cooperation of the progress
> engines with gossip, or as a completely optional activity, created by the
> user, or as
> a part of the FT proposal. As long as there are local ways to add error
> state, we can write this as a layered library for the sake of making sure
> that local
> errors are propagated (remembering that MPI is returning local errors to
> the user inside his/her FTBLOCK).
>
>
> If we want the implementation to provide/support this kind of FT-friendly
> error propagation, that is a bigger step, which I am not advocating as
> required.
> I think it would be nice to have this be part of a fault tolerant
> communication space. But, I am OK with users doing this for themselves,
> esp. because they
> can control granularity better.
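>
> A minimal sketch of what such a user-created inspector thread might look
> like (purely illustrative; ft_assert_error/ft_error_seen, the ring-gossip
> pattern, and the dup'ed communicator are my assumptions, not proposal text):
>
>   /* Sketch only: user-level error-state propagation on a dup of the
>      FTBLOCK communicator. Needs MPI_THREAD_MULTIPLE, a shutdown path,
>      and a transport that survives the failures it is reporting.
>      Launch with, e.g.:
>        MPI_Comm_dup(ftblock_comm, &gossip_comm);
>        pthread_create(&tid, NULL, inspector, &gossip_comm);             */
>   #include <mpi.h>
>   #include <pthread.h>
>
>   static volatile int local_error = 0;      /* set by app or on MPI error */
>   void ft_assert_error(void) { local_error = 1; }  /* hypothetical helper */
>   int  ft_error_seen(void)   { return local_error; }
>
>   static void *inspector(void *arg)
>   {
>       MPI_Comm gossip = *(MPI_Comm *)arg;
>       int rank, size;
>       MPI_Comm_rank(gossip, &rank);
>       MPI_Comm_size(gossip, &size);
>       for (;;) {
>           int out = local_error, in = 0;
>           /* exchange error state around a ring; an asserted error spreads */
>           MPI_Sendrecv(&out, 1, MPI_INT, (rank + 1) % size, 0,
>                        &in,  1, MPI_INT, (rank + size - 1) % size, 0,
>                        gossip, MPI_STATUS_IGNORE);
>           if (in) local_error = 1;
>           /* back off (e.g., nanosleep) to control gossip granularity */
>       }
>       return NULL;
>   }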
>
>
> #6) I think MPI_COMM_WORLD is a bummer in the FT world. We don't want it
> to hang around very long after we get going. If we are really working on
> subgroups, and these subgroups form a hierarchical graph, rather than an
> all-to-all virtual topology, I don't want to reconstruct MPI_COMM_WORLD
> after
> the first process breaks. So, as an example of the pain of MPI scalability,
> the build-down model of MPI-1 is less suited to this than a build-up
> model, where we find a way for groups to rendezvous. Obviously, for
> convenience, the all-to-all virtual topology at the onset of MPI_Init() is
> nice,
> but I am assuming that errors may happen quite quickly.
>
>
> MPI_Init() with no failures is only good for a limited window of time,
> given failure rates. During this time, we would like to "turn off
> MPI_COMM_WORLD" unless we really do need it, or at least never have to
> reconstruct it if we don't need it.
>
>
> So, we need to agree on a fault-friendly MPI_Init()...
>
>
> One option is an FTBLOCK-like construct surrounding MPI_Init(), where the
> external rules for spawning MPI, or the spawn command that creates the
> world, create an effective communicator. We should generalize MPI_Init() to
> support outcomes other than pure success, such as partial success (e.g., a
> smaller world than stipulated).
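>
> Purely as an illustration of what that could look like at startup
> (MPIX_SUCCESS_PARTIAL is an invented return code, not a proposal):
>
>   int rc = MPI_Init(&argc, &argv);
>   if (rc == MPI_SUCCESS) {
>       /* the full stipulated world came up */
>   } else if (rc == MPIX_SUCCESS_PARTIAL) {   /* hypothetical return code */
>       /* smaller world than stipulated: query what actually came up and
>          rendezvous only the groups the application really needs, rather
>          than rebuilding MPI_COMM_WORLD */
>   }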
>
>
> A really good approach would be to only create the actual communicators
> you really need to run your process, not the virtual-all-to-all world of
> MPI-1 and MPI-2.
> But, that is a heresy, so forgive me.
>
>
> Tony
>
>
> On Mon, Jan 16, 2012 at 2:46 PM, Anthony Skjellum <
> tony at runtimecomputing.com > wrote:
>
>
> Sayantan,
>
>
> Here are my further thoughts, which hopefully include an answer to your
> questions :-)
>
>
> 1) Everything is set to MPI_ERRORS_RETURN; the goal is to get local errors
> back if they are available.
>
>
> 2) I would emphasize non-blocking operations, but blocking operations
> implemented with an internal timeout could return a timeout-type error.
>
>
> 3) You don't have to return the same return code, or the same results, in
> all processes in the communicator; you can get erroneous results or local
> failures. The functions are also allowed to produce incorrect results [and
> we should then discuss what error reporting means here... I am happy with
> local errors returned where known, recognizing that those processes may die
> before the bottom of the block. However, I also expect the implementation
> to do its best to propagate error-state knowledge within this FTBLOCK,
> based organically on ongoing communication or on gossip if an
> implementation so chooses.]
>
>
> Also, because we assume that algorithmic fault tolerance is at work, local
> errors may be raised by the application because it is doing checks for
> validity, etc.
>
>
> So, either MPI or the application may raise local errors prior to the
> bottom of the FTBLOCK, and the test at the bottom of the block must be
> allowed to fail based on ABFT inputs from the application to MPI, not
> just based on MPI's opinion.
>
>
> 4) If you are willing to do everything nonblocking, then I can describe
> the test at the bottom of the FTBLOCK as follows:
>
>
> The test operation at the bottom of the FTBLOCK is effectively a
> generalized WAIT_ALL that either completes or fails to complete all the
> outstanding requests, returns errors related to the faults observed, and
> provides a unified 0/1 success/failure state consistently across the group
> of comm [or the surviving members thereof].
>
>
> In my view, the application as well as MPI can contribute error state as
> input to the FTBLOCK test.
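>
> A rough sketch of the shape I have in mind (Test_Block_waitall and the
> app_error argument are made-up names for this generalized WAIT_ALL, and
> my_abft_check is a stand-in for an application validity check):
>
>   double sbuf[1] = {0.0}, rbuf[1];
>   MPI_Request reqs[2];
>   int left, right;        /* neighbor ranks, set elsewhere */
>   int app_error = 0, ok;
>
>   Start_Block(comm);                                   /* hypothetical */
>   MPI_Irecv(rbuf, 1, MPI_DOUBLE, left,  0, comm, &reqs[0]);
>   MPI_Isend(sbuf, 1, MPI_DOUBLE, right, 0, comm, &reqs[1]);
>   /* ...more nonblocking ops; local errors come back as return codes... */
>   if (!my_abft_check())                /* application-asserted error state */
>       app_error = 1;
>
>   /* generalized WAIT_ALL: completes (or fails to complete) the outstanding
>      requests, folds in MPI-detected faults plus app-asserted state, and
>      returns one agreed 0/1 result across the (surviving) group            */
>   ok = Test_Block_waitall(comm, 2, reqs, app_error);   /* hypothetical */
>   if (!ok) { /* retry, rebuild, or fall back as discussed below */ }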
>
>
> Also, an application that gets local errors inside the loop is already
> ready to punt at that point and jump to the bottom of the loop. For now,
> let's assume it is required to attempt all the operations before getting
> to the bottom of the loop, and we simply allow that some of these may
> return further errors (I am trying to keep it simple), with MPI-like rules
> of attempting all the operations. If we can get this to work, we can
> weaken it later.
>
>
> There is no good way to describe a mix of BLOCKING and nonblocking
> operations, because we have no descriptor to tell us whether something
> that previously returned, without giving a local error, has since failed;
> so I am not going to pursue BLOCKING for now. Let's assume we cannot do
> BLOCKING, and weaken this later if we can get a consistent solution using
> all-nonblocking operations.
>
>
> Please tell me what you think.
>
>
> Thanks for responding!
>
>
> Tony
>
>
>
>
>
>
> On Mon, Jan 16, 2012 at 12:39 PM, Sur, Sayantan < sayantan.sur at intel.com> wrote:
>
>
>
>
>
>
> Hi Tony,
>
>
>
> In the example semantics you mentioned, are the “ops” required to return
> the same result on all processors? Although this doesn’t change the API of
> “op”, it does change the completion semantics of almost all MPI ops. I hope
> I am correctly interpreting your message.
>
>
>
>
> Thanks.
>
>
>
> ===
>
> Sayantan Sur, Ph.D.
>
> Intel Corp.
>
>
>
>
>
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:
> mpi3-ft-bounces at lists.mpi-forum.org ] On Behalf Of Anthony Skjellum
> Sent: Sunday, January 15, 2012 7:06 PM
>
>
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] simplified FT proposal
>
>
>
>
>
>
>
> Everyone, I think we need to start from scratch.
>
>
>
>
>
> We should look for minimal fault-tolerant models that are achievable and
> useful. They may allow for a combination of faults (process and network),
> but in the end, as discussed in San Jose:
>
>
>
>
>
> FTBLOCK
> --------------
> Start_Block(comm)
>
>   op   [normal MPI operation on the communicator specified by Start_Block,
>         or a subset thereof]
>   op
>   op
>
> Test_Block(comm)
>
>
>
>
>
> Test_Block either succeeds or fails on the whole list of operations.
> Combined with ways to reconstruct communicators and add back processes
> (easily), this provides a three-level fault-tolerance model (a rough
> sketch follows the list):
>
> a) Simply retry if the kind of error at the Test_Block is retryable
> b) Reconstruct the communicator, use algorithmic fault tolerance to
>    recover lost data, and retry the block
> c) Drop back to one or more levels of checkpoint-restart
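>
> A rough sketch of that decision structure (the predicates and helper
> routines here are illustrative only, not proposed API):
>
>   for (;;) {
>       Start_Block(comm);
>       /* op ... op: the normal MPI operations of one block */
>       int rc = Test_Block(comm);
>       if (rc == 0) break;                    /* 0 == success: block done  */
>       if (is_retryable(rc)) continue;        /* (a) just retry the block  */
>       if (is_process_fault(rc)) {            /* (b) rebuild and retry     */
>           rebuild_comm_and_readd_processes(&comm);
>           recover_lost_data_with_abft(comm);
>           continue;
>       }
>       restart_from_checkpoint();             /* (c) last resort           */
>   }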
>
>
>
>
>
> We can envision in this model an unrolling of work, in terms of a
> parameter N, if there is a lot of vector work, to allow granularity
> control as a function of the fault environment.
>
>
>
>
>
> In some sense, a simpler model that provides for detection efforts
>
> a) by MPI, and
> b) by allowing an application monitor to assert failure asynchronously to
>    this loop
>
> provides a more general ability to cover faults, including but not limited
> to process faults and possible network faults.
>
>
>
>
>
> It changes the API very little.
>
>
>
>
>
> It motivates the use of buffers, not zero copy, to support the fact that
> you may have to roll back a series of operations, thereby exposing the
> fault-free overhead directly.
>
>
>
>
>
> Start_Block and Test_Block are collective and synchronizing, like
> barriers. Because we have uncertainty to within a message, multiple
> barriers may be needed (as mentioned to me by George Bosilca in a sidebar
> at the meeting).
>
>
>
>
>
> We try to get this to work, COMPLETELY, and ratify this in MPI 3.x, if we
> can. Once we have this stable intermediate form, we explore more options.
>
>
>
>
>
> I think it is important to recognize that the reconstruction step,
> including re-adding processes and making new communicators, may mean
> smarter Join operations. It is clear we need to be able to treat failures
> during the recovery process, and use a second-level loop, possibly bombing
> out to a checkpoint, if we cannot make net progress on recovery because of
> unmodeled error issues.
>
>
>
>
>
> The testing part leverages all the learning so far, but needn't be
> restricted to modeled errors like process faults. There can be modeled and
> unmodeled faults. Based on what fault comes up, the user application then
> has to decide how hard a retry to do: whether just to add processes,
> whether just to retry the loop, whether to go to a checkpoint, or whether
> to restart the APP. MPI could give advice, based on its understanding of
> the fault model, in terms of sufficient conditions for "working harder"
> vs. "trying the easiest," for fault models it understands reasonably well
> on a given system.
>
>
>
>
>
> Now, the comments here are a synopsis of part of the sidebars and open
> discussion we had in San Jose, distilled a bit. I want to know why we
> can't start with this, succeed with this, implement and test it, and,
> having succeeded, do more in a future 3.y (y > x) release, given user
> experience.
>
>
>
>
>
> I am not speaking to the choice of "killing all communicators" as with
> FT-MPI, or "just remaking those you need to remake." I think we need to
> resolve. Honestly, groups own the fault property, not communicators, and
> all groups held by communicators where the fault happened should be
> rebuilt, not all communicators... Let's argue on that.
>
>
>
>
>
> So, my suggestion is to REBOOT the proposal with something along the lines
> above, unless you all see this as no better.
>
>
>
>
>
> With kind regards,
>
>
> Tony
>
>
>
> On Sun, Jan 15, 2012 at 8:00 PM, Sur, Sayantan < sayantan.sur at intel.com >
> wrote:
>
>
>
> Hi Bill,
>
>
>
> I am in agreement with your suggestion to have a collective over a
> communicator that returns a new communicator containing ranks “alive at some
> point during construction”. It provides cleaner semantics. The example was
> merely trying to utilize the new MPI_Comm_create_group API that the Forum
> is considering.
>
>
>
> MPI_Comm_check provides a method to form global consensus, in that all
> ranks in comm did call it. It does not imply anything about the current
> status of comm, or even the status “just before” the call returns. During the
> interval before the next call to MPI_Comm_check, it is possible that many
> ranks fail. However, the app/lib using MPI knows the point where everyone
> was alive.
>
>
>
> Thanks.
>
>
>
>
>
> ===
>
> Sayantan Sur, Ph.D.
>
> Intel Corp.
>
>
>
>
>
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:
> mpi3-ft-bounces at lists.mpi-forum.org ] On Behalf Of William Gropp
> Sent: Sunday, January 15, 2012 2:41 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] simplified FT proposal
>
>
>
>
>
> One concern that I have with fault tolerant proposals has to do with races
> in the specification. This is an area where users often "just want it to
> work" but getting it right is tricky. In the example here, the
> "alive_group" is really only that at some moment shortly before
> "MPI_Comm_check" returns (and possibly not even that). After that, it is
> really the "group_of_processes_that_was_alive_at_some_point_in_the_past".
> Since there are sometimes correlations in failures, this could happen even
> if the initial failure is rare. An alternate form might be to have a
> routine, collective over a communicator, that returns a new communicator
> meeting some definition of "members were alive at some point during
> construction". It wouldn't guarantee you could use it, but it would have
> cleaner semantics.
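>
> A rough sketch of the kind of alternate routine described here (the name
> MPIX_Comm_validate_create and its signature are purely illustrative, not an
> agreed API):
>
>   /* Collective over comm; returns a new communicator whose members were
>      all alive at some point during its construction. There is no
>      guarantee they still are by the time the caller uses it. */
>   int MPIX_Comm_validate_create(MPI_Comm comm, MPI_Comm *newcomm);
>
>   /* usage */
>   MPI_Comm next_comm;
>   if (MPIX_Comm_validate_create(work_comm, &next_comm) == MPI_SUCCESS) {
>       MPI_Comm_free(&work_comm);
>       work_comm = next_comm;
>   }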
>
>
>
>
>
>
> Bill
>
>
>
>
>
>
> On Jan 13, 2012, at 3:41 PM, Sur, Sayantan wrote:
>
>
>
>
> I would like to argue for a simplified version of the proposal that covers
> a large percentage of use-cases and resists adding new “features” for the
> full range of ABFT techniques. It is better if we take a more pragmatic view
> and not sacrifice the entire FT proposal for the 1% fringe cases. Most apps
> just want to do something like this:
>
>
>
>
>
> for (… really long time …) {
>     MPI_Comm_check(work_comm, &is_ok, &alive_group);
>     if (!is_ok) {
>         MPI_Comm_create_group(alive_group, …, &new_comm);
>         // re-balance workload and use new_comm in the rest of the computation
>         MPI_Comm_free(&work_comm);   // get rid of the old comm
>         work_comm = new_comm;
>     } else {
>         // continue computation using work_comm
>         // if some proc failed in this iteration, roll back the work done in
>         // this iteration and go back to the top of the loop
>     }
> }
>
>
>
>
>
>
>
>
>
> William Gropp
> Director, Parallel Computing Institute
> Deputy Director for Research
> Institute for Advanced Computing Applications and Technologies
> Paul and Cynthia Saylor Professor of Computer Science
> University of Illinois Urbana-Champaign
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> --
> Tony Skjellum, PhD
> RunTime Computing Solutions, LLC
> tony at runtimecomputing.com
> direct: +1-205-314-3595
> cell: +1-205-807-4968
>
>
>
>
> --
> Tony Skjellum, PhD
> RunTime Computing Solutions, LLC
> tony at runtimecomputing.com
> direct: +1-205-314-3595
> cell: +1-205-807-4968
>
>
>
>
>
> --
> Tony Skjellum, PhD
> RunTime Computing Solutions, LLC
> tony at runtimecomputing.com
> direct: +1-205-314-3595
> cell: +1-205-807-4968
>
>
>
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
>
> --
> Anthony Skjellum, PhD
> Professor and Chair
> Dept. of Computer and Information Sciences
> Director, UAB Center for Information Assurance and Joint Forensic Research
> ("The Center")
> University of Alabama at Birmingham
> +1-(205)934-8657; FAX: +1- (205)934-5473
>
>
>


-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey