[Mpi3-ft] MPI Fault Tolerance scenarios

Supalov, Alexander alexander.supalov at intel.com
Fri Feb 27 05:49:43 CST 2009


Dear Erez,

Thank you. I reply to your notes below in detail, but I want to say a couple of words upfront: I'm not trying to say that the proposed interface cannot cover this or that particular use case. Moreover, I see substantial value in the attempt to identify a minimum, intrinsically unambiguous, and elegant set of elementary calls that could be used to express certain FT concepts.

What I'm trying to say is that making a programmer care about whether a call failed or not after every MPI call is certainly a way to make people wrap all MPI calls into their own fault-tolerant wrappers that will do the job of analyzing the error codes, calling the appropriate recovery calls, etc.

Against this backdrop of expected user behavior, we may just as well resort to the existing error handling mechanism and make it fix faults according to a certain policy. Whether or not those error handlers have to use particular API calls is open.

Note also that by introducing a certain API we restrict the set of options available to an MPI implementation that wants to implement certain FT policies. With the proposed approach, the implementation will have to provide those API calls, which may not be natural for the platform involved.

What I'm proposing is:

- Define a standard set of FT error handlers for a defined set of policies (see, e.g., Bronis' proposal aired at the telecon: nothing, local MPI calls, pt2pt w/o affected process, coll w/o affected process, pt2pt w/ affected process, coll w/ affected process)
- Provide a prototype implementation of those handlers in terms of the proposed API
- Allow implementors to provide error handlers that do not use this API, if they find a better way of fixing up particular FT issues
- Allow users to create their own error handlers using the proposed API, to which end the API should be functional inside an error handler (see the sketch after this list)
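
To make the last two points concrete, here is a minimal sketch of what such a user-defined handler might look like in C. It relies on the standard MPI_Comm_create_errhandler / MPI_Comm_set_errhandler machinery; MPI_Comm_Irepair is the call proposed in this thread, and MPIX_Get_failed_rank is a purely hypothetical query I made up for illustration:

/* Sketch only: a "repair and continue" policy packaged as an error handler.
 * MPI_Comm_Irepair is the API proposed in this thread; MPIX_Get_failed_rank
 * is a hypothetical query standing in for "which rank failed?". */
#include <mpi.h>

static void ft_repair_handler(MPI_Comm *comm, int *errcode, ...)
{
    int failed;
    MPI_Request req;

    MPIX_Get_failed_rank(*comm, *errcode, &failed);  /* hypothetical */
    MPI_Comm_Irepair(*comm, failed, &req);           /* proposed in this thread */

    /* Return control to the application; the repair completes in the
       background and can be picked up by the progress loop. */
}

int main(int argc, char **argv)
{
    MPI_Errhandler eh;

    MPI_Init(&argc, &argv);
    MPI_Comm_create_errhandler(ft_repair_handler, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
    /* ... application code; failures now trigger ft_repair_handler ... */
    MPI_Finalize();
    return 0;
}

A standard set of such handlers, one per policy, could then ship with the implementation, and most users would only pick a policy.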

In some sense, the crossroads we're standing at now resembles the situation with MPI-IO back then: everybody was using parallel file I/O modes, with some success, and then MPI-IO came along with a pretty involved, multilevel datatype description. Although flexible in the utmost sense, this representation effectively complicated the I/O implementor's job, at least for starters, and now, about 15 years since its inception, we still see very interesting papers that describe how, basically, to implement all those well-known parallel I/O modes efficiently when datatypes look one way or another. This is a good and challenging programming task, sure. Is this what users actually need? I doubt it. Moreover, I bet that should someone come up and propose a higher-level I/O abstraction that would replace the current I/O interface with something easy to understand, parallel I/O would suddenly become a commodity rather than the rarely seen beast it is now, at least in the commercial ISV space.

Note that I'm not trying to rubbish the venerable MPI-IO. What I'm proposing is that instead of falling into the same trap, we should look at MPI-IO and think of the consequences of what we are doing. Then, while analyzing the possible elementary interface, we should still keep in mind that 99% of the users won't deal with it if provided with a simple and easy to comprehend set of standard FT error handlers. And believe me, they won't care about how they are implemented as long as they work as expected. And this in turn will increase the chances for MPI FT to be used in the field, and thus pass the ultimate usability test I mentioned elsewhere.

A part of the discussion background is that we still have not settled, as a group, whether MPI-3 is going to be the interface for high-end HPC only, or also for ever wider masses of programmers. Making things complicated favors high-end HPC. Making things easy to use favors the mass market. Where is MPI's future? Let's think about that, make the decision, and then implement it. Otherwise we'll be forever oscillating between high end and mass market, and the resulting solution will most likely please no one.

Best regards.

Alexander

________________________________
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Erez Haba
Sent: Thursday, February 26, 2009 7:21 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios

Thanks Alexander, please see inline...

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Supalov, Alexander
Sent: Wednesday, February 25, 2009 2:20 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios

Thanks. What error handler will be in effect when the MPI_Comm_Irepair is called? If it's MPI_ERRORS_ARE_FATAL, an error will terminate the application. If it's MPI_ERRORS_RETURN, one will either have to make MPI_Comm_Irepair and some other proposed FT calls an exception compared to the rest of the MPI standard, or explicitly set the error handler on the respective comm to MPI_ERRORS_RETURN.
[erezh] This server side code sets the error handler to MPI_ERRORS_RETURN. I don't see the need for any special case in this API, or am I missing something?
In general, if your application is FT, you should not have MPI_ERRORS_ARE_FATAL as your error handler, as it defeats the purpose of an FT app.

OK. I think we just need to introduce more error handlers in this case, see below.

And if one anyway has to set the error handler to something other than MPI_ERRORS_ARE_FATAL, why not delegate this whole comm restoration business to the error handler? The handler could call MPI_Comm_Irepair or anything else required, and either return control to the application on success, or abort it using MPI_Abort on the respective communicator.
[erezh] It is possible to handle the error in your error handler. However, this makes the error handler code synchronous and would block the app until the failed process is restarted. This might take some time, which is why I used the async version of repair in the master's code, hence using Irepair and not repair.
Two more things though: first, I don't want to abort in the case that I could not respawn the worker. I'd rather just shrink the number of workers. I don't think it's possible to do that in an error handler. Second, I'd like to reassign the work to another worker. I can't do that if the error is detected when the receive is posted, and anyhow the error handler might be called twice in this case, once during the send and once during the receive. I think it would have complicated the code to distinguish whether a repair was already invoked.

What would complicate the code is the necessity to do FT due diligence after each and every MPI call, which appears to be the order of the day. Also, by the time you have a fault, you're probably not so much concerned about performance as about job survival. So, paying a little overhead for the process startup appears to be a reasonable tradeoff here.

Note also that waiting for the process to be re-created resolves the problem of a process being unavailable for a spell. While the error handler is active (and this can easily be done in a reentrant manner), this process simply won't be available.

The shrinking and reassignment can just as well be done either from inside the error handler or, if that's too cumbersome, by setting a certain variable there and allowing the main control loop of the master program to reassign the piece of work as appropriate, as in the sketch below.
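
For completeness, a minimal sketch of that second, flag-based variant, assuming a handler that only records the failure (the mapping from the error code to the failed rank is again hypothetical):

/* Sketch only: the handler records the failure; the master's main loop
 * does the actual reassignment and repair at a convenient point. */
#include <mpi.h>
#include <stdbool.h>

#define MAX_RANKS 1024

static volatile bool needs_repair[MAX_RANKS];

static void ft_flag_handler(MPI_Comm *comm, int *errcode, ...)
{
    int failed = 0;                /* hypothetical: map *errcode to the failed rank here */
    needs_repair[failed] = true;   /* just record it; do not block */
}

/* Called from the master's progress loop: */
static void check_repairs(MPI_Comm comm, int nranks)
{
    for (int i = 1; i < nranks; i++) {
        if (needs_repair[i]) {
            needs_repair[i] = false;
            /* reassign q[i] to another worker, then start an async repair,
               e.g. with the proposed MPI_Comm_Irepair(comm, i, &req) */
        }
    }
}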

By defining several standard error handlers for various fault tolerance policies we could implement what Bronis was proposing, I guess. Note that these error handlers could either use the proposed low-level API, or do something implementation-specific to achieve their stated goal.
[erezh] I think that it depends very much on the application. For some it might work, but for others I don't think it would.
If you take this simple scenario code, I don't think it would be as simple as you suggest. The application needs to be notified when a repair fails and needs to decide whether to continue and run with fewer workers. I think that would be hard to accomplish with pre-canned solutions.

See above. The error handler can either do this or notify the application accordingly.

Also, different implementations could have different sets of provided standard error handlers, and thus vary the level of the fault tolerance support provided, which is what I'd be looking for. This includes providing no FT support at all by defining no special FT error handlers, thus making the FT stuff completely orthogonal to the rest of the standard.
[erezh] I think that at the moment we're trying to understand the primitives required for FT, rather than jump into defining everyone's favorite programming model. By being explicit we understand the primitives and constructs needed. Once we've done that, we can go and re-can the model into some fancy API/set of flags/...

I'm concerned that while we're trying to understand primitives in this particular or another scenario, we may be missing the whole point of FT: people may not want to deal with this stuff explicitly at every particular place of the program. They want reliable connections, and a chance to react to occasional unreliability.

Whether one has to use this or that interface for the purpose is an open question. Delegating the FT stuff to error handlers that might use internal mechanisms different from those envisioned currently might help make the standard more flexible and amenable to future architectural advances.

And finally, this would indeed make FT transparent to the application. The proposed use of the low-level API litters the application code with error checks and reactions to every possible error condition. This does not seem to be the right way.
[erezh] I think we had general agreement in the FT workgroup that we can NOT make FT transparent, nor is this a goal for the WG.

In any case, shouldn't you check the returned error codes in your app anyway? Otherwise you run into corrupted-data bugs (or don't you care?).
I can think of a nicer programming model where instead of returned errors you get exceptions and you have your error handling routines deal with the errors... oh wait, we have that with the C++ bindings... :)

For that to happen, the proposed routines should be functional inside error handlers. I think this is a strong requirement that should be added to the current spec.

The proposed error handling approach is backward compatible, flexible, elegant, and simple. Thoughts?
[erezh] Do you mean you get FT for free?
The current explicit APIs are backward compatible. They have zero effect on existing applications. I think that the model we are driving toward is flexible, elegant and simple... plus performant. You pay extra only when you care about FT, and even that can be amortized.

I think that at the moment you only see a fraction of the model. With few more scenarios I think that you'll be able to see the complexity of the problem, plus the beauty of the solution we're working towards.

Let's see that first.

________________________________
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Erez Haba
Sent: Wednesday, February 25, 2009 5:45 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
MPI_Comm_Irepair

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Supalov, Alexander
Sent: Wednesday, February 25, 2009 8:40 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios

Thanks. What call is taking care of respawning? This is not crystal clear from the scenario I'm afraid.

________________________________
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Erez Haba
Sent: Wednesday, February 25, 2009 5:35 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
Hi Alexander,

As explained in the intro section, we are building the scenarios bottom-up to create solid foundations for the FT architecture. However, you are right, and the scenarios will expand to use collectives and other communicators, just not in this first scenario.

This scenario is very, very simple, and very, very basic, to show the absolute minimum required for process FT. It only uses point-to-point communication over comm world. The workers are stateless and can be respawned without any need for state recovery.

Thanks,
.Erez

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Supalov, Alexander
Sent: Wednesday, February 25, 2009 6:25 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios

Dear Erez,

Thank you. A couple of questions:

1. You seem to restrict communication to pt2pt only. Why? A Bcast upfront could be useful, for one.
2. I can imagine more complicated communicator combinations than only MPI_COMM_WORLD. Why do we require one communicator?
3. It appears that failed slaves cannot be simply respawned. Is this what a repair would do anyway?

Best regards.

Alexander

________________________________
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Erez Haba
Sent: Wednesday, February 18, 2009 3:53 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: [Mpi3-ft] MPI Fault Tolerance scenarios
Hello all,

In our last meeting we decided to build a set of FT scenarios/programs to help us understand the details of the interface needed to support those scenarios. We also decided to start with very simple scenarios and add more complex ones as we understand the former better. I hope that starting with simple scenarios will help us build a solid foundation on which we can build the more complex solutions.

When we build an FT solution we will focus on the scenario as described, without complicating the solution just because it would be needed later for a more complex one. The time will come later to modify the solution as we acquire more knowledge and build the foundations. Hence, any proposal or change that we make needs to fit exactly the scenario (and all those that we previously looked at) but no more.
For example, in the first scenario that we'll look at there is no need for saving communicator state or an error callback; but they might be required later.

Note that these scenarios focus on process FT rather than checkpoint/restart or network degradation. I assume we'll do the latter later.
Scenario #1: Very Simple Master-Workers
Description
This is a very simple master-workers scenario. However simple, we were asked many times by customers to support FT in this scenario.
In this case the MPI application runs with n processes, where rank 0 is used as the master and n-1 ranks are used as workers. The master generates work (either by getting it directly from user input, or by reading a file) and sends it for processing to a free worker rank. The master sends requests and receives replies using MPI point-to-point communication. The workers wait for an incoming message; upon arrival, the worker computes the result and sends it back to the master. The master stores the result in a log file.

Hardening: The goal is to harden the workers; the master itself is not FT, thus if it fails the entire application fails. In this case the workers are FT, and are replaced to keep the computation power for this application. (A twist: if a worker cannot be recovered, the master can work with a smaller set of workers, down to a low watermark.)
Worker
The worker waits on a blocking receive; when a message arrives, it processes it. If a 'done' message arrives, the worker finalizes MPI and exits normally.

Hardening: There is no special requirement for hardening here. If the worker encounters a communication problem with the master, it means that the master is down and it's okay to abort the entire job. Thus, it will use the default error handler (which aborts on errors). Note that we do not need to modify the client at all to make the application FT (only the master needs changes).

Pseudo code for the hardened worker:
int main()
{
    MPI_Init()

    for(;;)
    {
        MPI_Recv(src=0, &query, MPI_COMM_WORLD);
        if(is_done_msg(query))
            break;

        process_query(&query, &answer);

        MPI_Send(dst=0, &answer, MPI_COMM_WORLD);
    }

    MPI_Finalize()
}


Notice that for this FT code there is no requirement for the worker to rejoin the comm, as the only communicator used is MPI_COMM_WORLD.

Master
The master code reads queries from a stream and passes them on to the workers to process. The master goes through several phases. In the initialization phase it sends the first request to each one of the ranks; in the second phase it shuts down any unnecessary ranks (if the job is too small); in the third phase it enters its progress engine, where it handles replies (answers), process recovery, and termination (on input end).

Hardening: It is the responsibility of the master to restart any failing workers and make sure that the request (query) does not get lost if a worker fails. Hence, every time an error is detected the master will move the worker into a repairing state and move its workload to other workers.
The master runs with errors returned rather than fatal.

One thing to note about the following code: it is not optimized. I did not try to overlap computation with communication (which is possible); I tried to keep it as simple as possible for the purpose of discussion.

Pseudo code for the hardened master; the code needed for repairing the failed ranks is marked with '>>>' (highlighted in yellow in the original).
int main()
{
    MPI_Init()
>>> MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    MPI_Request r[n] = MPI_REQUEST_NULL;
    QueryMessage q[n];
    AnswerMessage a[n];
    int active_workers = 0;

>>> bool repairing[n] = false;

    //
    // Phase 1: send initial requests
    //
    for(int i = 1; i < n; i++)
    {
        if(get_next_query(stream, &q[i]) == eof)
            break;

        active_workers++;
        MPI_Send(dest=i, &q[i], MPI_COMM_WORLD);
        rc = MPI_Irecv(src=i, buffer=&a[i], request=&r[i], MPI_COMM_WORLD);
>>>     if(rc != MPI_SUCCESS)
>>>     {
>>>         start_repair(i, repairing, q, a, r, stream);
>>>     }
    }

    //
    // Phase 2: finalize any unnecessary ranks
    //
    for(int i = active_workers + 1; i < n; i++)
    {
        MPI_Send(dest=i, &done_msg, MPI_COMM_WORLD);
    }


    //
    // The progress engine. Get answers; send new requests and handle
    // process repairs
    //
    while(active_workers != 0)
    {
        rc = MPI_Waitany(n, r, &i, MPI_STATUS_IGNORE);

>>>     if(rc != MPI_SUCCESS)
>>>     {
>>>         if(!repairing[i])
>>>         {
>>>             start_repair(i, repairing, q, a, r, stream);
>>>         }
>>>         else
>>>         {
>>>             // the repair itself failed; give up on this worker
>>>             active_workers--;
>>>         }
>>>         continue;
>>>     }

>>>     if(repairing[i])
>>>     {
>>>         // repair succeeded; there is no answer to process, just hand out new work
>>>         repairing[i] = false;
>>>     }
>>>     else
>>>     {
            process_answer(&a[i]);
>>>     }

        if(get_next_input(stream, &q[i]) == eof)
        {
            active_workers--;
            MPI_Send(dest=i, &done_msg, MPI_COMM_WORLD);
        }
        else
        {
            MPI_Send(dest=i, &q[i], MPI_COMM_WORLD);
            rc = MPI_Irecv(src=i, buffer=&a[i], request=&r[i], MPI_COMM_WORLD);
>>>         if(rc != MPI_SUCCESS)
>>>         {
>>>             start_repair(i, repairing, q, a, r, stream);
>>>         }
        }
    }

    MPI_Finalize()
}


>>> void start_repair(int i, bool repairing[], QueryMessage q[], AnswerMessage a[], MPI_Request r[], Stream stream)
>>> {
>>>     repairing[i] = true;
>>>     push_query_back(stream, &q[i]);
>>>     MPI_Comm_Irepair(MPI_COMM_WORLD, i, &r[i]);
>>> }



Logic description (without FT)
The master code keeps track of the number of active workers through the active_workers variable. It is solely used for the purpose of shutdown. When the master is out of input, it shuts down the workers by sending them a 'done' message. It decreases the number of active workers and finalizes when this number reaches zero.

The master's progress engine waits on a vector of requests (note that entry 0 is not used, to simplify the code); once it gets an answer it processes it and sends the next query to that worker, until it runs out of input.

Logic description (with FT)
The master detects a faulty client either synchronously, when it tries to initiate an async receive (no need to check the send; the assumption is that if the send failed, so will the receive call), or asynchronously, when the async receive completes with an error. Once an error is detected (and identified as a faulty client, more about this later), the master starts an async repair of that client. If the repair succeeds, new work is sent to that client. If it does not, the number of active workers is decreased and the master has to live with less processing power.

The code above assumes that if the returned code is an error, it should repair the worker; however, as we discussed, there could very well be many different reasons for an error here, not all of which are related to process failure. For that we might use something along the lines of

if(MPI_Error_event(rc) == MPI_EVENT_PROCESS_DOWN)...

It would be the responsibility of the MPI implementation to encode or store the event related to the returned error code.
(Note: in MPICH2 there is a mechanism that enables encoding extended error information in the error code, which can then be retrieved using MPI_Error_string.)
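
For comparison, here is roughly what can be done today with the standard error inquiry calls; MPIX_ERR_PROC_FAILED is a placeholder for whatever error class an FT-aware implementation would use to report a failed process, not a standard constant:

/* Sketch only: decode an error code with today's standard calls.
 * MPIX_ERR_PROC_FAILED is a hypothetical error class, not part of MPI-2. */
#include <mpi.h>
#include <stdio.h>

static int is_process_failure(int rc)
{
    char msg[MPI_MAX_ERROR_STRING];
    int  len, cls;

    MPI_Error_class(rc, &cls);         /* map the code to an error class  */
    MPI_Error_string(rc, msg, &len);   /* implementation-specific details */
    fprintf(stderr, "MPI error (class %d): %s\n", cls, msg);

    return cls == MPIX_ERR_PROC_FAILED;  /* hypothetical */
}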

Conclusions
I believe that the solution above describes what we discussed in the last meeting. The required APIs to support this FT are really minimal but already cover a good set of users.

Please, send your comments.
Thoughts?

Thanks,
.Erez

P.S. I will post this on the FT wiki pages (with the feedback).
P.P.S. There is one more scenario that we discussed, an extension of the master-workers model. I will try to write it up as soon as possible.



