[Mpi3-ft] MPI Fault Tolerance scenarios
Erez Haba
erezh at MICROSOFT.com
Thu Feb 26 11:42:23 CST 2009
I might have missed it, but I don't recall saying that we'll use MPI_Comm_spawn to restart a process. As I recall, there are two flavors of MPI_Comm_repair: one that restarts a process and the other that fixes the communicator by leaving a hole.
Rich, for now I'd like us to be explicit about restart vs. fixing (rather than using a policy). So I suggest that we call the APIs MPI_Comm_restart_rank and MPI_Comm_remove_rank
(and MPI_Comm_Irestart_rank, no need for Iremove as it’s a local operation)
I'll change the sample code to reflect that.
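For reference, this is roughly the shape I have in mind (the names and signatures are only a strawman for discussion, nothing is agreed yet):

/* Restart the failed process at 'rank' and splice it back into 'comm';
   the non-blocking flavor completes its request when the restart is done. */
int MPI_Comm_restart_rank(MPI_Comm comm, int rank);
int MPI_Comm_Irestart_rank(MPI_Comm comm, int rank, MPI_Request *request);

/* Fix 'comm' by leaving a hole: 'rank' is marked as gone for good.
   This is a local operation, so no non-blocking flavor is needed. */
int MPI_Comm_remove_rank(MPI_Comm comm, int rank);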
Thoughts?
Thanks,
.Erez
-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Richard Graham
Sent: Thursday, February 26, 2009 9:03 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
MPI_Comm_Irepair will use the policy set on the communicator to repair this
communicator. If some processes need to be restarted, this will initiate
the recovery.
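For concreteness, one possible shape for this (the policy-setting call is
entirely made up here; only MPI_Comm_Irepair has been proposed):

/* Hypothetical sketch: pick the repair policy once per communicator, then
   a single repair call does the right thing for it. */
MPI_Comm_set_repair_policy(MPI_COMM_WORLD, MPI_REPAIR_RESTART);  /* made-up call */
...
MPI_Comm_Irepair(MPI_COMM_WORLD, failed_rank, &request);  /* restarts per policy */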
Rich
On 2/26/09 11:10 AM, "Josh Hursey" <jjhursey at open-mpi.org> wrote:
> I thought that MPI_Comm_Irepair just repaired the communicator, not
> respawned a process? I remember that we discussed that the spawning of
> processes should be done using the MPI_Comm_spawn() commands, though
> we will need to think about how to extend them to say spawn-and-rejoin-
> communicator-X.
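> For example, something along these lines (the info key is made up,
> just to illustrate the idea):
>
> MPI_Comm intercomm;
> MPI_Info info;
> MPI_Info_create(&info);
> MPI_Info_set(info, "rejoin_comm", "<id of communicator X>");  /* hypothetical key */
> MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, info, 0,
>                MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);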
>
> Has something changed with this since our previous discussions?
>
> Best,
> Josh
>
> On Feb 25, 2009, at 11:44 AM, Erez Haba wrote:
>
>> MPI_Comm_Irepair
>>
>> From: mpi3-ft-bounces at lists.mpi-forum.org
>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org
>> ] On Behalf Of Supalov, Alexander
>> Sent: Wednesday, February 25, 2009 8:40 AM
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
>>
>> Thanks. What call is taking care of respawning? This is not crystal
>> clear from the scenario, I'm afraid.
>>
>> From: mpi3-ft-bounces at lists.mpi-forum.org
>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org
>> ] On Behalf Of Erez Haba
>> Sent: Wednesday, February 25, 2009 5:35 PM
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
>> Hi Alexander,
>>
>> As explained in the intro section, we are building the scenarios
>> bottom up to create solid foundations for the FT architecture.
>> However, you are right, and the scenarios will expand to use
>> collectives and other communicators, just not in this first scenario.
>>
>> This scenario is very, very simple and very, very basic, to show the
>> absolute minimum required for process FT. It only uses point-to-
>> point communication over MPI_COMM_WORLD. The workers are stateless
>> and can be respawned without any need for state recovery.
>>
>> Thanks,
>> .Erez
>>
>> From: mpi3-ft-bounces at lists.mpi-forum.org
>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org
>> ] On Behalf Of Supalov, Alexander
>> Sent: Wednesday, February 25, 2009 6:25 AM
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
>>
>> Dear Erez,
>>
>> Thank you. A couple of questions:
>>
>> 1. You seem to restrict communication to pt2pt only. Why? A Bcast
>> upfront could be useful, for one.
>> 2. I can imagine more complicated communicator combinations than
>> only MPI_COMM_WORLD. Why do we require one communicator?
>> 3. It appears that failed slaves cannot be simply respawned. Is this
>> what a repair would do anyway?
>>
>> Best regards.
>>
>> Alexander
>>
>> From: mpi3-ft-bounces at lists.mpi-forum.org
>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org
>> ] On Behalf Of Erez Haba
>> Sent: Wednesday, February 18, 2009 3:53 AM
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> Subject: [Mpi3-ft] MPI Fault Tolerance scenarios
>> Hello all,
>>
>> In our last meeting we decided to build a set of FT scenarios/
>> programs to help us understand the details of the interface needed to
>> support those scenarios. We also decided to start with very simple
>> scenarios and add more complex ones as we understand the former
>> better. I hope that starting with simple scenarios will help us
>> build a solid foundation on which we can build the more complex
>> solutions.
>>
>> When we build an FT solution we will focus on the scenario as
>> described, without complicating the solution just because it would
>> be needed later for a more complex one. The time will come later to
>> modify the solution as we acquire more knowledge and build the
>> foundations. Hence, any proposal or change that we make needs to
>> fit exactly the scenario (and all those that we previously looked at)
>> but no more.
>> For example, in the first scenario that we'll look at there is no
>> need for saving communicator state or error callbacks; but they
>> might be required later.
>>
>> Note that these scenarios focus on process FT rather than checkpoint/
>> restart or network degradation. I assume we'll do the latter later.
>> Scenario #1: Very Simple Master-Workers
>> Description
>> This is a very simple master-workers scenario. However simple, we
>> were asked many times by customers to support FT in this scenario.
>> In this case the MPI application runs with n processes, where
>> rank 0 is used as the master and the other n-1 ranks are used as
>> workers. The master generates work (either by getting it directly
>> from user input, or by reading a file) and sends it for processing
>> to a free worker rank. The master sends requests and receives
>> replies using MPI point-to-point communication. The workers wait
>> for an incoming message; upon arrival the worker computes the
>> result and sends it back to the master. The master stores the
>> result in a log file.
>>
>> Hardening: The goal is to harden the workers; the master itself is
>> not FT, thus if it fails the entire application fails. In this case
>> the workers are FT, and are replaced to keep computation power for
>> this application. (A twist: if a worker cannot be recovered, the
>> master can work with a smaller set of workers, down to a low
>> watermark.)
>> Worker
>> The worker waits on a blocking receive; when a message arrives it
>> processes it. If a 'done' message arrives, the worker finalizes MPI
>> and exits normally.
>>
>> Hardening: There is no special requirement for hardening here. If
>> the worker encounters a communication problem with the master, it
>> means that the master is down and it's okay to abort the entire job.
>> Thus, it will use the default error handler (which aborts on
>> errors). Note that we do not need to modify the worker at all to
>> make the application FT (only the master changes).
>>
>> Pseudo code for the hardened worker:
>> int main()
>> {
>>     MPI_Init()
>>
>>     for(;;)
>>     {
>>         MPI_Recv(src=0, &query, MPI_COMM_WORLD);
>>         if(is_done_msg(query))
>>             break;
>>
>>         process_query(&query, &answer);
>>
>>         MPI_Send(dst=0, &answer, MPI_COMM_WORLD);
>>     }
>>
>>     MPI_Finalize()
>> }
>>
>>
>> Notice that for this FT code there is no requirement for the worker
>> to rejoin the comm, as the only communicator used is MPI_COMM_WORLD.
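>>
>> For concreteness, here is roughly what the worker looks like with
>> real MPI calls (a sketch only; it assumes the query and answer
>> messages are fixed-size byte buffers sent on tag 0, and that
>> is_done_msg() and process_query() are application code):
>>
>> #include <mpi.h>
>>
>> #define MSG_SIZE 1024
>> #define TAG_WORK 0
>>
>> int main(int argc, char* argv[])
>> {
>>     char query[MSG_SIZE], answer[MSG_SIZE];
>>
>>     MPI_Init(&argc, &argv);
>>
>>     for(;;)
>>     {
>>         /* blocking receive from the master (rank 0); the default error
>>            handler aborts the whole job on failure, which is what we want */
>>         MPI_Recv(query, MSG_SIZE, MPI_BYTE, 0, TAG_WORK,
>>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>         if(is_done_msg(query))
>>             break;
>>
>>         process_query(query, answer);    /* application-specific work */
>>
>>         MPI_Send(answer, MSG_SIZE, MPI_BYTE, 0, TAG_WORK, MPI_COMM_WORLD);
>>     }
>>
>>     MPI_Finalize();
>>     return 0;
>> }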
>>
>> Master
>> The master code reads queries from a stream and passes them on to
>> the workers to process. The master goes through several phases. In
>> the initialization phase it sends the first request to each one of
>> the ranks; in the second phase it shuts down any unnecessary ranks
>> (if the job is too small); in the third phase it enters its progress
>> engine where it handles replies (answers), process recovery and
>> termination (on input end).
>>
>> Hardening: It is the responsibility of the master to restart any
>> failing workers and make sure that the request (query) does not get
>> lost if a worker fails. Hence, every time an error is detected the
>> master will move the worker into a repairing state and move its
>> workload to other workers.
>> The master runs with errors returned (MPI_ERRORS_RETURN) rather than
>> the default abort behavior.
>>
>> One thing to note about the following code: it is not optimized. I
>> did not try to overlap computation with communication (which is
>> possible); I tried to keep it as simple as possible for the purpose
>> of discussion.
>>
>> Pseudo code for the hardened master; the code needed for repairing
>> the failed ranks is highlighted (the lines marked with the extra
>> '>>>' prefix below).
>> int main()
>> {
>> MPI_Init()
>>>>> MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>> MPI_Comm_size(MPI_COMM_WORLD, &n);
>>
>> MPI_Request r[n] = MPI_REQUEST_NULL;
>> QueryMessage q[n];
>> AnswerMessage a[n];
>> int active_workers = 0;
>>
>>>>> bool repairing[n] = false;
>>
>> //
>> // Phase 1: send initial requests
>> //
>> for(int i = 1; i < n; i++)
>> {
>> if(get_next_query(stream, &q[i]) == eof)
>> break;
>>
>> active_workers++;
>> MPI_Send(dest=i, &q[i], MPI_COMM_WORLD);
>> rc = MPI_Irecv(src=i, buffer=&a[i], request=&r[i],
>> MPI_COMM_WORLD)
>>>>> if(rc != MPI_SUCCESS)
>>>>> {
>>>>> start_repair(i, repairing, q, a, r, stream);
>>>>> }
>> }
>>
>> //
>> // Phase 2: finalize any unnecessary ranks
>> //
>> for(int i = active_workers + 1; i < n; i++)
>> {
>> MPI_Send(dest=i, &done_msg, MPI_COMM_WORLD);
>> }
>>
>>
>> //
>> // The progress engine. Get answers; send new requests and handle
>> // process repairs
>> //
>> while(active_workers != 0)
>> {
>> rc = MPI_Waitany(n, r, &i, MPI_STATUS_IGNORE);
>>
>>>>> if(!repairing[i])
>>>>> {
>>>>> if(rc != MPI_SUCCESS)
>>>>> {
>>>>> start_repair(i, repairing, q, a, r, stream)
>>>>> continue;
>>>>> }
>>
>> process_answer(&a[i]);
>>>>> }
>>>>> else if(rc != MPI_SUCCESS)
>>>>> {
>>>>> // repair failed; give up on this worker
>>>>> active_workers--;
>>>>> continue;
>>>>> }
>>>>> else
>>>>> {
>>>>> // repair succeeded; worker i is usable again
>>>>> repairing[i] = false;
>>>>> }
>>
>> if(get_next_query(stream, &q[i]) == eof)
>> {
>> active_workers--;
>> MPI_Send(dest=i, &done_msg, MPI_COMM_WORLD)
>> }
>> else
>> {
>> MPI_Send(dest=i, &q[i], MPI_COMM_WORLD)
>> rc = MPI_Irecv(src=i, buffer=&a[i], request=&r[i],
>> MPI_COMM_WORLD)
>>>>> if(rc != MPI_SUCCESS)
>>>>> {
>>>>> start_repair(i, repairing, q, a, r, stream);
>>>>> }
>> }
>> }
>>
>> MPI_Finalize()
>> }
>>
>>
>>>>> void start_repair(int i, bool repairing[], QueryMessage q[],
>> AnswerMessage a[], MPI_Request r[], Stream stream)
>>>>> {
>>>>> repairing[i] = true;
>>>>> push_query_back(stream, &q[i]);
>>>>> MPI_Comm_Irepair(MPI_COMM_WORLD, i, &r[i]);
>>>>> }
>>
>>
>>
>> Logic description (without FT)
>> The master code keeps track of the number of active workers through
>> the active_workers variable. It is solely used for the purpose of
>> shutdown. When the master is out of input, it shuts down the workers
>> by sending them a 'done' message. It decreases the number of active
>> workers and finalizes when this number reaches zero.
>>
>> The master's progress engine waits on a vector of requests (note
>> that entry 0 is not used, to simplify the code); once it gets an
>> answer it processes it and sends the next query to that worker,
>> until it's out of input.
>>
>> Logic description (with FT)
>> The master detects a faulty client either synchronously, when it
>> tries to initiate an async receive (no need to check the send; the
>> assumption is that if the send failed, so will the receive call), or
>> asynchronously, when the async receive completes with an error. Once
>> an error is detected (and identified as a faulty client, more about
>> this later), the master starts an async repair of that client. If
>> the repair succeeds, new work is sent to that client. If it does
>> not, the number of active workers is decreased and the master has to
>> live with less processing power.
>>
>> The code above assumes that if the returned code is an error, it
>> should repair the worker; however, as we discussed, there could very
>> well be many different reasons for an error here, not all of which
>> are related to process failure; for that we might use something
>> along the lines of
>>
>> if(MPI_Error_event(rc) == MPI_EVENT_PROCESS_DOWN)...
>>
>> It would be the responsibility of the MPI implementation to encode
>> or store the event related to the returned error code.
>> (Note: in MPICH2 there is a mechanism that enables encoding extended
>> error information in the error code, which can then be retrieved
>> using MPI_Error_string.)
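>>
>> (For example, with the calls that exist today one can at least decode
>> and report the failure before deciding what to do; only MPI_Error_event
>> and MPI_EVENT_PROCESS_DOWN above would be new:)
>>
>> int errclass, len;
>> char errstr[MPI_MAX_ERROR_STRING];
>>
>> MPI_Error_class(rc, &errclass);
>> MPI_Error_string(rc, errstr, &len);
>> fprintf(stderr, "request for worker %d failed: %s\n", i, errstr);
>>
>> // with the proposed API, only start a repair when the error really
>> // means the peer process is down
>> if(MPI_Error_event(rc) == MPI_EVENT_PROCESS_DOWN)
>>     start_repair(i, repairing, q, a, r, stream);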
>>
>> Conclusions
>> I believe that the solution above describes what we have discussed
>> in the last meeting. The required APIs to support this FT are
>> really minimal but already cover a good set of users.
>>
>> Please, send your comments.
>> Thoughts?
>>
>> Thanks,
>> .Erez
>>
>> P.S. I will post this on the FT wiki pages (with the feedback).
>> P.P.S. There is one more scenario that we discussed, an extension
>> of the master-workers model. I will try to write it up as soon as
>> possible.
>>
>>
>>
_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft