[Mpi3-ft] MPI Fault Tolerance scenarios

Thu Feb 26 10:10:37 CST 2009

I thought that MPI_Comm_Irepair just repaired the communicator, not re- 
spawn a process? I remember that we discussed that the spawning of  
processes should be done using the MPI_Comm_spawn() commands. Though  
we will need to think about how to extend them to say spawn-and-rejoin- 
communicator-X.

Has something changed with this since our previous discussions?

Best,
Josh

On Feb 25, 2009, at 11:44 AM, Erez Haba wrote:

> MPI_Comm_Irepair
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org 
> ] On Behalf Of Supalov, Alexander
> Sent: Wednesday, February 25, 2009 8:40 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
>
> Thanks. What call is taking care of respawning? This is not crystal  
> clear from the scenario I'm afraid.
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org 
> ] On Behalf Of Erez Haba
> Sent: Wednesday, February 25, 2009 5:35 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
> Hi Alexander,
>
> As explained in the intro section, we are building the scenarios  
> bottom up to create solid foundations for the FT architecture.  
> However you are right and the scenarios will expand to use  
> collectives and other communicators, just no in this first scenario.
>
> This scenario is very very simple, and very very basic, to show the  
> absolute minimum required for process FT. It only uses point-to- 
> point communication over com world. The workers are stateless and  
> can be respawned without any need for state recovery.
>
> Thanks,
> .Erez
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org 
> ] On Behalf Of Supalov, Alexander
> Sent: Wednesday, February 25, 2009 6:25 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
>
> Dear  Erez,
>
> Thank you. A couple of questions:
>
> 1. You seem to restrict communication to pt2pt only. Why? A Bcast  
> upfront could be useful, for one.
> 2. I can imagine more complicated communicator combinations than  
> only MPI_COMM_WORLD. Why do we require one communicator?
> 3. It appears that failed slaves cannot be simply respawned. Is this  
> what a repair would do anyway?
>
> Best regards.
>
> Alexander
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org 
> ] On Behalf Of Erez Haba
> Sent: Wednesday, February 18, 2009 3:53 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: [Mpi3-ft] MPI Fault Tolerance scenarios
> Hello all,
>
> In our last meeting we decided to build a set of FT scenarios/ 
> programs to help us understand the details of the interface need to  
> support those scenarios. We also decided to start with very simple  
> scenarios and add more complex ones as we understand the former  
> better.  I hope that starting with simple scenarios will help us  
> build a solid foundation on which we can build the more complex  
> solutions.
>
> When we build an FT solution we will focus on the scenario as  
> described, without complicating the solution just because it would  
> be needed later for a more complex one. The time will come later to  
> modify the solution as we acquire more knowledge and built the  
> foundations. Hence, any proposal or change that we make needs to  
> fitexactly the scenario (and all those that we previously looked at)  
> but no more.
> For example in the first scenario that we’ll look at there is no  
> need for saving communicator state or error callback; but they might  
> be required later.
>
> Note that these scenarios focus on process FT rather than checkpoint/ 
> restart or network degradation. I assume we’ll do the latter later.
> Scenario #1: Very Simple Master-Workers
> Description
> This is a very simple master-workers scenario. However simple, we  
> were asked many times by customers to support FT in this scenario.
> In this case the MPI application running with n processes, where  
> rank 0 is used as the master and n-1 ranks are used as workers.  The  
> master generates work (either by getting it directly from user  
> input, or reading a file) and sends it for processing to a free  
> worker rank. The master sends requests and receives replies using  
> MPI point-to-point communication.  The workers wait for the incoming  
> message, upon arrival the worker computes the result and sends it  
> back to the master.  The master stores the result to a log file.
>
> Hardening: The goal is to harden the workers, the master itself is  
> not FT, thus if it fails the entire application fails. In this case  
> the workers are FT, and are replaced to keep computation power for  
> this application. (a twist: if a worker cannot be recovered the  
> master can work with a smaller set of clients up to a low watermark).
> Worker
> The worker waits on a blocking receive when a message arrives it  
> process it. If a done message arrives the worker finalizes MPI and  
> exit normally.
>
> Hardening: There is not special requirement for hardening here. If  
> the worker encounters a communication problem with the master, it  
> means that the master is down and it’s okay to abort the entire job.  
> Thus, it will use the default error handler (which aborts on  
> errors).  Note that we do not need to modify the client at all to  
> make the application FT (except the master).
>
> Pseudo code for the hardened worker:
> int main()
> {
>     MPI_Init()
>
>     for(;;)
>     {
>         MPI_Recv(src=0, &query, MPI_COMM_WORLD);
>         if(is_done_msg(query))
>             break;
>
>         process_query(&query, &answer);
>
>         MPI_Send(dst=0, &answer, MPI_COMM_WORLD);
>     }
>
>     MPI_Finalize()
> }
>
>
> Notice that for this FT code there is no requirement for the worker  
> to rejoin the comm. As the only communicator used is MPI_COMM_WORLD.
>
> Master
> The master code reads queries from a stream and passes them on to  
> the workers to process. The master goes through several phases. In  
> the initialization phase it sends the first request to each one of  
> the ranks; in the second one it shuts down any unnecessary ranks (if  
> the job is too small); I the third phase it enters its progress  
> engine where it handles replies (answers), process recovery and  
> termination (on input end).
>
> Hardening: It is the responsibility of the master to restart any  
> failing workers and make sure that the request (query) did not get  
> lost if a worker fails. Hence, every time an error is detected the  
> master will move the worker into repairing state and move its  
> workload to other workers.
> The master runs with errors returned rather than aborted
>
> One thing to note about the following code: it is not optimized. I  
> did not try to overlap computation with communication (which is  
> possible) I tried to keep it as simple as possible for the purpose  
> of discussion.
>
> Pseudo code for the hardened master; the code needed for repairing  
> the failed ranks is highlighted in yellow.
> int main()
> {
>     MPI_Init()
> >>> MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
>     MPI_Comm_size(MPI_COMM_WORLD, &n);
>
>     MPI_Request r[n] = MPI_REQUEST_NULL;
>     QueryMessage q[n];
>     AnswerMessage a[n];
>     int active_workers = 0;
>
> >>> bool repairing[n] = false;
>
>     //
>     // Phase 1: send initial requests
>     //
>     for(int i = 1; i < n; i++)
>     {
>         if(get_next_query(stream, &q[i]) == eof)
>             break;
>
>         active_workers++;
>         MPI_Send(dest=i, &q[i], MPI_COMM_WORLD);
>         rc = MPI_Irecv(src=i, buffer=&a[x], request=&r[x],  
> MPI_COMM_WORLD)
> >>>     if(rc != MPI_SUCCESS)
> >>>     {
> >>>         start_repair(i, repairing, q, a, r, stream);
> >>>     }
>     }
>
>     //
>     // Phase 2: finalize any unnecessary ranks
>     //
>     for(int i = active_workers + 1; i < n; i++)
>     {
>         MPI_Send(dest=i, &done_msg, MPI_COMM_WORLD);
>     }
>
>
>     //
>     // The progress engine. Get answers; send new requests and handle
>     // process repairs
>     //
>     while(active_workers != 0)
>     {
>         rc = MPI_Waitany(n, r, &i, MPI_STATUS_IGNORE);
>
> >>>     if(!repairing[i])
> >>>     {
> >>>         if(rc != MPI_SUCCESS)
> >>>         {
> >>>             start_repair(i, repairing, q, a, r, stream)
> >>>             continue;
> >>>         }
>
>             process_answer(&a[i]);
> >>>     }
> >>>     else if(rc != MPI_SUCCESS)
> >>>     {
> >>>         active_workers--;
> >>>     {
>
>         if(get_next_input(stream, &q[i]) == eof)
>         {
>             active_workers--;
>             MPI_Send(dest=i, &done_msg)
>         {
>         else
>         {
>             MPI_Send(dest=i, &q[i])
>             rc = MPI_Irecv(src=i, buffer=&a[i], request=&r[i],  
> MPI_COMM_WORLD)
> >>>         if(rc != MPI_SUCCESS)
> >>>         {
> >>>             start_repair(i, repairing, q, a, r, stream);
> >>>         }
>         }
>     }
>
>     MPI_Finalize()
> }
>
>
> >>> void start_repair(int i, int repairing[], Query q[], Answer q[],  
> MPI_Request r[], Stream stream)
> >>> {
> >>>     repairing[i] = true;
> >>>     push_query_back(stream, &q[i]);
> >>>     MPI_Comm_Irepair(MPI_COMM_WORLD, i, &r[i]);
> >>> }
>
>
>
> Logic description (without FT)
> The master code keeps track of the number of active workers through  
> the active_workers variable. It is solely used for the purpose of  
> shutdown. When the master is out of input, it shuts-down the workers  
> by sending them ‘done’ message. It decrease the number of active  
> workers and finalizes when this number reaches zero.
>
> The master’s progress engine waits on a vector of requests (note  
> that entry 0 is not used, as to simplify the code); one it gets an  
> answer it processes it and sends the next query to that worker until  
> it’s out of input.
>
> Logic description (with FT)
> The master detects a faulty client either synchronously when it ties  
> to initiate an async receive (no need to check the send, the  
> assumption is that if send failed, so will the receive call), or  
> async when the async receive completes with an error. Once an error  
> detected (and identified as a faulty client, more about this later),  
> the master starts an async repair of that client. If the repair  
> succeeds, new work is sent to that client. If it does not, the  
> number of active workers is decreased and the master has to live  
> with less processing power.
>
> The code above assumes that if the returned code is an error, it  
> should repair the worker; however as we discussed, there could very  
> well be many different reasons for an error here, which not all are  
> related to process failure; for that we might use something in lines  
> of
>
> if(MPI_Error_event(rc) == MPI_EVENT_PROCESS_DOWN)...
>
> it would be the responsibility of the MPI implementation to encode  
> or store the event related to the returned error code.
> (Note: in MPICH2 there is a mechanism that enables encoding extended  
> error information in the error code, which then can be retrieved  
> using MPI_Error_string)
>
> Conclusions
> I believe that the solution above describes what we have discussed  
> in the last meeting. The required API’s to support this FT are  
> really minimal but already cover a good set of users.
>
> Please, send your comments.
> Thoughts?
>
> Thanks,
> .Erez
>
> P.S. I will post this on the FT wiki pages (with the feedbac).
> P.P.S. there is one more scenario that we discussed, and extension  
> of the master-workers model. I will try to get it write us as-soon- 
> as-posible.
>
>
>
> ---------------------------------------------------------------------
> Intel GmbH
> Dornacher Strasse 1
> 85622 Feldkirchen/Muenchen Germany
> Sitz der Gesellschaft: Feldkirchen bei Muenchen
> Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
> Registergericht: Muenchen HRB 47456 Ust.-IdNr.
> VAT Registration No.: DE129385895
> Citibank Frankfurt (BLZ 502 109 00) 600119052
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
> ---------------------------------------------------------------------
> Intel GmbH
> Dornacher Strasse 1
> 85622 Feldkirchen/Muenchen Germany
> Sitz der Gesellschaft: Feldkirchen bei Muenchen
> Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
> Registergericht: Muenchen HRB 47456 Ust.-IdNr.
> VAT Registration No.: DE129385895
> Citibank Frankfurt (BLZ 502 109 00) 600119052
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft