[Mpi3-ft] MPI Fault Tolerance scenarios
Josh Hursey
jjhursey at open-mpi.org
Thu Feb 26 10:10:37 CST 2009
I thought that MPI_Comm_Irepair just repaired the communicator and did
not respawn a process? I remember that we discussed that the spawning
of processes should be done using the MPI_Comm_spawn() commands, though
we will need to think about how to extend them to, say,
spawn-and-rejoin-communicator-X.
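
(For illustration only, not the proposed repair API: with today's
dynamic-process calls a replacement worker could be brought up roughly
as in the sketch below. "worker_exe" and work_comm are placeholders,
and the merge produces a brand-new communicator rather than repairing
MPI_COMM_WORLD, which is exactly the gap a spawn-and-rejoin extension
would have to close.)

    /* Sketch: respawn one worker using the existing MPI-2 calls. */
    MPI_Comm intercomm, work_comm;

    MPI_Comm_spawn("worker_exe", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0 /* root */, MPI_COMM_WORLD, &intercomm,
                   MPI_ERRCODES_IGNORE);

    /* Merge the child into a single intra-communicator; note this does
       not put the worker back into MPI_COMM_WORLD. */
    MPI_Intercomm_merge(intercomm, 0, &work_comm);
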
Has something changed with this since our previous discussions?
Best,
Josh
On Feb 25, 2009, at 11:44 AM, Erez Haba wrote:
> MPI_Comm_Irepair
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Supalov, Alexander
> Sent: Wednesday, February 25, 2009 8:40 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
>
> Thanks. What call is taking care of respawning? This is not crystal
> clear from the scenario, I'm afraid.
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Erez Haba
> Sent: Wednesday, February 25, 2009 5:35 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
> Hi Alexander,
>
> As explained in the intro section, we are building the scenarios
> bottom up to create solid foundations for the FT architecture.
> However, you are right: the scenarios will expand to use
> collectives and other communicators, just not in this first scenario.
>
> This scenario is very, very simple and very, very basic, to show the
> absolute minimum required for process FT. It only uses point-to-point
> communication over MPI_COMM_WORLD. The workers are stateless and
> can be respawned without any need for state recovery.
>
> Thanks,
> .Erez
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Supalov, Alexander
> Sent: Wednesday, February 25, 2009 6:25 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI Fault Tolerance scenarios
>
> Dear Erez,
>
> Thank you. A couple of questions:
>
> 1. You seem to restrict communication to pt2pt only. Why? A Bcast
> upfront could be useful, for one.
> 2. I can imagine more complicated communicator combinations than
> only MPI_COMM_WORLD. Why do we require just one communicator?
> 3. It appears that failed slaves cannot be simply respawned. Is this
> what a repair would do anyway?
>
> Best regards.
>
> Alexander
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Erez Haba
> Sent: Wednesday, February 18, 2009 3:53 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: [Mpi3-ft] MPI Fault Tolerance scenarios
> Hello all,
>
> In our last meeting we decided to build a set of FT scenarios/
> programs to help us understand the details of the interface needed to
> support those scenarios. We also decided to start with very simple
> scenarios and add more complex ones as we understand the former
> better. I hope that starting with simple scenarios will help us
> build a solid foundation on which we can build the more complex
> solutions.
>
> When we build an FT solution we will focus on the scenario as
> described, without complicating the solution just because it would
> be needed later for a more complex one. The time will come later to
> modify the solution as we acquire more knowledge and build the
> foundations. Hence, any proposal or change that we make needs to
> fit exactly the scenario (and all those that we previously looked at)
> but no more.
> For example in the first scenario that we’ll look at there is no
> need for saving communicator state or error callback; but they might
> be required later.
>
> Note that these scenarios focus on process FT rather than checkpoint/
> restart or network degradation. I assume we’ll do the latter later.
> Scenario #1: Very Simple Master-Workers
> Description
> This is a very simple master-workers scenario. However simple, we
> were asked many times by customers to support FT in this scenario.
> In this case the MPI application runs with n processes, where
> rank 0 is used as the master and n-1 ranks are used as workers. The
> master generates work (either by getting it directly from user
> input, or reading a file) and sends it for processing to a free
> worker rank. The master sends requests and receives replies using
> MPI point-to-point communication. The workers wait for an incoming
> message; upon arrival, the worker computes the result and sends it
> back to the master. The master stores the result in a log file.
>
> Hardening: The goal is to harden the workers; the master itself is
> not FT, thus if it fails the entire application fails. In this case
> the workers are FT, and are replaced to keep the computation power of
> this application. (A twist: if a worker cannot be recovered, the
> master can work with a smaller set of clients, down to a low watermark.)
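>
> (For illustration only: inside the master's progress loop the low-
> watermark twist could look like the sketch below. LOW_WATERMARK is a
> hypothetical, application-chosen constant, not part of any proposed
> API.)
>
> /* Sketch: give up once too many workers are permanently lost. */
> if(active_workers < LOW_WATERMARK)
>     MPI_Abort(MPI_COMM_WORLD, 1);   /* too few workers left to be useful */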
> Worker
> The worker waits on a blocking receive; when a message arrives it
> processes it. If a 'done' message arrives, the worker finalizes MPI
> and exits normally.
>
> Hardening: There is no special requirement for hardening here. If
> the worker encounters a communication problem with the master, it
> means that the master is down and it’s okay to abort the entire job.
> Thus, it will use the default error handler (which aborts on
> errors). Note that we do not need to modify the worker at all to
> make the application FT; only the master changes.
>
> Pseudo code for the hardened worker:
> int main()
> {
> MPI_Init()
>
> for(;;)
> {
> MPI_Recv(src=0, &query, MPI_COMM_WORLD);
> if(is_done_msg(query))
> break;
>
> process_query(&query, &answer);
>
> MPI_Send(dst=0, &answer, MPI_COMM_WORLD);
> }
>
> MPI_Finalize()
> }
>
>
> Notice that for this FT code there is no requirement for the worker
> to rejoin the comm, as the only communicator used is MPI_COMM_WORLD.
>
> Master
> The master code reads queries from a stream and passes them on to
> the workers to process. The master goes through several phases. In
> the initialization phase it sends the first request to each one of
> the ranks; in the second one it shuts down any unnecessary ranks (if
> the job is too small); in the third phase it enters its progress
> engine where it handles replies (answers), process recovery and
> termination (on end of input).
>
> Hardening: It is the responsibility of the master to restart any
> failing workers and to make sure that the request (query) does not
> get lost if a worker fails. Hence, every time an error is detected
> the master will move the worker into a repairing state and move its
> workload to other workers.
> The master runs with errors returned rather than aborting.
>
> One thing to note about the following code: it is not optimized. I
> did not try to overlap computation with communication (which is
> possible); I tried to keep it as simple as possible for the purpose
> of discussion.
>
> Pseudo code for the hardened master; the code needed for repairing
> the failed ranks is marked with '>>>'.
> int main()
> {
> MPI_Init()
> >>> MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> MPI_Comm_size(MPI_COMM_WORLD, &n);
>
> MPI_Request r[n] = MPI_REQUEST_NULL;
> QueryMessage q[n];
> AnswerMessage a[n];
> int active_workers = 0;
>
> >>> bool repairing[n] = false;
>
> //
> // Phase 1: send initial requests
> //
> for(int i = 1; i < n; i++)
> {
> if(get_next_query(stream, &q[i]) == eof)
> break;
>
> active_workers++;
> MPI_Send(dest=i, &q[i], MPI_COMM_WORLD);
> rc = MPI_Irecv(src=i, buffer=&a[i], request=&r[i],
> MPI_COMM_WORLD)
> >>> if(rc != MPI_SUCCESS)
> >>> {
> >>> start_repair(i, repairing, q, a, r, stream);
> >>> }
> }
>
> //
> // Phase 2: finalize any unnecessary ranks
> //
> for(int i = active_workers + 1; i < n; i++)
> {
> MPI_Send(dest=i, &done_msg, MPI_COMM_WORLD);
> }
>
>
> //
> // The progress engine. Get answers; send new requests and handle
> // process repairs
> //
> while(active_workers != 0)
> {
> rc = MPI_Waitany(n, r, &i, MPI_STATUS_IGNORE);
>
> >>> if(!repairing[i])
> >>> {
> >>> if(rc != MPI_SUCCESS)
> >>> {
> >>> start_repair(i, repairing, q, a, r, stream)
> >>> continue;
> >>> }
>
> process_answer(&a[i]);
> >>> }
> >>> else if(rc != MPI_SUCCESS)
> >>> {
> >>> // the repair failed: give up on this worker for good
> >>> active_workers--;
> >>> continue;
> >>> }
> >>>
> >>> // either an answer was processed or the repair succeeded
> >>> repairing[i] = false;
>
> if(get_next_input(stream, &q[i]) == eof)
> {
> active_workers--;
> MPI_Send(dest=i, &done_msg, MPI_COMM_WORLD)
> }
> else
> {
> MPI_Send(dest=i, &q[i], MPI_COMM_WORLD)
> rc = MPI_Irecv(src=i, buffer=&a[i], request=&r[i],
> MPI_COMM_WORLD)
> >>> if(rc != MPI_SUCCESS)
> >>> {
> >>> start_repair(i, repairing, q, a, r, stream);
> >>> }
> }
> }
>
> MPI_Finalize()
> }
>
>
> >>> void start_repair(int i, bool repairing[], QueryMessage q[],
> AnswerMessage a[], MPI_Request r[], Stream stream)
> >>> {
> >>> repairing[i] = true;
> >>> push_query_back(stream, &q[i]);
> >>> MPI_Comm_Irepair(MPI_COMM_WORLD, i, &r[i]);
> >>> }
>
>
>
> Logic description (without FT)
> The master code keeps track of the number of active workers through
> the active_workers variable. It is used solely for the purpose of
> shutdown. When the master is out of input, it shuts down the workers
> by sending them a 'done' message. It decreases the number of active
> workers and finalizes when this number reaches zero.
>
> The master’s progress engine waits on a vector of requests (note
> that entry 0 is not used, to simplify the code); once it gets an
> answer it processes it and sends the next query to that worker,
> until it is out of input.
>
> Logic description (with FT)
> The master detects a faulty client either synchronously, when it
> tries to initiate an async receive (no need to check the send; the
> assumption is that if the send failed, so will the receive call), or
> asynchronously, when the async receive completes with an error. Once
> an error is detected (and identified as a faulty client, more about
> this later), the master starts an async repair of that client. If
> the repair succeeds, new work is sent to that client. If it does
> not, the number of active workers is decreased and the master has to
> live with less processing power.
>
> The code above assumes that if the returned code is an error, it
> should repair the worker; however, as we discussed, there could very
> well be many different reasons for an error here, not all of which
> are related to process failure; for that we might use something
> along the lines of
>
> if(MPI_Error_event(rc) == MPI_EVENT_PROCESS_DOWN)...
>
> It would be the responsibility of the MPI implementation to encode
> or store the event related to the returned error code.
> (Note: in MPICH2 there is a mechanism that enables encoding extended
> error information in the error code, which can then be retrieved
> using MPI_Error_string.)
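>
> (For illustration only: something similar can already be sketched with
> the existing error-class calls. MPIX_ERR_PROC_DOWN below is a made-up,
> implementation-specific error class, not an agreed-upon name.)
>
> int err_class;
> char err_str[MPI_MAX_ERROR_STRING];
> int err_len;
>
> MPI_Error_class(rc, &err_class);            /* map the error code to its class */
> MPI_Error_string(rc, err_str, &err_len);    /* human-readable detail for logging */
>
> if(err_class == MPIX_ERR_PROC_DOWN)         /* hypothetical error class */
>     start_repair(i, repairing, q, a, r, stream);
> else
>     MPI_Abort(MPI_COMM_WORLD, rc);          /* not a process failure */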
>
> Conclusions
> I believe that the solution above describes what we have discussed
> in the last meeting. The APIs required to support this FT are
> really minimal but already cover a good set of users.
>
> Please, send your comments.
> Thoughts?
>
> Thanks,
> .Erez
>
> P.S. I will post this on the FT wiki pages (with the feedback).
> P.P.S. There is one more scenario that we discussed, an extension
> of the master-workers model. I will try to write it up as soon as
> possible.
>
>
>