[Mpi3-ft] mpi3-ft Digest, Vol 20, Issue 2

Wed Oct 21 02:57:06 CDT 2009

Just a few comments, not to respond directly to any specific question, but sort of to refocus the discussions.  I think the recent set of discussions was intended to get clarification on the document that we have started to write, which is far from ready for any sort of "public" presentation.

First, remember that in MPI a normally terminating process is required to call either MPI_Abort() or MPI_Finalize() (hope I am not forgetting something here).  Here we are trying to deal with the case that a process has "terminated" (either real termination, or loss of contact) w/o satisfying this criterion.  This is what is meant by a "failed rank".  The question is how to respond to this failure (and on the implementation side of things, which we are NOT addressing in the standard, how to distinguish between a bad program (segfault, ...) and system failure), and this is what we are trying to define.

The user specifies response policies, and must make a specific request to restart/replace/or whatever else you may want to call this.  This request can be local, e.g. MPI functionality with the calling process/es is returned to "normal" state (call it restore, if you want), or it can be global with respect to the communicator, which allows global communications (aka collectives) to proceed successfully.  The nature of recovery needed really depends on the nature of the application, and I would expect that many current HPC applications would need the collective restoration.

The rejoin functionality gives the restored/replaced rank the chance to become a member again (rejoin) of existing communicators, and reinitialize user level handles associated with this communicator.  What we have not specified, and we do need to, is what will happen if a rank is restored, but elects not to rejoin a communicator of which its predecessor was a member.

MPI_Comm_spawn() has very clearly defined semantics, and starts new MPI processes in new communicators (one intercommunicator, and 3 intra-communicators, comm_world, comm_self, and comm_null).  Let's not add yet another meaning to comm_spawn - we would have to change the interface, and break existing codes (the few that use this function).  Maybe we would want to call this MPI_Comm_recover(), but it would be very different than spawn - if 2 surviving ranks in a communicator ask to restore the same failed rank, only one new instance of this restored rank will be created.
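
The "only one new instance" behavior can be pictured with a small toy model. This is plain Python, not MPI code, and every name in it (RestoreCoordinator, restore, instances_created) is hypothetical; it only sketches the idea that repeated requests for the same failed rank map to a single replacement.

```python
class RestoreCoordinator:
    """Toy model: one replacement per failed rank, however many survivors ask."""

    def __init__(self):
        # (communicator name, failed rank) -> id of the replacement instance
        self._replacements = {}
        self._next_instance = 0

    def restore(self, comm_name, failed_rank):
        """Return the replacement id, creating it only on the first request."""
        key = (comm_name, failed_rank)
        if key not in self._replacements:
            self._replacements[key] = self._next_instance
            self._next_instance += 1
        return self._replacements[key]

    def instances_created(self):
        return self._next_instance


coordinator = RestoreCoordinator()
from_rank_0 = coordinator.restore("FOO", 2)  # survivor 0 asks for rank 2
from_rank_1 = coordinator.restore("FOO", 2)  # survivor 1 asks for the same rank
```

Both survivors end up referring to the same replacement, which is the difference from a spawn-like call that would create a new process per request.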

Rich

On 10/20/09 11:25 PM, "Solt, David George" <david.solt at hp.com> wrote:

I think a detailed explanation of R^3 will be helpful.  We can cut it down for the final document, but we have to keep in mind that this work is very involved and may require more explanation than other MPI API's which are more straightforward.  It is my belief that for MPI-1 and much of MPI-2, the user community already knew what they wanted from MPI before the spec came out.  They had an idea of the basic concepts (communicators, collectives, point-to-point, etc) and the API simply told people how to access functionality they already wanted.  For Fault-tolerance, I think there are many "consumers" that have no expectation about how fault-tolerance should happen (only that they want it) and will read the MPI-3 standard looking for answers.

I will wait on the updates to the rejoin/restore/recover explanation before commenting on each individual point in my e-mails.   Don't worry though, I will make sure to review all my questions and make sure that either I understand the answer or the spec now addresses my concern.  As I read through the responses to my comments, there were a couple of key themes that I think stood out:

1)  What does a "failed rank" mean?  If a communicator has ranks 0, 1 & 2 and rank 2 exits, what is the status of that communicator from rank 0 and 1's perspective?   I think we need clarity on this concept.   MPI_Comm_validate returns a failed process count and I don't think we have communicated precisely enough what that means.   Do rank 0 and 1 return that there is 1 failed rank because rank 2 has exited?  Do rank 0 and 1 return that there are 3 failed ranks since 0 is failed (can't talk to 2), 1 is failed (can't talk to 2), 2 is failed (nobody can reach it, so we don't know its status)?   Once we have agreement on what should happen in this simplest case, then we need to expand it to more difficult scenarios (some ranks can't talk to other ranks, etc.)

2) How much do we rely on existing API's vs. creating new API's.   One option is to use existing API's when possible if the API itself communicates enough information to provide the functionality we want.  An example is using MPI_Comm_spawn for restoring failed ranks vs. using a new API like MPI_Comm_restore.  I think there are pros and cons to both approaches.  I prefer not to have a spec with multiple ways to do something, but at the same time I understand that hijacking existing routines and forcing them to make certain fault-tolerance guarantees may be concerning for some people.   However, I believe that fault-tolerance is never going to be limited to just the API's we propose.  For example, an MPI-2.x compliant implementation could hang in a call to MPI_Send to an exited rank.   There are no guarantees about what happens in the presence of failures.  In an MPI-3 Fault Tolerant implementation, MPI_Send cannot hang when communicating with an exited rank.  We have implicitly added further requirements to that API.  I think that it is ok for us to do the same with other APIs (MPI_Comm_spawn or MPI_Comm_set_errhandler).

3) Another general theme is to what extent recovery (all aspects: rejoin, restore, recovery) is a global operation or a local operation.   Hopefully that is being clarified already.  I believe that we will discover that many useful recovery scenarios will require some form of global cooperation, which in turn requires a strong agreement among ranks on what needs to be fixed and when to make the collective calls to fix it without deadlocking.  We have an internal non-compliant MPI implementation that does this in a very aggressive way.  If a rank "sees" a failure on a communicator, it makes a non-compliant MPI call to actively go out and cause all communication (on any communicator) to fail for all other processes in the target communicator.  This is a very large hammer, but it was the only way we found that would allow code of any complexity to be written deadlock-free.
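
To make the two readings in theme 1 concrete, here is a toy sketch of the ranks 0, 1, 2 example. It is plain Python rather than MPI, the function names are hypothetical, and neither reading is claimed to be what MPI_Comm_validate specifies; it only shows that the two interpretations give different counts for the same event.

```python
def failed_by_exit(ranks, exited):
    """Reading A: a rank is failed only if the process itself has exited."""
    return [r for r in ranks if r in exited]


def failed_by_reachability(ranks, exited):
    """Reading B: a rank is failed if it exited or cannot reach some peer."""
    failed = []
    for r in ranks:
        peers = [p for p in ranks if p != r]
        if r in exited or any(p in exited for p in peers):
            failed.append(r)
    return failed


ranks = [0, 1, 2]
exited = {2}
count_a = len(failed_by_exit(ranks, exited))          # 1 failed rank
count_b = len(failed_by_reachability(ranks, exited))  # 3 failed ranks
```

Under reading B, a single exit marks every member of the communicator as failed, which is part of why the simplest case needs to be pinned down first.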

Again, I will return to my other comments and Richard's responses on a point-by-point basis at a later time.

Thanks,
Dave S.

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Barrett, Richard F.
Sent: Friday, October 09, 2009 10:13 AM
To: mpi3-ft at lists.mpi-forum.org
Subject: Re: [Mpi3-ft] mpi3-ft Digest, Vol 20, Issue 2

Folks,
   Awhile back David Solt sent some questions concerning the FT API to the working group mail list, and I've interleaved my thoughts on the topics. It seems I have more questions than answers, so I would appreciate your feedback prior to sending anything out to the list. David, thanks greatly for your careful reading of the draft. This is invaluable feedback, so much so that it's taken a long time to address, though still incompletely.

General:

 1.  Discussion occurred during the recent telecon, though I don't believe I fully captured that discussion.
 2.  I will attempt a stronger discussion of rejoin, restore, recover (aka R^3). May be more wordy than a spec calls for, so may be pulled out later, included in a more discussion-like paper, but will need a crisp discussion in the spec.
 3.  I will attempt to address some of the issues in the draft spec, esp. "Advice to" parts discussed below. However, your input is greatly needed, now or once I put out a strawman.
 4.  I've also mentioned some specific places I will modify the spec.
 5.  I've also asked for anyone's input in many areas. And of course welcome it in any area.
 6.  And I've also asked for clarification from David.

Please keep in mind that I am from the user side, so please understand ignorant statements/questions regarding obvious system (and other) issues. I hope this is only to the extent that it encourages you to correct rather than ignore :) And finally, I've been up 72 hours with about 6 dedicated to sleep, and since I am not nearly as resilient as Rich Graham, I am likely incoherent. (Obviously I lean heavily on excuses:)

Richard
> -----Original Message-----
> From: Solt, David George
> Sent: Tuesday, September 01, 2009 10:36 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: [Mpi3-ft] Questions about the the Proposed API document
>
> Hi all,
>
> Admittedly, I have missed a lot of discussion during recent months, so feel
> free to ignore questions that have already been answered.
>
> Thanks,
> Dave
>
>
> 0)  MPI_ERROR_REPORTING_FN
>
>         a)  How is this different from using MPI_ERRORHANDLER_CREATE and
> MPI_COMM_ERRHANDLER_SET

There appears to be (close-to-) consensus that the current error handling functionality does, or should be able to, provide the mechanisms we are after. It seems the intent may not be specific to our work, but perhaps could be "modified" to allow for the FT functionality we've been discussing. For example, the "Advice to Implementers" in the MPI error handling section (p277) talks about FT: "A good quality implementation will, to the greatest possible extent, circumscribe the impact of an error, so that normal processing can continue after an error handler was invoked."

However, as currently written the MPI error handler seems designed to eliminate user participation in the process: "The purpose of these error handlers is to allow a user to issue user-defined error messages and to take action unrelated to MPI (such as flushing buffers) before a program exits." (p277) Otoh, there is text that seems to muddle that notion. Anyway, the apparent goal of what we've written thus far is to allow the user to supplement error messages and handling internal to a FT mpi implementation. That is, among other things, the user has already attached FT attributes to communicator(s).

>
> 1)  MPI_Comm_validate.
>
>         a) Is it required that all callers return the same value for
> failed_process_count and failed_ranks?

This is collective, so all participants should receive the same values. That said, this points to a particular issue we need to address/discuss, ie additional failures that occur after a process returns from the call while other processes have not. Anyone have a paragraph or so for inclusion in an advice to users? And probably implementers as well?

>         b) If ranks are partitioned into two groups A and B such that all
> ranks in A can communicate and all ranks in B can communicate, but a rank in A
> cannot communicate with a rank in B, what should failed_ranks return for a
> rank in A?  a rank in B?

Does partition mean subcommunicators? If so, we're covered. If not, we're still covered :) If a rank cannot communicate with another, there is a failure, and it should be returned as such.

>         c) I was told that MPI_Comm_validate uses a phased system such that
> the result of the call is based on the callers' states prior to the call or at
> the start of the call but with the understanding that the results are not
> guaranteed to be accurate at the return of the call.   Is this accurate?  If
> so, can you show an example of where this call would either simplify an
> application code or allow for a recovery case that would not be possible
> without it?

I believe this is correct, but need an example. Anyone?

> 2) MPI_Comm_Ivalidate.
>
>         a) Is there a way to wait on the resulting request in such a way that
> you can access the failed_process_count and failed_ranks data?

Oversight on our part - I will add the arrays so that the blocking and non-blocking APIs are analogous to, for example, MPI_Recv and MPI_Irecv.

> 3) MPI_Comm_restorable.
>
>         a) Does this return count=0 for a rank that is already a member of the
> original application launched MPI_Comm_world?
>         The following assumes that the answer to the above question is yes:
> In order for this to have the data necessary, a "replacement" process must be
> created through MPI_Comm_restore (i.e. users can't bring their own singletons
> into existence through a scheduler, etc.)

Sounds like a good idea. But are we missing something? Can you please elaborate?

> 4) MPI_Comm_rejoin.
>
>         a) Is this intended only to be used by a process that was not
> previously a member of comm_names and the caller replaces an exited rank that
> was a member of comm_names?

Yes.

>         b) Must MPI_Comm_rejoin and MPI_Comm_restore be used in matching way
> between existing ranks and newly created ranks?  If ranks A and B call
> MPI_Comm_restore, which creates a new replacement rank C, will the call to
> MPI_Comm_restore hang until MPI_Comm_rejoin is called by C?

Greg: Comm_rejoin called by any rank process - too generic. If I have a process created by comm_restore, can rejoin. Restore is about existence, rejoin about communicators. By default, restore gets a process into mpi_comm_world and comm_self. Otherwise, need to rejoin other comms. I'll try to make this clear in the text.

> 5) MPI_Comm_restore.
>
>         a) Does this create processes (I have assumed so in Q#4b above)?   If
> so, I suggest that we learn from the problem with MPI_Comm_spawn from MPI-2
> that interaction with a scheduler should be considered as we develop the API.

We need to better understand the comm_spawn issue. Current schedulers won't let you change size of resource allocated. Need "Advice to Users" in MPI_Comm_spawn section as well as FT in order to "alert" the user to the potential issues.

> 6) MPI_Comm_proc_gen/MPI_Comm_gen
>
>         a) The name MPI_Comm_proc_gen seems like it should be MPI_Proc_gen.
> I see that all other routines are prefixed with MPI_Comm_, but I think that
> they all genuinely involve aspects of a communicator except for this one.

Agreed, it is not as much about communicators as other routines. So where should this then go? "Process Creation and Mgmt" seems appropriate, but that section seems ignored by most, and besides, the statement is that  "If restored by MPI fault tolerance capabilities, the process generation is incremented by one. The initial generation is zero." To me then this would be useful for tracking the fault and resilience history of a particular run. From that it then makes sense to me to include it in the FT chapter since it is really a "utility" for that functionality. Comments?
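
As a rough picture of the generation rule quoted above (initial generation zero, incremented by one on each restore), here is a toy sketch in plain Python. The class and method names are hypothetical and this is not the proposed MPI interface; it only shows how the counter could serve as a record of a run's fault history.

```python
class GenerationTracker:
    """Toy model of per-rank process generations over the life of a run."""

    def __init__(self, num_ranks):
        # Every rank starts at generation zero.
        self.generation = {rank: 0 for rank in range(num_ranks)}

    def restore(self, rank):
        """Record that `rank` was restored; its generation goes up by one."""
        self.generation[rank] += 1
        return self.generation[rank]

    def restored_ranks(self):
        """Ranks that have been restored at least once."""
        return [r for r, g in self.generation.items() if g > 0]


tracker = GenerationTracker(3)
tracker.restore(2)  # rank 2 fails and is restored once
tracker.restore(2)  # and again later in the run
```

After these two restores, rank 2 is at generation 2 while ranks 0 and 1 remain at generation 0.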

>         b) Ranks 0,1,2 are part of a restorable communicator C.  Rank 2 dies.
> Rank 0 calls MPI_Comm_restore.   Is rank 1 obligated to make any calls before
> using communicator C successfully?   What will MPI_Comm_proc_gen return for
> rank 0? 1?  2?   What will MPI_Comm_gen return for rank 0? 1? 2?

I believe the immediate above discussion addresses the latter two questions. But does point to the need to strengthen the text (which Rainer calls for in the discussion text), which I'll try to do. Perhaps enough to say that this is a local call? Regarding the first question, this is part of the, should I say, "race condition" issue discussed above? The communicator is "broken", which rank 1 will be informed of when it makes a call to the communicator. So my question then: MPI_Comm_restore is collective, and therefore all processes in the communicator must call it before any can return. What if a rank rarely (legal) if ever (bad code) actually contacts that rank? But wait, that's certainly the situation for MPI_COMM_WORLD.
>
> 7) General question:
>
>         a) If rank x fails to communicate using point-to-point communication
> to rank y over communicator C, is it guaranteed that any collective call made
> by rank x or y on communicator C will immediately fail (even if the path
> between x and y is not used for the collective)?  (or is it up to the
> implementation)

Sounds like an undefined situation (i.e. A rank is a member of a communicator across which it doesn't communicate). Implementation dependent behavior, then? Which calls for an "Advice to U/I".

> Some more questions for us to think about.  It is quite possible that I have
> some fundamental flaws in my thinking that make some, many or all of these
> questions invalid.  So, I ask that if anyone sees a basic fallacy in my view
> of how these calls are intended to work that you point that out to me first
> and I can review my questions and see if there are still issues that do not
> make sense to me.

My view is that any confusion you might have is certainly from an educated perspective, so must be addressed/clarified.

>
> 8) If MPI_Comm_rejoin passes MPI_COMM_WORLD, then does it really change its
> size of MPI_COMM_WORLD?

My understanding is that we don't want to "replace" ranks unless and until the user makes it so. So new ranks may append to the set of ranks, which includes failed ones. But MPI_COMM_SIZE will return the number of active ranks. Did I say this correctly?

> 9) Why does MPI_Comm_restore take an array of ranks to restore?   Shouldn't
> the restored ranks be based on MPI_PROC_RESTORE_POLICY? Or maybe a better way
> to ask: "What call does MPI_PROC_RESTORE_POLICY influence?"

Calls for a "Rationale" or "Advice" text imo. I don't believe we want an API that takes an array in one case and a scalar in another. Analogous to transmission of data. But again, am I understanding the question?

> 10) The API document says this regarding MPI_Comm_Irestore: "It is local in
> scope, and thus restores local communications (point-to-point, one-sided,
> data-type creation, etc.), but not collective communications."  If this is
> true, then how do you restore collective communications?   Can you then go on
> to collectively call MPI_Comm_restore_all?   If you do, would every rank need
> to specify that no new ranks are to be created since they have already been
> created by the earlier call to MPI_Comm_restore?  Also, I don't think it is
> *really* local in scope, if it was, there would be no reason to have a
> non-blocking version.

I expect much of this will be clarified with a stronger discussion of R^3 (mentioned above). But the non-blocking version, to my understanding, is designed to allow execution to continue while restoration takes place. For example, a Monte Carlo code can simply say, "that process is not there so I will proceed without its data. But I would like 'it' back if possible, so I'll check later."
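
The Monte Carlo pattern just described can be sketched as a toy simulation. This is plain Python, not MPI; FakeRestoreRequest and monte_carlo_steps are hypothetical names, and the "restore completes after N polls" behavior is an assumption made only so the example is deterministic.

```python
class FakeRestoreRequest:
    """Hypothetical stand-in for the request a non-blocking restore might return."""

    def __init__(self, polls_until_done):
        self._remaining = polls_until_done

    def test(self):
        """Poll once; return True once the simulated restore has completed."""
        if self._remaining > 0:
            self._remaining -= 1
        return self._remaining == 0


def monte_carlo_steps(steps, ranks, failed_rank, request):
    """Return how many ranks contributed data at each step."""
    restored = False
    contributors = []
    for _ in range(steps):
        if not restored:
            restored = request.test()  # "I'll check later"
        # Keep working with whoever is available at this step.
        active = [r for r in ranks if restored or r != failed_rank]
        contributors.append(len(active))
    return contributors


history = monte_carlo_steps(4, [0, 1, 2], 2, FakeRestoreRequest(2))
```

The survivors never stall: they run with two contributors until a poll reports the restore is complete, then continue with three.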

>
> 11) MPI_COMM_REJOIN - It seems like the resulting communicator should be
> collective-capable if the calling process was created through a call to
> MPI_Comm_restore_all and not collective-capable if created through a call to
> MPI_Comm_restore?  If we go with that, there should be a way for the caller of
> MPI_Comm_rejoin to know the status of the communicator with respect to
> collectives.

Another that will become clear with a stronger discussion of R^3 (he says optimistically, or is that lackadaisically? :)

> 12) MPI_Comm_restore specifies ranks of communicators that should be restored.
> I assume it will block until the restored ranks call MPI_Comm_rejoin on those
> communicators?   (I say that because of the line "[MPI_Comm_restore ] is local
> in scope, and thus restores local communications...".  Restoring local
> communications to who?   I assume to the process created by MPI_Comm_restore?
> If it does not block, how does it know when it can safely talk to the restored
> ranks using any of the communicators they are in?   So, I assume it blocks.
> That seems to imply that the restored rank MUST call MPI_Comm_rejoin on the
> communicator referenced in its creation.   If rank 0 calls MPI_Comm_restore()
> and passes in Rank 3 of communicator FOO, then the restored process must call
> MPI_Comm_rejoin on communicator FOO.  But when the restored rank 3 calls
> MPI_Comm_recoverable, it could return several communicators and rank 3 has no
> idea that it MUST call MPI_Comm_rejoin on some, but is not required to call
> MPI_Comm_rejoin on others?

If it blocks, the process can't call MPI_Comm_rejoin. So do we have a contradiction in the spec? Here's to hoping that clarification of R^3 makes this question go away :)

> 13) What does MPI_COMM_WORLD look like before the new process calls
> MPI_COMM_REJOIN.  If the process was created through a call to
> MPI_Comm_restore that specified multiple ranks to be restored, are all of
> those ranks together in an MPI_COMM_WORLD until they call MPI_Comm_rejoin?  Is
> the MPI_Comm_rejoin call collective across all of those newly created
> processes or can they all call one at a time at their leisure?

A "broken" communication will have invalid ranks, revealed to the calling process when it attempts to use the rank, and in the manner specified by the FT configuration (default or user specified). But again, the R^3 text will have to clear this up. Also, see the "Discussion" text in the spec for an additional issue. And regarding MPI_COMM_WORLD (and MPI_COMM_SELF), these are by default intended to be restored (from a local view). Hmmm, that statement confuses even me. So he punts back to R^3.

> 14) Is there anything we are proposing with MPI_Comm_rejoin/restore that
> cannot be accomplished with MPI_Comm_spawn, MPI_Comm_merge?  The only thing I
> can think of is that MPI_COMM_WORLD cannot be "fixed" using
> MPI_Comm_spawn/merge, but only because it is a constant.

My understanding is that we are attempting to bridge the gap between an "invisibly" fault tolerant implementation and a fully user controlled scheme, where that gap may be small (or non-existent?) to large.

> 15) ranks_to_restore struct is not defined in the version of API I have.

I've confused myself with my attempt to be clever. Is this the "Who's on first and what's on second" discussion (note Abbott and Costello reference), meaning I didn't understand either so gave placeholder names. Does your copy include this discussion?

> 16) MPI_Comm_restore seems to be based on the idea that some ranks have
> exited.   What if rank A cannot talk to rank B, but rank B still exists and
> can talk to rank C?  What does it mean to restore a rank in this case?  None
> of the ranks are gone, they are just having communication problems.   It seems
> like there should be some way to come up with a failure free set of ranks such
> that all the ranks in the set can communicate across all process pairs.

My understanding is that a process in a communicator that cannot communicate with all processes in that communicator indicates a fault. But who is at fault may be the appropriate question that you are asking. R^3 discussion? Your last sentence, however, seems to me to point to a new communicator the user would have to create.
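
The "failure free set of ranks" idea from question 16 can be illustrated with a brute-force toy. This is plain Python, not MPI; the function name is hypothetical, and the assumption that A can still talk to C is mine, since the question does not say. The sketch also shows that the largest such set need not be unique, which is one way of seeing the "who is at fault" problem.

```python
from itertools import combinations


def failure_free_sets(ranks, broken_links):
    """Return all largest subsets of ranks in which every pair can communicate."""
    broken = {frozenset(link) for link in broken_links}
    # Try the largest candidate sets first; stop at the first size that works.
    for size in range(len(ranks), 0, -1):
        found = [
            set(candidate)
            for candidate in combinations(ranks, size)
            if not any(frozenset(pair) in broken
                       for pair in combinations(candidate, 2))
        ]
        if found:
            return found
    return []


# A cannot talk to B; B can talk to C; A-C is assumed to work.
candidates = failure_free_sets(["A", "B", "C"], [("A", "B")])
```

Here both {A, C} and {B, C} qualify, so some rule (or the user) would have to choose which ranks form the new communicator. The exhaustive search is only meant for tiny examples.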

> 17) Ranks 0, 1, & 2 are in Comm FOO. Rank 2 dies.   Rank 0 calls
> MPI_Comm_restore({FOO,2}) and can now communicate with 2 once again using
> point-to-point calls?   Is there a way that 1 can ever restore communication
> to the new rank 2?   I believe the only way is that all ranks (including the
> new rank 2) collectively call MPI_Comm_restore({})?  I'm not sure that is a
> problem, but I wanted to check my understanding of how these calls work.

The first answer that popped into my head contradicts something I said above (ie the communicator is broken). So R^3?

Just my pass at addressing a broad set of issues. Please, please, please, don't try to spare my feelings, just view this as the start of what should be a strong discussion.

Richard
--
  Richard Barrett
  Application Performance Tools group
  Computer Science and Mathematics Division
  Oak Ridge National Laboratory

  http://users.nccs.gov/~rbarrett

_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
