[Mpi3-ft] draft spec update; was; mpi3-ft Digest, Vol 21, Issue 4

Barrett, Richard F. rbarrett at ornl.gov
Fri Oct 16 15:47:34 CDT 2009


I've been on travel all week, but as promised, I'm committing my work today.
I'm an iterative solver type, so I'm ok with ugly beginnings, hope you are
too :) I've committed to the svn but am also attaching the draft pdf.

First, some changes I made external of specific discussions:

1. We use the term "result" as an argument in MPI_Comm_restore. I believe
this should more appropriately be named "status" in order to line up with the
rest of the spec. However, that term is defined with fields pertaining to
the "result" of pt2pt comm, i.e. source, tag, and error. Any ideas? Just to
call attention to this, I've renamed "result" to "ft_status". Anyway, once
the struct is defined, the "description" in the function definition will be
self-explanatory, as is done with MPI_Status.

2. I've eliminated the term "struct" from the function definitions. It is
only a struct in C, and not really even there until it is subsequently
defined in the spec. Instead, it is simply referred to as an "object".
Please keep an eye out for this sort of thing throughout our text, I'll bet
I've missed some! (For example, "array of ranks to be restored" -
prototypers will ultimately define what is needed.)

3. I've added a "Misc" section containing general observations, ideas, and
thoughts, including throughout the existing document, as they relate to FT.

4. Added a subsection discussing the 'status' param in the comm_restore
section. This is done just after MPI_Recv in the existing spec, so seems
like a reasonable place to start this conversation.

5. Made some corrections thanks to careful reading by David and Rainer,
discussions with Rich, others.

6. Re-wrote the intro to 0.2 and the first two attributes as guided by (0)
below. This was motivated by the need to remove the unwanted qualifier
"CRITICAL_RANK" from the previous text. Note that it previously also mixed
the concepts of recovery and restoration. I've smoothed it to "restore"
perspective. Is this what we meant, though? And if so, should we also
include analogous "recovery" options?

Summary of what I've addressed in the questions/comments/etc below:

0. I have attempted a concise discussion (at the end of the introduction) of
the r^3 issue (recover, restore, rejoin). Pushing on through the details is
helping bring it into the light, but still needs work. If anyone has some
concise text that does so, please bring it on. And note that at this point I
have not verified that I adhere to these definitions :)

1. You are correct that the existing error handler should be able to handle
the ft requirements. It may require some text modification to the existing
error handling spec. Still working on all this...

2. Apologies for my confusion regarding MPI_Comm_size - yes, it continues to
return the number of processes in communicator as it was instantiated. (If
MPI_Comm_spawn is used to add processes, a new communicator is instantiated,
allowing MPI_Comm_size to continue as always.)

   My confusion stemmed from (among other things) the desire to want to know
the number of active processes. Current thinking is that the app must keep
track of this. Which brings up a question: do we want to provide another
function for returning this value?

3. I've added a subsection discussing the STATUS variable, mostly
cut-and-pasted from the pt2pt discussion, serving as placeholder/example;
needs to be targeted to FT issues as the prototype(s) develop.

4. Existing MPI_Comm_spawn section contains a discussion of the potential
resource manager issues.

5. David, please note a couple of questions regarding clarification of your
comments/questions inserted below. And made some corrections thanks to your
careful reading.

6. I believe the comments/questions/confusion discussed herein should be
"remembered" in a separate document discussing MPI FT.

From here on this is getting messy, but it does provide a more complete way
to communicate my changes with regard to David's questions and our
interleaved discussion.

I'm pushing on, with many thanks to Greg for very insightful and useful
questions/comments/clarifications. That said, I may not have adequately
captured this in the current draft, but am working towards it and look
forward to more input.


On 10/9/09 11:13 AM, "Richard Barrett" <rbarrett at ornl.gov> wrote:

> Folks,
>    Awhile back David Solt sent some questions concerning the FT API to the
> working group mail list, and I've interleaved my thoughts on the topics. It
> seems I have more questions than answers, so I would appreciate your feedback
> prior to sending anything out to the list. David, thanks greatly for your
> careful reading of the draft. This is invaluable feedback, so much so that
> it's taken a long time to address, though still incompletely.
> General:
> 1. Discussion occurred during the recent telecon, though I don't believe I
> fully captured that discussion.
> 2. I will attempt a stronger discussion of rejoin, restore, recover (aka R^3).
> May be more wordy than a spec calls for, so may be pulled out later, included
> in a more discussion-like paper, but will need a crisp discussion in the spec.
> 3. I will attempt to address some of the issues in the draft spec, esp.
> "Advice to" parts discussed below. However, your input greatly needed, now or
> once I put out a strawman.
> 4. I've also mentioned some specific places I will modify the spec.
> 5. I've also asked for anyone's input in many areas. And of course welcome it
> in any area.
> 6. And I've also asked for clarification from David.
> Please keep in mind that I am from the user side, so please understand
> ignorant statements/questions regarding obvious system (and other) issues. I
> hope this is only to the extent that it encourages you to correct rather than
> ignore :) And finally, I've been up 72 hours with about 6 dedicated to sleep,
> and since I am not nearly as resilient as Rich Graham, I am likely incoherent.
> (Obviously I lean heavily on excuses:)
> Richard
>> -----Original Message-----
>> From: Solt, David George
>> Sent: Tuesday, September 01, 2009 10:36 AM
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> Subject: [Mpi3-ft] Questions about the the Proposed API document
>> Hi all,
>> Admittedly, I have missed a lot of discussion during recent months, so feel
>> free to ignore questions that have already been answered.
>> Thanks,
>> Dave
>>         a)  How is this different from using MPI_ERRORHANDLER_CREATE and
> There appears to be (close-to-) consensus that the current error handling
> functionality does, or should be able to, provide the mechanisms we are after.
> It seems the intent may not be specific to our work, but perhaps could be
> "modified" to allow for the FT functionality we've been discussing. For
> example, the "Advice to Implementers" in the MPI error handling section (p277)
> talks about FT: "A good quality implementation will, to the greatest possible
> extent, circumscribe the impact of an error, so that normal processing can
> continue after an error handler was invoked."
> However, as currently written the MPI error handler seems designed to
> eliminate user participation in the process: "The purpose of these error
> handlers is to allow a user to issue user-defined error messages and to take
> action unrelated to MPI (such as flushing buffers) before a program exits."
> (p277) Otoh, there is text that seems to muddle that notion. Anyway, the
> apparent goal of what we've written thus far is to allow the user to
> supplement error messages and handling internal to a FT mpi implementation.
> That is, among other things, the user has already attached FT attributes to
> communicator(s).
>> 1)  MPI_Comm_validate.
>>         a) Is it required that all callers return the same value for
>> failed_process_count and failed_ranks?
> This is collective, so all participants should receive the same values. That
> said, this points to a particular issue we need to address/discuss, ie
> additional failures that occur after a process returns from the call while
> other processes have not. Anyone have a paragraph or so for inclusion in an
> advice to users? And probably implementers as well?
>>         b) If ranks are partitioned into two groups A and B such that all
>> ranks in A can communicate and all ranks in B can communicate, but a rank in
>> A
>> cannot communicate with a rank in B, what should failed_ranks return for a
>> rank in A?  a rank in B?
> Does partition mean subcommunicators? If so, we're covered. If not, we're
> still covered :) If a rank cannot communicate with another, there is a
> failure, and it should be returned as such.
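A toy model of the partition question, in plain Python rather than MPI. It takes one literal reading of the reply above: each rank reports as failed every rank it cannot reach. The function name `failed_ranks_seen_by`, the `can_talk` predicate, and the two groups are hypothetical and only illustrate that reading; nothing here is the draft API.

```python
def failed_ranks_seen_by(rank, size, can_talk):
    """Ranks that `rank` cannot reach, given a symmetric reachability test."""
    return [other for other in range(size)
            if other != rank and not can_talk(rank, other)]


# Assumed example: ranks 0,1 form group A and ranks 2,3 form group B.
group_a = {0, 1}
group_b = {2, 3}


def can_talk(x, y):
    # Two ranks can communicate only if they sit in the same group.
    return (x in group_a) == (y in group_a)


print(failed_ranks_seen_by(0, 4, can_talk))  # [2, 3]
print(failed_ranks_seen_by(3, 4, can_talk))  # [0, 1]
```

Under this reading a rank in A and a rank in B report different lists, which may be worth weighing against the expectation in 1a that all participants receive the same values.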
>>         c) I was told that MPI_Comm_validate uses a phased system such that
>> the result of the call is based on the callers' states prior to the call or
>> at
>> the start of the call but with the understanding that the results are not
>> guaranteed to be accurate at the return of the call.   Is this accurate?  If
>> so, can you show an example of where this call would either simplify an
>> application code or allow for a recovery case that would not be possible
>> without it?

"Advice to Users" inserted. Still in need of an example, although it may be
better placed in a separate document.
> I believe this is correct, but need an example. Anyone?
>> 2) MPI_Comm_Ivalidate.
>>         a) Is there a way to wait on the resulting request in such a way that
>> you can access the failed_process_count and failed_ranks data?
> Oversight on our part - I will add the arrays so that the blocking and
> non-blocking apis are analogous to, for example, MPI_Recv and MPI_Irecv.

Done. Some cut and paste from the mpi_irecv section, so may still need

The alert reader will notice that I have (attempted to) replace the term
'process' with 'rank' when referring to it in that context. Eg

>> 3) MPI_Comm_restorable.
>>         a) Does this return count=0 for a rank that is already a member of
>> the
>> original application launched MPI_Comm_world?
>>         The following assumes that the answer to the above question is yes:
>> In order for this to have the data necessary, a "replacement" process must be
>> created through MPI_Comm_restore (i.e. users can't bring their own
>> singletons
>> into existence through a scheduler, etc.)
> Sounds like a good idea. But are we missing something? Can you please
> elaborate?
>> 4) MPI_Comm_rejoin.
>>         a) Is this intended only to be used by a process that was not
>> previously a member of comm_names and the caller replaces an exited rank that
>> was a member of comm_names?
> Yes.
>>         b) Must MPI_Comm_rejoin and MPI_Comm_restore be used in matching way
>> between existing ranks and newly created ranks?  If ranks A and B call
>> MPI_Comm_restore, which creates a new replacement rank C, will the call to
>> MPI_Comm_restore hang until MPI_Comm_rejoin is called by C?
> Greg: Comm_rejoin called by any rank process - too generic. If I have a
> process created by comm_restore, can rejoin. Restore is about existence,
> rejoin about communicators. By default, restore gets a process into
> mpi_comm_world and comm_self. Otherwise, need to rejoin other comms. I'll try
> to make this clear in the text.

I corrected my text based on Rainer's discussion text - he is certainly
correct, imo, that the user of the communicator must be limited to
point-to-point communication unless and until the communicator is fully

>> 5) MPI_Comm_restore.
>>         a) Does this create processes (I have assumed so in Q#4b above)?   If
>> so, I suggest that we learn from the problem with MPI_Comm_spawn from MPI-2
>> that interaction with a scheduler should be considered as we develop the API.
> We need to better understand the comm_spawn issue. Current schedulers won't
> let you change size of resource allocated. Need "Advice to Users" in
> MPI_Comm_spawn section as well as FT in order to "alert" the user to the
> potential issues.

Misc: I moved the "Advice to implementors" up from the non-blocking to the
first function involved in restoration, since the advice applies to _all_
such functionality. Agreed?

>> 6) MPI_Comm_proc_gen/MPI_Comm_gen
>>         a) The name MPI_Comm_proc_gen seems like it should be MPI_Proc_gen.
>> I see that all other routines are prefixed with MPI_Comm_, but I think that
>> they all genuinely involve aspects of a communicator except for this one.
> Agreed, it is not as much about communicators as other routines. So where
> should this then go? "Process Creation and Mgmt" seems appropriate, but that
> section seems ignored by most, and besides, the statement is that "If
> restored by MPI fault tolerance capabilities, the process generation is
> incremented by one. The initial generation is zero." To me then this would be
> useful for tracking the fault and resilience history of a particular run. From
> that it then makes sense to me to include it in the FT chapter since it is
> really a "utility" for that functionality. Comments?
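A minimal sketch of the generation rule quoted above (initial generation zero, incremented by one per restore), written as a plain Python model rather than MPI code. The class `GenerationLog` and its methods are hypothetical names; only the counting rule comes from the draft text.

```python
class GenerationLog:
    """Tracks how many times each rank's process has been restored."""

    def __init__(self):
        self._gen = {}

    def generation(self, rank):
        # A rank that has never been restored is at generation zero.
        return self._gen.get(rank, 0)

    def restore(self, rank):
        # Each restoration increments the generation by exactly one.
        self._gen[rank] = self.generation(rank) + 1
        return self._gen[rank]

    def history(self):
        # Only ranks that were restored at least once, in rank order.
        return dict(sorted(self._gen.items()))


log = GenerationLog()
log.restore(2)
log.restore(2)
log.restore(0)
print(log.generation(1), log.generation(2))  # 0 2
print(log.history())                         # {0: 1, 2: 2}
```

A record like `history()` is the kind of per-run fault and resilience history the reply has in mind.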
>>         b) Ranks 0,1,2 are part of a restorable communicator C.  Rank 2 dies.
>> Rank 0 calls MPI_Comm_restore.   Is rank 1 obligated to make any calls before
>> using communicator C successfully?   What will MPI_Comm_proc_gen return for
>> rank 0? 1?  2?   What will MPI_Comm_gen return for rank 0? 1? 2?
> I believe the immediate above discussion addresses the latter two questions.
> But does point to the need to strengthen the text (which Rainer calls for in
> the discussion text), which I'll try to do. Perhaps enough to say that this is
> a local call? Regarding the first question, this is part of the, should I say,
> "race condition" issue discussed above? The communicator is "broken", which
> rank 1 will be informed of when it makes a call to the communicator. So my
> question then: MPI_Comm_restore is collective, and therefore all processes in
> the communicator must call it before any can return. What if a rank rarely
> (legal) if ever (bad code) actually contacts that rank? But wait, that's
> certainly the situation for MPI_COMM_WORLD.
>> 7) General question:
>>         a) If rank x fails to communicate using point-to-point communication
>> to rank y over communicator C, is it guaranteed that any collective call made
>> by rank x or y on communicator C will immediately fail (even if the path
>> between x and y is not used for the collective)?  (or is it up to the
>> implementation)
> Sounds like an undefined situation (i.e. A rank is a member of a communicator
> across which it doesn't communicate). Implementation dependent behavior, then?
> Which calls for an "Advice to U/I".
>> Some more questions for us to think about.  It is quite possible that I have
>> some fundamental flaws in my thinking that make some, many or all of these
>> questions invalid.  So, I ask that if anyone sees a basic fallacy in my view
>> of how these calls are intended to work that you point that out to me first
>> and I can review my questions and see if there are still issues that do not
>> make sense to me.
> My view is that any confusion you might have is certainly from an educated
> perspective, so must be addressed/clarified.
>> 8) If MPI_Comm_rejoin passes MPI_COMM_WORLD, then does it really change its
>> size of MPI_COMM_WORLD?
> My understanding is that we don't want to "replace" ranks unless and until the
> user makes it so. So new ranks may append to the set of ranks, which included
> failed ones. But MPI_COMM_SIZE will return the number of active ranks. Did I
> say this correctly?

Per Greg's correction and discussions with Rich, MPI_Comm_size continues to
behave as it should :) Ie it will return the number of ranks from the
originally instantiated communicator. But should we have a function that
returns the number of active ranks? As it is, it is up to the
application to maintain this information, perhaps as it should be. Comments?
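A toy model of the application-side bookkeeping described above, in plain Python rather than MPI. The class `RankTracker` and its methods are hypothetical and not part of any draft API. The only behavior taken from the discussion is that the communicator size stays fixed at instantiation while the application keeps its own count of active ranks.

```python
class RankTracker:
    """Application-side record of which ranks are believed to be alive."""

    def __init__(self, comm_size):
        # Fixed at creation, like the value MPI_Comm_size keeps returning.
        self.comm_size = comm_size
        self.failed = set()

    def _check(self, rank):
        if not 0 <= rank < self.comm_size:
            raise ValueError(f"rank {rank} is outside 0..{self.comm_size - 1}")

    def mark_failed(self, rank):
        self._check(rank)
        self.failed.add(rank)

    def mark_restored(self, rank):
        self._check(rank)
        self.failed.discard(rank)

    def active_count(self):
        return self.comm_size - len(self.failed)


tracker = RankTracker(4)
tracker.mark_failed(2)
print(tracker.comm_size, tracker.active_count())  # 4 3
tracker.mark_restored(2)
print(tracker.comm_size, tracker.active_count())  # 4 4
```

Whether a library call should replace this kind of bookkeeping is exactly the open question; the sketch only shows how little state the application would need to carry.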
>> 9) Why does MPI_Comm_restore take an array of ranks to restore?   Shouldn't
>> the restored ranks be based on MPI_PROC_RESTORE_POLICY? Or maybe a better way
>> to ask: "What call does MPI_PROC_RESTORE_POLICY influence?"
> Calls for a "Rationale" or "Advice" text imo. I don't believe we want an API
> that takes an array in one case and a scalar in another. Analogous to
> transmission of data. But again, am I understanding the question?
>> 10) The API document says this regarding MPI_Comm_Irestore: "It is local in
>> scope, and thus restores local communications (point-to-point, one-sided,
>> data-type creation, etc.), but not collective communications."  If this is
>> true, then how do you restore collective communications?   Can you then go on
>> to collectively call MPI_Comm_restore_all?   If you do, would every rank need
>> to specify that no new ranks are to be created since they have already been
>> created by the earlier call to MPI_Comm_restore?  Also, I don't think it is
>> *really* local in scope, if it was, there would be no reason to have a
>> non-blocking version.
> I expect much of this will be clarified with a stronger discussion of R^3
> (mentioned above). But the non-blocking version, to my understanding, is designed
> to allow execution to continue while restoration takes place. For example, a Monte
> Carlo code can simply say, "that process is not there so I will process
> without its data. But I would like 'it' back if possible, so I'll check later."
>> 11) MPI_COMM_REJOIN - It seems like the resulting communicator should be
>> collective-capable if the calling process was created through a call to
>> MPI_Comm_restore_all and not collective-capable if created through a call to
>> MPI_Comm_restore?  If we go with that, there should be a way for the caller
>> of
>> MPI_Comm_rejoin to know the status of the communicator with respect to
>> collectives.
> Another that will become clear with a stronger discussion of R^3 (he says
> optimistically, or is that lackadaisically? :)
>> 12) MPI_Comm_restore specifies ranks of communicators that should be
>> restored.
>> I assume it will block until the restored ranks call MPI_Comm_rejoin on those
>> communicators?   (I say that because of the line "[MPI_Comm_restore ] is
>> local
>> in scope, and thus restores local communications...".  Restoring local
>> communications to who?   I assume to the process created by MPI_Comm_restore?
>> If it does not block, how does it know when it can safely talk to the
>> restored
>> ranks using any of the communicators they are in?   So, I assume it blocks.
>> That seems to imply that the restored rank MUST call MPI_Comm_rejoin on the
>> communicator referenced in its creation.   If rank 0 calls MPI_Comm_restore()
>> and passes in Rank 3 of communicator FOO, then the restored process must call
>> MPI_Comm_rejoin on communicator FOO.  But when the restored rank 3 calls
>> MPI_Comm_recoverable, it could return several communicators and rank 3 has no
>> idea that it MUST call MPI_Comm_rejoin on some, but is not required to call
>> MPI_Comm_rejoin on others?
> If it blocks, the process can't call MPI_Comm_rejoin. So do we have a
> contradiction in the spec? Here's to hoping that clarification of R^3 makes
> this question go away :)
>> 13) What does MPI_COMM_WORLD look like before the new process calls
>> MPI_COMM_REJOIN.  If the process was created through a call to
>> MPI_Comm_restore that specified multiple ranks to be restored, are all of
>> those ranks together in an MPI_COMM_WORLD until they call MPI_Comm_rejoin?
>> Is
>> the MPI_Comm_rejoin call collective across all of those newly created
>> processes or can they all call one at a time at their leisure?
> A "broken" communicator will have invalid ranks, revealed to the calling
> process when it attempts to use the rank, and in the manner specified by the
> FT configuration (default or user specified). But again, the R^3 text will
> have to clear this up. Also, see the "Discussion" text in the spec for an
> additional issue. And regarding MPI_COMM_WORLD (and MPI_COMM_SELF), these are
> by default intended to be restored (from a local view). Hmmm, that statement
> confuses even me. So he punts back to R^3.
>> 14) Is there anything we are proposing with MPI_Comm_rejoin/restore that
>> cannot be accomplished with MPI_Comm_spawn, MPI_Comm_merge?  The only thing I
>> can think of is that MPI_COMM_WORLD cannot be "fixed" using
>> MPI_Comm_spawn/merge, but only because it is a constant.
> My understanding is that we are attempting to bridge the gap between an
> "invisibly" fault tolerant implementation and a fully user controlled scheme,
> where that gap may be small (or non-existent?) to large.
>> 15) ranks_to_restore struct is not defined in the version of API I have.
> I've confused myself with my attempt to be clever. Is this the "Who's on first
> and what's on second" discussion (note Abbott and Costello reference), meaning
> I didn't understand either so gave placeholder names. Does your copy include
> this discussion?
>> 16) MPI_Comm_restore seems to be based on the idea that some ranks have
>> exited.   What if rank A cannot talk to rank B, but rank B still exists and
>> can talk to rank C?  What does it mean to restore a rank in this case?  None
>> of the ranks are gone, they are just having communication problems.   It
>> seems
>> like there should be some way to come up with a failure free set of ranks
>> such
>> that all the ranks in the set can communicate across all process pairs.
> My understanding is that a process in a communicator that cannot communicate
> with all processes in that communicator indicates a fault. But who is at fault
> may be the appropriate question that you are asking. R^3 discussion? Your last
> sentence, however, seems to me to point to a new communicator the user would
> have to create.
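One way to picture the "failure free set of ranks" from question 16 is a greedy pass that keeps a rank only if it can talk to every rank already kept. This is a sketch in plain Python under assumptions, not a proposal for the API: the function name `failure_free_set` and the `broken_links` input are hypothetical, and the greedy choice depends on rank order, so it yields a valid set but not necessarily the largest one.

```python
def failure_free_set(size, broken_links):
    """Greedily pick ranks such that every chosen pair can communicate."""
    # Links are undirected, so store each pair as an unordered frozenset.
    broken = {frozenset(pair) for pair in broken_links}
    chosen = []
    for rank in range(size):
        if all(frozenset((rank, member)) not in broken for member in chosen):
            chosen.append(rank)
    return chosen


# Rank 0 (A) cannot talk to rank 1 (B); rank 1 can still talk to rank 2 (C).
print(failure_free_set(3, [(0, 1)]))  # [0, 2]
```

Here `[1, 2]` would be an equally valid answer, which echoes the "who is at fault" point: the set chosen depends on which side of the broken link is kept. The ranks in the result are the ones a user would place in the new communicator.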
>> 17) Ranks 0, 1, & 2 are in Comm FOO. Rank 2 dies.   Rank 0 calls
>> MPI_Comm_restore({FOO,2}) and can now communicate with 2 once again using
>> point-to-point calls?   Is there a way that 1 can ever restore communication
>> to the new rank 2?   I believe the only way is that all ranks (including the
>> new rank 2) collectively call MPI_Comm_restore({})?  I'm not sure that is a
>> problem, but I wanted to check my understanding of how these calls work.
> The first answer that popped into my head contradicts something I said
> above (ie the communicator is broken). So R^3?
> Just my pass at addressing a broad set of issues. Please, please, please,
> don't try to spare my feelings, just view this as the start of what should be
> a strong discussion.
> Richard

  Richard Barrett
  Application Performance Tools group
  Computer Science and Mathematics Division
  Oak Ridge National Laboratory


-------------- next part --------------
A non-text attachment was scrubbed...
Name: api_doc.pdf
Type: application/octet-stream
Size: 145620 bytes
Desc: api_doc.pdf
URL: <http://lists.mpi-forum.org/pipermail/mpiwg-ft/attachments/20091016/69e64bdb/attachment.obj>
