[Mpi3-ft] <no subject>

Barrett, Richard F. rbarrett at ornl.gov
Wed Oct 7 11:33:45 CDT 2009

> -----Original Message-----
> From: Solt, David George
> Sent: Tuesday, September 01, 2009 10:36 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: [Mpi3-ft] Questions about the Proposed API document
> Hi all,
> Admittedly, I have missed a lot of discussion during recent months, so feel
> free to ignore questions that have already been answered.
> Thanks,
> Dave
>         a)  How is this different from using MPI_ERRORHANDLER_CREATE and
> 1)  MPI_Comm_validate.
>         a) Is it required that all callers return the same value for
> failed_process_count and failed_ranks?

>         b) If ranks are partitioned into two groups A and B such that all
> ranks in A can communicate and all ranks in B can communicate, but a rank in A
> cannot communicate with a rank in B, what should failed_ranks return for a
> rank in A?  a rank in B?
>         c) I was told that MPI_Comm_validate uses a phased system such that
> the result of the call is based on the callers' states prior to the call or at
> the start of the call but with the understanding that the results are not
> guaranteed to be accurate at the return of the call.   Is this accurate?  If
> so, can you show an example of where this call would either simplify an
> application code or allow for a recovery case that would not be possible
> without it?
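
One way to make question 1b concrete is a small simulation of a purely local view, in which each caller reports exactly the ranks it cannot reach. This is only a sketch of one possible interpretation, not what the proposed API specifies: the helper local_failed_ranks, the reach matrix, and NRANKS are hypothetical names introduced for illustration, and no MPI calls are involved.

```c
#include <assert.h>

#define NRANKS 4

/* Hypothetical helper, not part of the proposed API.
 * reach[i][j] != 0 means rank i can communicate with rank j.
 * Writes every rank that `self` cannot reach into failed_ranks and
 * returns how many ranks were written. */
static int local_failed_ranks(int reach[NRANKS][NRANKS], int self,
                              int failed_ranks[NRANKS])
{
    int count = 0;

    assert(self >= 0 && self < NRANKS);
    for (int r = 0; r < NRANKS; r++) {
        if (r != self && !reach[self][r]) {
            failed_ranks[count++] = r;
        }
    }
    return count;
}
```

With group A = {0, 1} and group B = {2, 3} partitioned from each other, a caller in A would get {2, 3} and a caller in B would get {0, 1} under this interpretation, so the answer to 1a would be "no" unless the call performs some agreement step on top of the local view.
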
> 2) MPI_Comm_Ivalidate.
>         a) Is there a way to wait on the resulting request in such a way that
> you can access the failed_process_count and failed_ranks data?
> 3) MPI_Comm_restorable.
>         a) Does this return count=0 for a rank that is already a member of the
> original application launched MPI_Comm_world?
>         The following assumes that the answer to the above question is yes:
> In order for this to have the data necessary, a "replacement" process must be
> created through MPI_Comm_restore (i.e. users can't bring their own singletons
> into existence through a scheduler, etc.)
> 4) MPI_Comm_rejoin.
>         a) Is this intended only to be used by a process that was not
> previously a member of comm_names and the caller replaces an exited rank that
> was a member of comm_names?
>         b) Must MPI_Comm_rejoin and MPI_Comm_restore be used in a matching way
> between existing ranks and newly created ranks?  If ranks A and B call
> MPI_Comm_restore, which creates a new replacement rank C, will the call to
> MPI_Comm_restore hang until MPI_Comm_rejoin is called by C?
> 5) MPI_Comm_restore.
>         a) Does this create processes (I have assumed so in Q#4b above)?   If
> so, I suggest that we learn from the problem with MPI_Comm_spawn from MPI-2
> that interaction with a scheduler should be considered as we develop the API.
> 6) MPI_Comm_proc_gen/MPI_Comm_gen
>         a) The name MPI_Comm_proc_gen seems like it should be MPI_Proc_gen.
> I see that all other routines are prefixed with MPI_Comm_, but I think that
> they all genuinely involve aspects of a communicator except for this one.
>         b) Ranks 0,1,2 are part of a restorable communicator C.  Rank 2 dies.
> Rank 0 calls MPI_Comm_restore.   Is rank 1 obligated to make any calls before
> using communicator C successfully?   What will MPI_Comm_proc_gen return for
> rank 0? 1?  2?   What will MPI_Comm_gen return for rank 0? 1? 2?
> 7) General question:
>         a) If rank x fails to communicate using point-to-point communication
> to rank y over communicator C, is it guaranteed that any collective call made
> by rank x or y on communicator C will immediately fail (even if the path
> between x and y is not used for the collective)?  (or is it up to the
> implementation)
> Some more questions for us to think about.  It is quite possible that I have
> some fundamental flaws in my thinking that make some, many or all of these
> questions invalid.  So, I ask that if anyone sees a basic fallacy in my view
> of how these calls are intended to work that you point that out to me first
> and I can review my questions and see if there are still issues that do not
> make sense to me.
> 8) If MPI_Comm_rejoin passes MPI_COMM_WORLD, then does it really change the
> size of MPI_COMM_WORLD?
> 9) Why does MPI_Comm_restore take an array of ranks to restore?   Shouldn't
> the restored ranks be based on MPI_PROC_RESTORE_POLICY? Or maybe a better way
> to ask: "What call does MPI_PROC_RESTORE_POLICY influence?"
> 10) The API document says this regarding MPI_Comm_Irestore: "It is local in
> scope, and thus restores local communications (point-to-point, one-sided,
> data-type creation, etc.), but not collective communications."  If this is
> true, then how do you restore collective communications?   Can you then go on
> to collectively call MPI_Comm_restore_all?   If you do, would every rank need
> to specify that no new ranks are to be created since they have already been
> created by the earlier call to MPI_Comm_restore?  Also, I don't think it is
> *really* local in scope; if it were, there would be no reason to have a
> non-blocking version.
> 11) MPI_COMM_REJOIN - It seems like the resulting communicator should be
> collective-capable if the calling process was created through a call to
> MPI_Comm_restore_all and not collective-capable if created through a call to
> MPI_Comm_restore?  If we go with that, there should be a way for the caller of
> MPI_Comm_rejoin to know the status of the communicator with respect to
> collectives.
> 12) MPI_Comm_restore specifies ranks of communicators that should be restored.
> I assume it will block until the restored ranks call MPI_Comm_rejoin on those
> communicators?   (I say that because of the line "[MPI_Comm_restore ] is local
> in scope, and thus restores local communications...".  Restoring local
> communications to who?   I assume to the process created by MPI_Comm_restore?
> If it does not block, how does it know when it can safely talk to the restored
> ranks using any of the communicators they are in?   So, I assume it blocks.
> That seems to imply that the restored rank MUST call MPI_Comm_rejoin on the
> communicator referenced in its creation.   If rank 0 calls MPI_Comm_restore()
> and passes in Rank 3 of communicator FOO, then the restored process must call
> MPI_Comm_rejoin on communicator FOO.  But when the restored rank 3 calls
> MPI_Comm_recoverable, it could return several communicators and rank 3 has no
> idea that it MUST call MPI_Comm_rejoin on some, but is not required to call
> MPI_Comm_rejoin on others?
> 13) What does MPI_COMM_WORLD look like before the new process calls
> MPI_COMM_REJOIN?  If the process was created through a call to
> MPI_Comm_restore that specified multiple ranks to be restored, are all of
> those ranks together in an MPI_COMM_WORLD until they call MPI_Comm_rejoin?  Is
> the MPI_Comm_rejoin call collective across all of those newly created
> processes or can they all call one at a time at their leisure?
> 14) Is there anything we are proposing with MPI_Comm_rejoin/restore that
> cannot be accomplished with MPI_Comm_spawn and MPI_Intercomm_merge?  The only
> thing I can think of is that MPI_COMM_WORLD cannot be "fixed" using
> MPI_Comm_spawn/merge, but only because it is a constant.
> 15) The ranks_to_restore struct is not defined in the version of the API I have.
> 16) MPI_Comm_restore seems to be based on the idea that some ranks have
> exited.   What if rank A cannot talk to rank B, but rank B still exists and
> can talk to rank C?  What does it mean to restore a rank in this case?  None
> of the ranks are gone, they are just having communication problems.   It seems
> like there should be some way to come up with a failure free set of ranks such
> that all the ranks in the set can communicate across all process pairs.
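
The "failure free set" idea in question 16 can be sketched as picking ranks so that every pair in the set can communicate in both directions. The following is a minimal greedy sketch under stated assumptions: failure_free_set, can_talk, and NPROCS are hypothetical names, the connectivity matrix is assumed to be known to the caller already, and the greedy pass does not guarantee the largest possible set. Nothing here is part of the proposed API.

```c
#include <assert.h>

#define NPROCS 3

/* Hypothetical helper, not part of the proposed API.
 * can_talk[i][j] != 0 means rank i can send to rank j.
 * Scans ranks in increasing order and keeps a rank only if it can
 * communicate in both directions with every rank kept so far.
 * Sets in_set[r] to 1 for kept ranks, 0 otherwise; returns the set size. */
static int failure_free_set(int can_talk[NPROCS][NPROCS], int in_set[NPROCS])
{
    int size = 0;

    assert(can_talk != 0 && in_set != 0);
    for (int r = 0; r < NPROCS; r++) {
        int ok = 1;
        for (int m = 0; m < r; m++) {
            if (in_set[m] && !(can_talk[r][m] && can_talk[m][r])) {
                ok = 0;
                break;
            }
        }
        in_set[r] = ok;
        size += ok;
    }
    return size;
}
```

Taking A, B, C as ranks 0, 1, 2, with A and B unable to talk but both able to reach C (the A-to-C link is an assumption, since the question does not state it), the greedy pass keeps A and C and drops B. Starting the scan from a different rank would keep B and C instead, which shows that every process would have to agree on the same ordering or on the resulting set.
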
> 17) Ranks 0, 1, & 2 are in Comm FOO. Rank 2 dies.   Rank 0 calls
> MPI_Comm_restore({FOO,2}) and can now communicate with 2 once again using
> point-to-point calls?   Is there a way that 1 can ever restore communication
> to the new rank 2?   I believe the only way is that all ranks (including the
> new rank 2) collectively call MPI_Comm_restore({})?  I'm not sure that is a
> problem, but I wanted to check my understanding of how these calls work.
