[Mpi3-ft] Questions about the Proposed API document

Solt, David George david.solt at hp.com
Thu Sep 3 16:24:13 CDT 2009

Some more questions for us to think about.  It is quite possible that I have some fundamental flaws in my thinking that make some, many, or all of these questions invalid.  So, if anyone sees a basic fallacy in my view of how these calls are intended to work, please point that out to me first, and I can review my questions and see whether there are still issues that do not make sense to me.

8) If MPI_Comm_rejoin passes MPI_COMM_WORLD, does it really change the size of MPI_COMM_WORLD?

9) Why does MPI_Comm_restore take an array of ranks to restore?   Shouldn't the restored ranks be based on MPI_PROC_RESTORE_POLICY?  Or, maybe a better way to ask: which call does MPI_PROC_RESTORE_POLICY influence?

10) The API document says this regarding MPI_Comm_Irestore: "It is local in scope, and thus restores local communications (point-to-point, one-sided, data-type creation, etc.), but not collective communications."  If this is true, how do you restore collective communications?   Can you then go on to collectively call MPI_Comm_restore_all?   If you do, would every rank need to specify that no new ranks are to be created, since they were already created by the earlier call to MPI_Comm_restore?  Also, I don't think it is *really* local in scope; if it were, there would be no reason to have a non-blocking version.

11) MPI_Comm_rejoin - It seems like the resulting communicator should be collective-capable if the calling process was created through a call to MPI_Comm_restore_all, and not collective-capable if it was created through a call to MPI_Comm_restore.  If we go with that, there should be a way for the caller of MPI_Comm_rejoin to learn the status of the communicator with respect to collectives.

12) MPI_Comm_restore specifies ranks of communicators that should be restored.   I assume it will block until the restored ranks call MPI_Comm_rejoin on those communicators?   (I say that because of the line "[MPI_Comm_restore] is local in scope, and thus restores local communications...".  Restoring local communications to whom?   I assume to the process created by MPI_Comm_restore.)  If it does not block, how does it know when it can safely talk to the restored ranks using any of the communicators they are in?   So, I assume it blocks.  That seems to imply that the restored rank MUST call MPI_Comm_rejoin on the communicator referenced in its creation.   If rank 0 calls MPI_Comm_restore() and passes in rank 3 of communicator FOO, then the restored process must call MPI_Comm_rejoin on communicator FOO.  But when the restored rank 3 calls MPI_Comm_restorable, it could return several communicators, and rank 3 has no way of knowing that it MUST call MPI_Comm_rejoin on some of them but is not required to call it on others.
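To make the asymmetry in question 12 concrete, here is a sketch of the scenario written against the proposed API.  The signatures (the rank/communicator pair passed to MPI_Comm_restore, and the names returned by MPI_Comm_restorable) are my guesses, not taken from the API document:

```c
/* Hypothetical sketch only; all signatures below are assumed. */

/* Surviving rank 0: ask for rank 3 of communicator FOO to be restored. */
MPI_Rank_pair to_restore = { /* comm = */ "FOO", /* rank = */ 3 };
MPI_Comm_restore(&to_restore, 1);    /* presumably blocks until rejoin? */

/* Restored process: discover which communicators it can rejoin. */
int count;
char **comm_names;
MPI_Comm_restorable(&count, &comm_names);
for (int i = 0; i < count; i++) {
    MPI_Comm c;
    /* Which of these rejoins is mandatory for rank 0's restore to
     * complete, and which are optional?  Nothing returned here says. */
    MPI_Comm_rejoin(comm_names[i], &c);
}
```

The restored process only sees the list from MPI_Comm_restorable; it has no channel telling it which entry unblocks its creator.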

13) What does MPI_COMM_WORLD look like before the new process calls MPI_Comm_rejoin?  If the process was created through a call to MPI_Comm_restore that specified multiple ranks to be restored, are all of those ranks together in an MPI_COMM_WORLD until they call MPI_Comm_rejoin?  Is the MPI_Comm_rejoin call collective across all of the newly created processes, or can they each call it at their leisure?

14) Is there anything we are proposing with MPI_Comm_rejoin/MPI_Comm_restore that cannot be accomplished with MPI_Comm_spawn and MPI_Intercomm_merge?  The only thing I can think of is that MPI_COMM_WORLD cannot be "fixed" using spawn/merge, but only because it is a constant.
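For comparison, the existing MPI-2 route that question 14 alludes to would look roughly like this (a fragment, error handling omitted; "app" and the survivor communicator are placeholders).  It rebuilds a communicator of the original size, but cannot repair MPI_COMM_WORLD itself:

```c
/* Survivors collectively spawn replacements, then merge the
 * resulting intercommunicator into a new intracommunicator. */
MPI_Comm survivors;      /* communicator holding the surviving ranks */
MPI_Comm inter, rebuilt;
int n_failed = 1;        /* number of processes to replace */

MPI_Comm_spawn("app", MPI_ARGV_NULL, n_failed, MPI_INFO_NULL,
               0 /* root */, survivors, &inter, MPI_ERRCODES_IGNORE);
MPI_Intercomm_merge(inter, 0 /* survivors ordered low */, &rebuilt);

/* 'rebuilt' plays the role of the restored communicator, but
 * MPI_COMM_WORLD is unchanged -- exactly the limitation noted above. */
```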

15) The ranks_to_restore struct is not defined in the version of the API document I have.

16) MPI_Comm_restore seems to be based on the idea that some ranks have exited.   What if rank A cannot talk to rank B, but rank B still exists and can talk to rank C?  What does it mean to restore a rank in this case?  None of the ranks are gone; they are just having communication problems.   It seems like there should be some way to come up with a failure-free set of ranks such that all ranks in the set can communicate across all process pairs.

17) Ranks 0, 1, and 2 are in comm FOO.  Rank 2 dies.   Rank 0 calls MPI_Comm_restore({FOO,2}) and can now communicate with 2 once again using point-to-point calls?   Is there any way that 1 can ever restore communication to the new rank 2?   I believe the only way is for all ranks (including the new rank 2) to collectively call MPI_Comm_restore({})?  I'm not sure that is a problem, but I wanted to check my understanding of how these calls work.
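My reading of question 17 as a call sequence, again with the proposed-API signatures hypothesized rather than taken from the document:

```c
/* Hypothetical sketch.  Ranks 0, 1, 2 in FOO; rank 2 has died. */

/* Rank 0 only: */
MPI_Rank_pair dead = { "FOO", 2 };
MPI_Comm_restore(&dead, 1);    /* rank 0 <-> new rank 2 now works
                                  for point-to-point, per the document */

/* Rank 1 took no action, so it still cannot reach the new rank 2.
 * My guess is that the only remedy is a collective restore in which
 * everyone asks for zero new ranks: */
MPI_Comm_restore(NULL, 0);     /* called by 0, 1, and the new rank 2? */
```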


-----Original Message-----
From: Solt, David George 
Sent: Tuesday, September 01, 2009 10:36 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: [Mpi3-ft] Questions about the Proposed API document

Hi all,

Admittedly, I have missed a lot of discussion during recent months, so feel free to ignore questions that have already been answered.  



	a)  How is this different from using MPI_Comm_create_errhandler and MPI_Comm_set_errhandler?
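For reference, the existing error-handler mechanism this question compares against looks like the following (standard MPI-2 calls; the recovery logic inside the handler is of course application-specific):

```c
#include <mpi.h>
#include <stdio.h>

/* Custom handler invoked when a call on the communicator fails. */
static void on_comm_error(MPI_Comm *comm, int *errcode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len;
    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "communicator error: %s\n", msg);
    /* recovery or cleanup logic would go here */
}

int main(int argc, char **argv)
{
    MPI_Errhandler eh;
    MPI_Init(&argc, &argv);
    MPI_Comm_create_errhandler(on_comm_error, &eh);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
    /* ... communication that now reports errors instead of aborting ... */
    MPI_Errhandler_free(&eh);
    MPI_Finalize();
    return 0;
}
```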

1)  MPI_Comm_validate.  

	a) Is it required that all callers return the same value for failed_process_count and failed_ranks?

	b) If ranks are partitioned into two groups A and B such that all ranks in A can communicate and all ranks in B can communicate, but a rank in A cannot communicate with a rank in B, what should failed_ranks return for a rank in A?  a rank in B?

	c) I was told that MPI_Comm_validate uses a phased system, such that the result of the call is based on the callers' states prior to the call or at the start of the call, with the understanding that the results are not guaranteed to be accurate at the return of the call.   Is this accurate?  If so, can you show an example where this call would either simplify application code or allow for a recovery case that would not be possible without it?

2) MPI_Comm_Ivalidate.  

	a) Is there a way to wait on the resulting request in such a way that you can access the failed_process_count and failed_ranks data?

3) MPI_Comm_restorable.
	a) Does this return count=0 for a rank that is already a member of the originally launched MPI_COMM_WORLD?
	The following assumes the answer to the above question is yes: in order for this call to have the necessary data, a "replacement" process must be created through MPI_Comm_restore (i.e., users can't bring their own singletons into existence through a scheduler, etc.)

4) MPI_Comm_rejoin.
	a) Is this intended only to be used by a process that was not previously a member of comm_names, where the caller replaces an exited rank that was a member of comm_names?

	b) Must MPI_Comm_rejoin and MPI_Comm_restore be used in a matching way between existing ranks and newly created ranks?  If ranks A and B call MPI_Comm_restore, which creates a new replacement rank C, will the call to MPI_Comm_restore hang until MPI_Comm_rejoin is called by C?

5) MPI_Comm_restore.

	a) Does this create processes (I have assumed so in Q#4b above)?   If so, I suggest we learn from the problems with MPI_Comm_spawn in MPI-2: interaction with a scheduler should be considered as we develop the API.

6) MPI_Comm_proc_gen/MPI_Comm_gen

	a) The name MPI_Comm_proc_gen seems like it should be MPI_Proc_gen.   I see that all other routines are prefixed with MPI_Comm_, but I think they all genuinely involve aspects of a communicator except this one.

	b) Ranks 0,1,2 are part of a restorable communicator C.  Rank 2 dies.  Rank 0 calls MPI_Comm_restore.   Is rank 1 obligated to make any calls before using communicator C successfully?   What will MPI_Comm_proc_gen return for rank 0? 1?  2?   What will MPI_Comm_gen return for rank 0? 1? 2?

7) General question:  

	a) If rank x fails to communicate using point-to-point communication with rank y over communicator C, is it guaranteed that any collective call made by rank x or y on communicator C will immediately fail (even if the path between x and y is not used for the collective)?  Or is it up to the implementation?

