[Mpi3-ft] mpi3-ft Digest, Vol 20, Issue 2

Mon Oct 12 14:30:10 CDT 2009

> > b) Ranks 0,1,2 are part of a restorable communicator C.  Rank 2 dies.
> > Rank 0 calls MPI_Comm_restore.   Is rank 1 obligated to make any calls before
> > using communicator C successfully?   What will MPI_Comm_proc_gen return for
> > rank 0? 1?  2?   What will MPI_Comm_gen return for rank 0? 1? 2?
>
>I believe the immediate above discussion addresses the latter two 
>questions. But does point to the need to strengthen the text (which 
>Rainer calls for in the discussion text), which I'll try to do. 
>Perhaps enough to say that this is a local call? Regarding the first 
>question, this is part of the, should I say, "race condition" issue 
>discussed above? The communicator is "broken", which rank 1 will be 
>informed of when it makes a call to the communicator. So my question
>then: MPI_Comm_restore is collective, and therefore all processes in 
>the communicator must call it before any can return. What if a rank 
>rarely (legal) if ever (bad code) actually contacts that rank? But 
>wait, that's certainly the situation for MPI_COMM_WORLD.

The way I was thinking about this situation is that MPI_Comm_restore 
is not collective and once a process knows that another has called 
it, it does not have to call it on its own. However, at least one 
process must call MPI_Comm_restore in order for it to be restored and 
MPI_Comm_restore can be called any number of times by any number of 
processes with no ill effects.

The motivation behind this API is that we don't want to force all 
members of a communicator to be involved in recovery (definitely a 
bad idea for MPI_COMM_WORLD) but we need a way to control the fact 
that multiple processes may try to call MPI_Comm_restore 
simultaneously. The intuition is that process restoration is an 
idempotent operation for each failure that can be issued any number 
of times with the same effect. However, if a rank suffers from
multiple failures, this API may cause application processes to become 
confused about the restored process' generation. Fortunately, the 
application has two ways to avoid getting confused:
- coordinate the calls to MPI_Comm_restore on its own
- have the restored process query its own generation and send it to 
others when this information is needed
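
A toy model may make the idempotence argument concrete. The sketch below
is plain Python, not MPI: ToyRuntime, comm_restore and own_generation are
hypothetical names, and having the restore call return the generation is
an assumption made only for illustration. It shows that repeated restore
calls for one failure have the same effect, and that after a second
failure a process can hold a stale generation until the restored process
reports its own.

```python
# Toy model only: none of these names are real MPI calls.
class ToyRuntime:
    def __init__(self, nranks):
        self.alive = [True] * nranks
        self.generation = [0] * nranks  # bumped once per actual respawn

    def fail(self, rank):
        self.alive[rank] = False

    def comm_restore(self, rank):
        """Idempotent per failure: respawn only if the rank is currently dead."""
        if not self.alive[rank]:
            self.alive[rank] = True
            self.generation[rank] += 1
        return self.generation[rank]

    def own_generation(self, rank):
        """What the restored process would report about itself."""
        return self.generation[rank]


rt = ToyRuntime(3)
rt.fail(2)
seen_by_0 = rt.comm_restore(2)  # rank 0 restores rank 2
seen_by_1 = rt.comm_restore(2)  # rank 1 repeats the call: no extra respawn

rt.fail(2)                      # second failure of the same rank
seen_by_0_again = rt.comm_restore(2)
# Rank 1 still remembers seen_by_1 until the restored process tells it.
authoritative = rt.own_generation(2)
```

Here rank 1's remembered value lags behind after the second failure,
which is the confusion described above; asking the restored process for
its own generation resolves it.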

> > 8) If MPI_Comm_rejoin is passed MPI_COMM_WORLD, then does it really change the
> > size of MPI_COMM_WORLD?
>
>My understanding is that we don't want to "replace" ranks unless and 
>until the user makes it so. So new ranks may append to the set of 
>ranks, which includes failed ones. But MPI_COMM_SIZE will return the
>number of active ranks. Did I say this correctly?
I'm confused. I thought MPI_Comm_size would return the number of 
active and inactive ranks. It would be confusing otherwise since you 
could get 5 active ranks but a rank space that ranges from 0 to 9.
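
To illustrate the concern, here is a small Python sketch (not MPI) of
the two interpretations. The rank table is made up for the example; it
has 5 active ranks spread over a rank space of 0 to 9.

```python
# Hypothetical rank table: True = active, False = failed (inactive).
rank_active = [True, False, True, False, False,
               True, False, True, False, True]

size_all_slots = len(rank_active)    # counts active and inactive ranks
size_active_only = sum(rank_active)  # counts active ranks only
highest_rank = len(rank_active) - 1

# A common idiom loops over "range(size)". With the active-only size,
# any active rank numbered at or above that size is never visited.
missed_active = [r for r, ok in enumerate(rank_active)
                 if ok and r >= size_active_only]
```

With the active-only size, the usual loop over ranks below the size
would skip ranks 5, 7 and 9 even though they are alive.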

> > 10) The API document says this regarding MPI_Comm_Irestore: "It is local in
> > scope, and thus restores local communications (point-to-point, one-sided,
> > data-type creation, etc.), but not collective communications."  If this is
> > true, then how do you restore collective communications?   Can you then go on
> > to collectively call MPI_Comm_restore_all?   If you do, would every rank need
> > to specify that no new ranks are to be created since they have already been
> > created by the earlier call to MPI_Comm_restore?  Also, I don't think it is
> > *really* local in scope, if it was, there would be no reason to have a
> > non-blocking version.
>
>I expect much of this will be clarified with a stronger discussion
>of R^3 (mentioned above). But the non-blocking version, to my understanding, is
>designed to allow execution to continue while restoration takes place. For
>example, a Monte Carlo code can simply say, "That process is not
>there so I will process without its data. But I would like 'it' back
>if possible, so I'll check later."
Isn't this what MPI_Comm_recover_collective() is for?

> > 12) MPI_Comm_restore specifies ranks of communicators that should be restored.
> > I assume it will block until the restored ranks call MPI_Comm_rejoin on those
> > communicators?   (I say that because of the line "[MPI_Comm_restore] is local
> > in scope, and thus restores local communications...".  Restoring local
> > communications to who?   I assume to the process created by MPI_Comm_restore?
> > If it does not block, how does it know when it can safely talk to the restored
> > ranks using any of the communicators they are in?   So, I assume it blocks.
> > That seems to imply that the restored rank MUST call MPI_Comm_rejoin on the
> > communicator referenced in its creation.   If rank 0 calls MPI_Comm_restore()
> > and passes in Rank 3 of communicator FOO, then the restored process must call
> > MPI_Comm_rejoin on communicator FOO.  But when the restored rank 3 calls
> > MPI_Comm_recoverable, it could return several communicators and rank 3 has no
> > idea that it MUST call MPI_Comm_rejoin on some, but is not required to call
> > MPI_Comm_rejoin on others?
>
>If it blocks, the process can't call MPI_Comm_rejoin. So do we have 
>a contradiction in the spec? Here's to hoping that clarification of 
>R^3 makes this question go away :)
I think that MPI_Comm_restore should return at the point where 
subsequent messages sent to the failed process will not cause a 
failure unless that process fails again. In other words, it should be 
like MPI_Send which doesn't guarantee delivery but does guarantee 
that the sender's job is finished. Similarly, MPI_Comm_restore should 
submit a restoration request to the MPI runtime, which takes care of 
everything else. Messages to the process being restored are buffered 
until it is actually restored. If the application sends messages to 
the restored process but it does not rejoin the communicator or 
receive these messages, then the application will quickly run out of 
buffer space. The only issue left is that the application has no way
to control the amount of time it takes to restore a process and thus
the amount of buffer space needed. I suggest that if a process tries to
send a message to a process that is being restored and runs out of
buffer space, it should block until the destination process is ready to
receive the messages.
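
The buffering behavior suggested here can be sketched as a toy Python
model (not MPI). RestoringPeer, try_send and rejoin are hypothetical
names, the fixed slot count is an assumption, and a False return stands
in for the point where a real sender would block.

```python
from collections import deque


class RestoringPeer:
    """Toy stand-in for a rank that is being restored (not real MPI)."""

    def __init__(self, buffer_slots):
        self.buffer_slots = buffer_slots  # assumed fixed runtime buffer
        self.pending = deque()
        self.rejoined = False
        self.delivered = []

    def try_send(self, msg):
        """Return True if accepted; False means a real sender would block."""
        if self.rejoined:
            self.delivered.append(msg)
            return True
        if len(self.pending) < self.buffer_slots:
            self.pending.append(msg)
            return True
        return False

    def rejoin(self):
        """The restored process is ready: flush buffered messages in order."""
        self.rejoined = True
        while self.pending:
            self.delivered.append(self.pending.popleft())


peer = RestoringPeer(buffer_slots=2)
# The third send finds the buffer full and would have to wait.
results = [peer.try_send(m) for m in ("a", "b", "c")]
peer.rejoin()
retry = peer.try_send("c")  # the waiting send can now complete
```

The sender's call succeeds as soon as the message is buffered, much like
MPI_Send, and only stalls once the buffer is exhausted.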

> > 13) What does MPI_COMM_WORLD look like before the new process calls
> > MPI_Comm_rejoin?  If the process was created through a call to
> > MPI_Comm_restore that specified multiple ranks to be restored, are all of
> > those ranks together in an MPI_COMM_WORLD until they call MPI_Comm_rejoin?  Is
> > the MPI_Comm_rejoin call collective across all of those newly created
> > processes or can they all call one at a time at their leisure?
>
>A "broken" communicator will have invalid ranks, revealed to the
>calling process when it attempts to use the rank, and in the manner 
>specified by the FT configuration (default or user specified). But 
>again, the R^3 text will have to clear this up. Also, see the 
>"Discussion" text in the spec for an additional issue. And regarding 
>MPI_COMM_WORLD (and MPI_COMM_SELF), these are by default intended to 
>be restored (from a local view). Hmmm, that statement confuses even 
>me. So he punts back to R^3.
That makes perfect sense. MPI calls MPI_Comm_rejoin on MPI_COMM_WORLD and
MPI_COMM_SELF automatically and leaves the rest to the restored
process. If some members of these communicators have failed,
communication with them will fail as usual.

> > 14) Is there anything we are proposing with MPI_Comm_rejoin/restore that
> > cannot be accomplished with MPI_Comm_spawn and MPI_Intercomm_merge?  The only
> > thing I can think of is that MPI_COMM_WORLD cannot be "fixed" using
> > MPI_Comm_spawn/merge, but only because it is a constant.
>
>My understanding is that we are attempting to bridge the gap between 
>an "invisibly" fault tolerant implementation and a fully user 
>controlled scheme, where that gap may be small (or non-existent?) to large.
I think the biggest improvement is the process respawning and 
communicator rejoining functionality, which is much easier to use and 
requires less synchronization than if the user did it on their own 
using existing APIs.

> > 16) MPI_Comm_restore seems to be based on the idea that some ranks have
> > exited.   What if rank A cannot talk to rank B, but rank B still exists and
> > can talk to rank C?  What does it mean to restore a rank in this case?  None
> > of the ranks are gone, they are just having communication problems.   It seems
> > like there should be some way to come up with a failure free set of ranks such
> > that all the ranks in the set can communicate across all process pairs.
>
>My understanding is that a process in a communicator that cannot 
>communicate with all processes in that communicator indicates a 
>fault. But who is at fault may be the appropriate question that you 
>are asking. R^3 discussion? Your last sentence, however, seems to me 
>to point to a new communicator the user would have to create.
If some communication fault causes MPI on one process to conclude
that another has died and to inform the application of this fact, MPI
is responsible for maintaining this illusion by killing off a process
when communication is restored.

> > 17) Ranks 0, 1, & 2 are in Comm FOO. Rank 2 dies.   Rank 0 calls
> > MPI_Comm_restore({FOO,2}) and can now communicate with 2 once again using
> > point-to-point calls?   Is there a way that 1 can ever restore communication
> > to the new rank 2?   I believe the only way is that all ranks (including the
> > new rank 2) collectively call MPI_Comm_restore({})?  I'm not sure that is a
> > problem, but I wanted to check my understanding of how these calls work.
>
>The first answer that popped into my head contradicts something
>I said above (i.e., the communicator is broken). So R^3?
>
>Just my pass at addressing a broad set of issues. Please, please,
>please, don't try to spare my feelings, just view this as the start
>of what should be a strong discussion.

I think my description of MPI_Comm_restore covers this case. If rank 
1 is sure that rank 0 has already called MPI_Comm_restore(2), then it 
can safely communicate. If not, then it has to call 
MPI_Comm_restore(2) on its own.

Greg Bronevetsky
Computer Scientist
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com