[Mpi3-ft] mpi3-ft Digest, Vol 20, Issue 2

Greg Bronevetsky bronevetsky1 at llnl.gov
Mon Oct 26 14:55:00 CDT 2009

At 12:10 PM 10/22/2009, Solt, David George wrote:
>At this point I am speaking heavily about implementation (apologies 
>to those who are not interested in implementation details, I can 
>take this offline with those who are or are planning to work on an 
>example implementation?)  I want to make sure the API doesn't 
>require something that is not possible or exceedingly difficult to implement.
These are all good questions that need to be answered before we're 
certain that the specification is implementable.

>"If one rank can't communicate with all other ranks, it should be 
>considered as failed" would imply that if the path between rank 0 
>and 1 is known to be broken, but rank 2 can communicate with both 0 
>and 1, then we would want MPI_Comm_validate to return that there are 
>2 failed processes when called (0 and 1).  It almost has to be that 
>way doesn't it?   We can't "fix" the communication problem, we can 
>only request that processes be restored/re-started and both 0 and 1 
>are indistinguishable in terms of their current states (both cannot 
>talk to exactly one other rank in the communicator).
>My suggestion in this case, is that MPI (rank 2 specifically) 
>aggressively kill off rank 0 and 1 during the MPI_Comm_validate 
>call. This may sound extreme, but if rank 0 goes on to call 
>MPI_Comm_restore(rank 1), it may not know that rank 1 still exists 
>and it may not be able to reach rank 1 to kill it.  We would end up 
>with two copies of rank 1.  MPI_Comm_restore() could use some 
>centralized coordinator that is responsible for killing off the old 
>rank 1, but maybe that coordinator is running on a node that also 
>cannot communicate with the node rank 1 is on. During the 
>collective MPI_Comm_validate, there is a window of opportunity where 
>ranks that can reach failed ranks can kill them off to avoid 
>duplicate copies of ranks.  If no rank can talk to a failed rank, 
>then the rank is effectively the same as a rank that exited early 
>and can be ignored.  This is mostly an implementation detail, but if 
>it is allowed, we would want to document in the spec that 
>MPI_Comm_validate is allowed to kill ranks that are experiencing 
>communication failures.
Your solution sounds exactly right, except that in this case we'll 
only need to kill off either rank 0 or rank 1, not both. We just need 
to have a subset of the communicator where everybody can talk to 
everybody else. We should document this behavior as part of advice to 
implementors and users. This killing-off behavior will usually be 
transparent to users, since they'll just see a failure that looks no 
stranger than a regular failure. However, they may be confused into 
thinking that they have a bug in their application, so we should talk 
about it explicitly.
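
To make the "kill only one of them" point concrete, here is a rough 
Python sketch of how an implementation might pick which ranks to kill 
so that the survivors can all talk to each other. The names 
choose_survivors and broken_links are made up for illustration and are 
not part of any proposal. It is a greedy heuristic, so it does not 
promise the largest possible surviving subset, and it assumes every 
rank already agrees on the set of broken links.

```python
def choose_survivors(ranks, broken_links):
    """Greedily drop ranks until every surviving pair can communicate.

    ranks: iterable of rank ids.
    broken_links: iterable of (a, b) pairs whose path is known to be broken.
    Returns (survivors, killed), both as sorted lists.
    """
    survivors = set(ranks)
    broken = {frozenset(link) for link in broken_links}
    killed = []
    while True:
        # For each survivor, count its broken links to other survivors.
        counts = {r: 0 for r in survivors}
        for link in broken:
            if link <= survivors:
                for r in link:
                    counts[r] += 1
        worst = max(counts.values(), default=0)
        if worst == 0:
            break  # every remaining pair can communicate
        # Tie-break on the highest rank id so that every caller
        # computes the same victim from the same input.
        victim = max(r for r, c in counts.items() if c == worst)
        survivors.remove(victim)
        killed.append(victim)
    return sorted(survivors), sorted(killed)
```

For the three-rank example above, choose_survivors([0, 1, 2], [(0, 1)]) 
returns ([0, 2], [1]): only rank 1 is killed, and ranks 0 and 2 remain 
fully connected.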

>A centralized coordinator for MPI_Comm_restore makes things easier 
>for me to envision.  Even there, I wonder if we have enough 
>information for the centralized-coordinator to answer the question, 
>"Do I need to restore this process or did my previous restore of 
>this process satisfy this request".   How do we distinguish between 
>"two ranks asked to restore the same process" and "one rank asked to 
>restore a process, it was restored and now it is failed again, and 
>another process is now asking to restore it".   I think the 
>centralized coordinator has to have the ability to interrogate an 
>existing process directly and find out its current status.  That may 
>require an extra thread or watchdog process or signal handler to 
>ensure that the process being interrogated can respond even if it is 
>not making MPI calls.  If we have that ability, we can do the 
>following inside the central-coordinator:
Your idea is exactly right. Having a centralized but replicated 
coordinator is probably the best way to go. In fact, this 
functionality should probably be integrated into the scheduler. If a 
process has been restored but is no longer functioning, it should 
also be explicitly killed off before a new one is spawned.
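
A minimal sketch of that coordinator logic, in Python for brevity. The 
Coordinator class and the probe, kill and spawn callbacks are all 
hypothetical stand-ins for whatever the scheduler or launcher would 
actually provide, and replication of the coordinator is left out 
entirely.

```python
class Coordinator:
    """Toy stand-in for a centralized restore coordinator (hypothetical)."""

    def __init__(self, probe, kill, spawn):
        # probe(rank) -> True if the current incarnation of the rank
        # answers a status query; kill(rank) and spawn(rank) are
        # supplied by the launcher or scheduler.
        self.probe = probe
        self.kill = kill
        self.spawn = spawn

    def handle_restore(self, rank):
        if self.probe(rank):
            # An earlier restore already satisfied this request.
            return "already-running"
        # Kill the unresponsive incarnation first so that two copies
        # of the same rank never coexist, then start a fresh one.
        self.kill(rank)
        self.spawn(rank)
        return "restored"
```

Two ranks asking to restore the same process then produce one spawn: 
the first request kills and respawns, and the second sees a responsive 
process and does nothing. It does rely on probe being answerable even 
when the target is not inside an MPI call, which is the extra thread 
or watchdog David mentions.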

>If we do not trust that central-coordinator or otherwise think this 
>approach will not work, might we need MPI_Comm_validate to return 
>the current generation number of each rank and MPI_Comm_restore to 
>be able to specify a generation number?   That way it would be 
>easier to determine if it is the first MPI_Comm_restore call 
>targeting a specific generation of rank x?   Just some 
>thoughts.   I'm not actually implementing any of this (yet), so 
>these are just ideas and I'm not married to any of them.

This discussion makes me think that we may not have the best API for 
this. The current semantics of MPI_Comm_restore are optimized to make 
it easier for users to restore a process, omitting all the 
complications with generations and coordination and reducing the 
number of checks required to make sure a process needs to be restored. 
However, it is a poor choice if the application initially wants to 
restore a given process after failure but then changes its mind after 
subsequent failures. In this case we have a race condition where an 
MPI_Comm_restore request from a previous generation may cause a 
restore operation after a later failure, when the application had not 
meant to restore the process. I'm not sure if this use case is going 
to be very common, but having a race condition like this is a notable 
hole in the spec and, as David points out, the solution is trivial: 
add a generation number field for each rank in MPI_Comm_validate and 
use these generations in MPI_Comm_restore.
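
To illustrate how generation numbers close the race, here is a small 
Python model. GenerationTable and its methods are invented for this 
sketch; they only mimic what MPI_Comm_validate and MPI_Comm_restore 
might do with a per-rank generation field, and the real calls may end 
up looking quite different.

```python
class GenerationTable:
    """Toy model of per-rank generation tracking (hypothetical)."""

    def __init__(self, ranks):
        self.generation = {r: 0 for r in ranks}
        self.failed = set()

    def mark_failed(self, rank):
        self.failed.add(rank)

    def validate(self):
        # Report {rank: (generation, is_failed)}, as a validate call
        # extended with generation numbers might.
        return {r: (g, r in self.failed) for r, g in self.generation.items()}

    def restore(self, rank, generation):
        # Act only if the request names the generation that is
        # currently failed; stale or duplicate requests are ignored.
        if generation != self.generation[rank] or rank not in self.failed:
            return False
        self.failed.discard(rank)
        self.generation[rank] += 1
        return True
```

With this, a second restore(1, 0) after rank 1 has already been 
restored is a no-op, and it stays a no-op even if the new incarnation 
of rank 1 fails later. The application has to look at the new 
generation from validate and ask again explicitly if it still wants 
the process back.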

Greg Bronevetsky
Computer Scientist
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
