[Mpi3-ft] MPI_Comm_validate_all and progression
Solt, David George
david.solt at hp.com
Thu Apr 28 14:22:02 CDT 2011
As some of you are aware, I've always questioned whether MPI_Comm_validate_all can be implemented. I stopped voicing that question a while ago, but since I will actually have to implement this if it gets passed, I've been looking at it further. In particular, I started reading the papers that Josh posted in the run-through stabilization discussion.
I have argued in the past, based on papers I've read, that any deterministic algorithm which relies on messages to reach consensus will have a "last" message in any possible execution of the algorithm. The content of that message cannot possibly affect whether the receiving process commits, because the message could be lost to failure: the algorithm would have to "work" whether or not the last message was received, and therefore an algorithm identical to the first, but without this "last message", would be equally valid. That variant also has some "last message", which can be removed by the same logic, and so on until we arrive at an algorithm which uses no messages. Clearly a consensus algorithm which uses no messages would be either invalid or useless.
From what I can tell, the 3-phase commit with termination protocol, which is touted as a possible implementation for MPI_Comm_validate_all, avoids this by being non-terminating in some respects. Any message which results in a failure will lead to more messages being sent until failures cease or there is only one rank left in the system. The key implication of this is that some ranks may have committed and consider themselves "done" with the algorithm, yet other ranks that have detected rank failures may still call upon a "done" rank to aid in their consensus. I would love feedback as to whether I am understanding this correctly.
From a practical standpoint, this means that a rank may leave MPI_Comm_validate_all and still be called on to participate in the termination protocol invoked by another rank that detected a failure late within MPI_Comm_validate_all. This assumes that the MPI progression engine is always active and that ranks can depend on making progress within an MPI call regardless of whether remote ranks are currently within an MPI call. I believe we already introduced the same issue with MPI_Cancel in MPI 1.0, and again with the MPI-3 RMA one-sided buffer attach operation. However, we have consistently been non-committal on the issue of progression. We know that MPI implementations do not generally have a progression thread, and users have resorted to odd behaviors such as calling MPI_Testany() sporadically within the computation portion of their code to improve progression of non-blocking calls. If I am correct in all this, then at some point we have to be honest and acknowledge in the standard that implementing certain MPI features requires either active messages or a communication thread.
The 3-phase commit also makes two assumptions about the network: 1) the network never fails, only sites fail; 2) site failures are detected and reported by the network. In reality, for most networks, failures are possible, and it is often impossible to distinguish a network failure from a remote rank failure. Therefore, when a network failure is detected, it must be converted to a rank failure in order to preserve the assumptions of the 3-phase commit. If rank A detects a communication failure between ranks A and B, it must assume that rank B may still be alive and will conclude that rank A is dead. Rank A must therefore exit, but not before cleanly disconnecting from all other ranks, lest they perceive rank A's exit as a network failure, leading to a domino effect of all ranks exiting. This is not difficult, but it may be an interesting bit of information for MPI implementers.