[Mpi3-ft] MPI_Comm_validate_all and progression

Thu Apr 28 17:46:49 CDT 2011

On Apr 28, 2011, at 3:43 PM, Darius Buntinas wrote:

> Hi Dave,
> 
> I've been struggling with this as well for the last few weeks as I've been implementing validate_all for MPICH.  Comments inline.
> 
> On Apr 28, 2011, at 2:22 PM, Solt, David George wrote:
> 
>> As some of you are aware, I've always questioned whether MPI_Comm_validate_all can be implemented.  I've stopped questioning that for a while now, but since I have to actually implement this if it gets passed, I've been looking at it further.  In particular, I started reading up on the papers that Josh posted in the run-through stabilization.  
>> 
>> I have argued in the past based on some papers I've read that any deterministic algorithm which relies on messages to arrive at consensus will have a "last" message for any possible execution of the algorithm.   The content of that message cannot possibly impact whether the receiving process commits because the message could experience failure.   The algorithm would have to "work" whether the last message was received successfully or not and therefore an algorithm that was identical to the first, but did not have this "last message" would be equally valid.   This variation of the algorithm also has some "last message", which could therefore be removed using the same logic, until we arrived at an algorithm which uses no messages.   Clearly a consensus algorithm which uses no messages would be either invalid or useless. 
>> 
>>> From what I can tell, the 3-phase commit with termination protocol, which is touted as a possible implementation for MPI_Validate_all, avoids this by being non-terminating in some ways.  Any message which results in a failure will lead to more messages being sent until failures cease or there is only one rank in the system.  The key implication of this is that some ranks may have committed and are considered to be "done" with the algorithm, but other ranks that have detected rank failures may call upon this "done" rank to aid in their consensus.  I would love feedback as to whether I am understanding this correctly. 
> 
> This is how I've implemented it.  Even after the process returns from validate_all(), the library still responds to protocol messages because one's never sure that everyone has terminated (or terminated the termination protocol).

Yep. That's is one of the key properties of the termination protocol within both the two and three phase commit protocols. It is a pretty common requirement for exactly the reasoning that you present.

One thing to keep in mind, as an implementation alternative, is that an external 'oracle' could be assisting in the consensus process. So instead of relying on a 'done' process for a peer to 'catch-up' (for example) the oracle could be consulted directly, out-of-band. In practice the oracle could be provided as part of the runtime environment supporting the MPI. I mention this only to allow folks to think about resources outside of the specific collective algorithm that might help in the collective algorithm. Of course there are other problems with this technique, but that is for another time. ;)

> 
>>> From a practical standpoint, this means that a rank may leave MPI_Comm_validate_all and still be called-on to participate in the termination protocol invoked by another rank that detected a failure late within MPI_Comm_validate_all.  This assumes that the MPI progression engine is always active and that ranks can depend on making progress within an MPI call regardless of  whether remote ranks are currently within an MPI call.  I believe we have already introduced the same issue with MPI_Cancel in MPI1.0 and again with the MPI-3 RMA one-sided buffer attach operation.  However, we have consistently been non-committal on the issue of progression.  We know that MPI implementation do not generally have a progression thread and users have resorted to odd behaviors like calling MPI_Testany() sporadically within the computation portion of their code to improve progression of non-blocking calls.  If I am correct in all this, then at some point we have to be honest and acknowledge in!
>> the standard that implementing certain MPI features requires either active messages or a communication thread.    
> 
> Well, technically speaking all processes need to call at least one more MPI function after they call validate_all(): finalize().  There shouldn't be any deadlocks introduced by this because if a process is blocked waiting for another process, the blocked process will be calling an MPI function (either polling, or blocking in something like wait()).  This same thing happens with rendezvous, and as you mentioned cancel, so this doesn't add a new problem.

I think Darius makes a good point. In the Open MPI implementation I rely on the 'done' process still progressing to check if a peer needs to 'catch up' from a previous iteration. However the progress is not required to happen in a progress thread (though it could), but could (more likely) happen during the next MPI operation which could very well be MPI_Finalize.

Checking for 'catch-up' messages might be stopped if it is absolutely known that no outstanding validates are in progress, but that is a tough thing to guarantee some times.

> Personally, I'm going to try to stay out of the mess of trying to define progress portably.

Me too :)

> 
>> Another assumption of the 3-phase commit is around the network:  1) The network never fails, only sites fail, 2) site failures are detected and reported by the network.  In reality, for most networks, failures are possible and it is often impossible to distinguish between a network failure and a remote rank failure.   Therefore, when a network failure is detected, it must be converted to a rank failure in order to maintain the assumptions of the 3-phase commit.  If rank A detects a communication failure between rank A & B, it must assume that rank B may be alive and will conclude rank A is dead.  Rank A must therefore exit, but not before cleanly disconnecting from all other ranks lest they perceive rank A's exit as a network failure leading to a domino effect of all ranks exiting.  This is not difficult, but may be an interesting bit of information for MPI implementers.    
> 
> We don't rely on comm failures at the library to detect processor failures.  We have the process manager detect failures, and once the process manager decides a process is dead, it's dead.  (Yeah I'm punting this to the process manager.)

We use our runtime environment's process and error management services (functionality similar to the process manager in MPICH2) as the primary mechanism for discovering process failures. If a peer sees a link drop at the MPI level, it will send a notice through the runtime environment indicating a potential peer failure. The runtime will either confirm or reject this and progress appropriately. It is a bit of a work around, but it works for us.

Network partitioning can be a complex fault scenario, and one not explicitly discussed in the current proposal (though would be a great carryon topic/extension). If process A cannot talk to process B then both sides needs to figure out if the other is dead or just disconnected. Usually one side (say process A) will try to find a majority set of processes to agree that process B is dead - or at least disconnected from the majority of processes. Process B will be doing the same thing. The minority group will lose and the majority group will win. The losing group (in this case just process B) will self terminate because they have been orphaned by a network partition failure.

If a peer process (process C) can talk to process B, then A and B can use C as a route for inter process messaging. This can be handled under the covers in MPI, if that scenario is possible in your favorite supported environment.

The common scenario discussed here is the split-brained network partition problem in which two equal groups must decide what the fate of the other group was and how to continue. This is usually 'solved' by the majority wins rule, though it is not quite an elegant solution.

There are both performance and programability issues with network partitioning, which is why it would be interesting to look at (in a secondary, carryon proposal) what types of errors, interfaces, and semantics would be useful to the user in this situation. If A and B are going to route through C, there is a performance penalty and doing so under the covers might not be desirable. So we might explore a semantic where when A sends to B it gets a 'MPI_ERR_NOT_DIRECTLY_ATTACHED' error (same for B talking to A if it is a bi-directional problem). Then the application knows that any further communication with B will either return an error or impose on other processes (might be good to know which other processes).

For the current proposal we assume that process failure can be discovered, and network failures can be addressed (either by selectively terminating isolated processes, or re-connecting those processes through another means). 

Does that help?

Josh

> 
> -d
> 
>> Thanks,
>> Dave
>> 
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> 
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft