[Mpi3-ft] Weekly con-calls

Mon Sep 22 12:51:54 CDT 2008

>Let's resume these meetings on Wed next week, 9/24/2008, noon-2pm Eastern
>time; moving this an hour later does not work for the people from Japan.
>What we want to talk about next time are specific user use-cases.  We have
>talked so far in generalities, but now need to get to specifics in terms of
>how apps actually want to use this functionality.  As an example, take the
>case of client/server apps, with a failure in one of the "clients".  If the
>client is a member of the remote group in the intercommunicator, fully
>defining error scenarios in the current MPI-2 dynamics should be sufficient

Let's start talking about this scenario. The first 
thing to talk about is the functionality that may 
be desired by the application. There are two 
cases to consider. When the client is not 
expecting a response from the server, its failure 
is irrelevant to the server and the server 
doesn't want to be informed of the failure. 
However, if the client fails with pending 
communication from the server, the server does 
want to be informed of the failure, if only to 
free the internal state associated with this 
client. As such, the application may desire one 
of two things. First, it may wish MPI to inform 
it every time it tries to communicate with a 
failed process. This way the server is notified 
during its next send or receive operation to/from 
the client and is thus able to perform the 
appropriate cleanup operations. Second, if 
the server wants to respond as quickly as 
possible, it can ask MPI to notify it via a 
callback when the client fails. We've already 
discussed both types of interfaces in the group. 
More specifically, MPI should either inform the 
application during the application's next call 
that uses the above intercommunicator or invoke 
the callback associated with this intercommunicator.
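
To make the two notification styles concrete, here is a minimal sketch 
in plain C. It does not use MPI at all: mock_intercomm, mock_send, 
mock_mark_failed, mock_set_failure_callback and count_cleanup are all 
hypothetical names invented for this illustration, and the "runtime" 
is just a table of dead peers. In real MPI the closest existing hooks 
are MPI_ERRORS_RETURN and a user error handler attached with 
MPI_Comm_set_errhandler, but the standard does not currently say what 
state MPI is in after such an error, which is part of what we need to 
define.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS 8

/* Hypothetical status codes; not MPI error classes. */
typedef enum { COMM_OK = 0, COMM_ERR_PEER_FAILED = 1 } comm_status;

/* Hypothetical callback type: invoked once when a peer is declared failed. */
typedef void (*failure_cb)(int peer, void *user_data);

/* Stand-in for the server's view of an intercommunicator. */
typedef struct {
    bool failed[MAX_PEERS];  /* remote ranks known to be dead */
    failure_cb on_failure;   /* NULL means no eager notification */
    void *user_data;
} mock_intercomm;

void mock_init(mock_intercomm *c) {
    for (int i = 0; i < MAX_PEERS; ++i) c->failed[i] = false;
    c->on_failure = NULL;
    c->user_data = NULL;
}

/* Style 2 (eager): register a callback tied to this communicator. */
void mock_set_failure_callback(mock_intercomm *c, failure_cb cb, void *ud) {
    c->on_failure = cb;
    c->user_data = ud;
}

/* Called by the mock runtime when it learns that a peer died.
 * The callback fires at most once per peer. */
void mock_mark_failed(mock_intercomm *c, int peer) {
    assert(peer >= 0 && peer < MAX_PEERS);
    if (c->failed[peer]) return;
    c->failed[peer] = true;
    if (c->on_failure != NULL) c->on_failure(peer, c->user_data);
}

/* Style 1 (lazy): the server learns of the failure only when its next
 * operation on this communicator targets the dead peer. */
comm_status mock_send(mock_intercomm *c, int peer) {
    assert(peer >= 0 && peer < MAX_PEERS);
    return c->failed[peer] ? COMM_ERR_PEER_FAILED : COMM_OK;
}

/* Example server-side callback: count clients whose state was freed. */
void count_cleanup(int peer, void *user_data) {
    (void)peer;
    ++*(int *)user_data;
}
```

A server that does not care about idle clients would use only the 
return code from mock_send; one that wants to free per-client state 
immediately would also register something like count_cleanup.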

Looking at the problem at a lower level, we can 
consider the various low-level problems that may 
cause client "failure" and what MPI must do to 
support the application's recovery efforts. The 
first possibility is that the client node fails. 
MPI can detect this event by using heart-beat 
messages or by noticing that a message sent to 
the client has timed out. In this case MPI should 
use one of the above APIs to inform the other 
processes that further communication to/from the 
client will fail. Another possibility is that 
some network link fails, disabling communication 
between this client and this server but allowing 
communication between other rank pairs. In this 
case MPI may inform the application about the 
pairs of ranks that can no longer communicate. 
Another option would be for MPI to kill half the 
processes that are affected by this problem, 
ensuring that all the remaining processes can 
communicate freely with each other. It would then 
inform the application of these process deaths as 
above. The final possibility is a network 
partition where a subset of processes cannot 
communicate with the remaining processes. MPI 
would treat this case as one of the subsets 
failing and kill all the processes in this 
subset. It would then inform the processes in the other subset of these deaths.
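
The link-failure and partition cases can be sketched as a small 
connectivity computation, again in plain C with no MPI calls. 
Everything here is an assumption made for illustration: the fixed 
NPROC, the symmetric link matrix, the names choose_survivors and 
broken_pairs, and in particular the policy that the largest connected 
subset survives (ties go to the subset containing the lowest rank). 
The discussion above does not say which subset MPI should kill.

```c
#include <assert.h>
#include <stdbool.h>

#define NPROC 6

/* Label every rank reachable from 'start' with 'label' (iterative DFS).
 * Each rank is labeled when pushed, so it is pushed at most once. */
static void flood(bool link[NPROC][NPROC], int start, int label,
                  int comp[NPROC]) {
    int stack[NPROC];
    int top = 0;
    stack[top++] = start;
    comp[start] = label;
    while (top > 0) {
        int u = stack[--top];
        for (int v = 0; v < NPROC; ++v) {
            if (link[u][v] && comp[v] < 0) {
                comp[v] = label;
                stack[top++] = v;
            }
        }
    }
}

/* Partition policy (assumed): keep the largest connected subset and
 * treat every other rank as failed. Returns the number of ranks killed. */
int choose_survivors(bool link[NPROC][NPROC], bool survive[NPROC]) {
    int comp[NPROC];
    int size[NPROC] = {0};
    int ncomp = 0;

    for (int i = 0; i < NPROC; ++i) comp[i] = -1;
    for (int i = 0; i < NPROC; ++i)
        if (comp[i] < 0) flood(link, i, ncomp++, comp);
    for (int i = 0; i < NPROC; ++i) size[comp[i]]++;

    int best = 0;
    for (int c = 1; c < ncomp; ++c)
        if (size[c] > size[best]) best = c;

    int killed = 0;
    for (int i = 0; i < NPROC; ++i) {
        survive[i] = (comp[i] == best);
        if (!survive[i]) killed++;
    }
    return killed;
}

/* Link-failure case: count surviving rank pairs that have lost their
 * direct link. These are the pairs MPI could report to the application
 * instead of killing anyone. */
int broken_pairs(bool link[NPROC][NPROC], bool survive[NPROC]) {
    int count = 0;
    for (int i = 0; i < NPROC; ++i)
        for (int j = i + 1; j < NPROC; ++j)
            if (survive[i] && survive[j] && !link[i][j]) count++;
    return count;
}
```

Note that a single failed node looks exactly like a partition whose 
smaller subset has one member, so this sketch handles the first and 
last cases with the same code path.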

The above is essentially a summary of what we've 
talked about thus far, as applied to this 
example. This is good, since it suggests that 
we understand the problem well enough, at least 
for this purpose. The question then is: can 
anybody think of additional details that we need to consider?

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov