[Mpi3-ft] A need to detect any failure

Adam Moody moody20 at llnl.gov
Fri Aug 1 13:03:21 CDT 2008

Hi all,
I like Erez's idea to associate errors with call sites.  However, I 
still believe there is a case where processes do not communicate with 
each other directly, but will need to know when any process in the 
system dies.  Here is an example.

Let's say an application splits MPI_COMM_WORLD into a 2D Cartesian
grid.  Then each process creates a "row" communicator and a "column"
communicator.  From here on out, each process only communicates through
its row and column communicators.  (I think this is what a number of our
codes really do, so this is a very realistic example.)  In the case of a
failure, assume this application requires a global rollback, and assume
that each process writes a checkpoint file at the end of each iteration
of its main loop, which looks like the following:

for (iteration=0;  iteration<numTimesteps;  iteration++) {

Now consider processes (i, j) and (i+1, j+1) in this 2D Cartesian grid.
Because these two processes are in different rows and columns, they
don't share a row or column communicator, and so they never communicate
with each other directly.  Now, assume process (i+1, j+1) fails.
Process (i, j) needs to roll back, but how will it be notified?
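
To make the scenario concrete, here is a minimal sketch in plain C (no
MPI calls) of which ranks end up sharing a communicator.  It assumes a
row-major rank layout, which is the ordering MPI Cartesian topologies
use.  The names GridPos, grid_pos and share_communicator are
hypothetical helpers for illustration only; they are not part of MPI or
of any of our codes.

```c
#include <stdbool.h>

/* Position of a rank in the 2D grid (hypothetical helper type). */
typedef struct {
    int row;
    int col;
} GridPos;

/* Map a rank to its grid position, assuming a row-major layout
 * with `cols` columns. */
GridPos grid_pos(int rank, int cols)
{
    GridPos p = { rank / cols, rank % cols };
    return p;
}

/* True if the two ranks would be members of the same "row" or the
 * same "column" communicator, i.e. they can communicate directly. */
bool share_communicator(int rank_a, int rank_b, int cols)
{
    GridPos a = grid_pos(rank_a, cols);
    GridPos b = grid_pos(rank_b, cols);
    return a.row == b.row || a.col == b.col;
}
```

In a grid with 4 columns, ranks 5 and 10 sit at (1, 1) and (2, 2), so
share_communicator(5, 10, 4) is false, while rank 5 does share a
communicator with rank 6 at (1, 2) and with rank 9 at (2, 1).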

One solution would be to force the application to call
MPI_Barrier(MPI_COMM_WORLD) inside its iteration loop.  While this
would work, it seems costly, and it undermines the effort the
application team put into making the code scalable by using just row
and column communicators.

Another solution may be to rely on daisy chaining.  That is, in the next
iteration after process (i+1, j+1) dies, processes (i+1, j) and (i, j+1)
may find out, since they each share a communicator with the failed
process.  Then in the following iteration, processes (i+1, j) and (i,
j+1) could propagate this failure message to process (i, j), since they
each share a communicator with it.  This would also work, but a special
error code would be needed, since the communication with (i+1, j) and
(i, j+1) may have succeeded just fine.
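
The daisy-chain idea can be sketched as a small simulation in plain C
(again no MPI calls).  This is only a model under stated assumptions:
every surviving process talks to every member of its row and column
communicators in each iteration, and knowledge of the failure rides
along with that communication.  The 4x4 grid size and the function name
iterations_until_notified are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ROWS 4   /* assumed grid height */
#define COLS 4   /* assumed grid width  */

/* Returns the number of iterations after the failure of (fr, fc)
 * until process (r, c) knows about it, or -1 if it never finds out
 * (which in this model only happens for the failed process itself). */
int iterations_until_notified(int r, int c, int fr, int fc)
{
    assert(r >= 0 && r < ROWS && c >= 0 && c < COLS);
    assert(fr >= 0 && fr < ROWS && fc >= 0 && fc < COLS);

    bool knows[ROWS][COLS] = {{false}};
    bool next[ROWS][COLS];

    for (int iter = 1; iter <= ROWS + COLS; iter++) {
        memcpy(next, knows, sizeof knows);
        for (int a = 0; a < ROWS; a++) {
            for (int b = 0; b < COLS; b++) {
                if (a == fr && b == fc)
                    continue;           /* the dead process learns nothing */
                /* Direct detection: shares a row or column communicator
                 * with the failed process. */
                if (a == fr || b == fc)
                    next[a][b] = true;
                /* Relayed detection: a row or column peer already knew
                 * at the start of this iteration. */
                for (int k = 0; k < COLS; k++)
                    if (knows[a][k])
                        next[a][b] = true;
                for (int k = 0; k < ROWS; k++)
                    if (knows[k][b])
                        next[a][b] = true;
            }
        }
        memcpy(knows, next, sizeof knows);
        if (knows[r][c])
            return iter;
    }
    return -1;
}
```

With a failure at (2, 2), the neighbors (1, 2) and (2, 1) are notified
after one iteration and (1, 1) after two, matching the description
above.  Under these assumptions the delay never exceeds two iterations,
but the relayed notification still needs its own error code, because
the communication that carries it did not itself fail.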

In the current standard, MPI handles this type of failure by invoking 
the error handler on MPI_COMM_WORLD.  This could be yet another solution.
