[Mpi3-ft] A need to detect any failure
Adam Moody
moody20 at llnl.gov
Fri Aug 1 13:03:21 CDT 2008
Hi all,
I like Erez's idea to associate errors with call sites. However, I
still believe there is a case where processes do not communicate with
each other directly, but will need to know when any process in the
system dies. Here is an example.
Let's say an application splits MPI_COMM_WORLD into a 2D Cartesian
grid. Then each process creates a "row" communicator and a "column"
communicator. From here on out, each process communicates only through
its row and column communicators. (I think this is what a number of our
codes really do, so this is a very realistic example.) In the case of a
failure, assume this application requires a global rollback, and assume
that each process writes a checkpoint file at the end of each iteration,
so that the main loop looks like the following:
for (iteration = 0; iteration < numTimesteps; iteration++) {
    row_xchange();
    column_xchange();
    compute();
    checkpoint(iteration);
}
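For concreteness, here is a minimal sketch (my addition, not from the
example above) of how such a grid and its row and column communicators
might be built. MPI_Dims_create, MPI_Cart_create, and MPI_Cart_sub are
standard MPI calls; create_grid_comms and the variable names are
illustrative:

    #include <mpi.h>

    /* Build a 2D Cartesian grid over MPI_COMM_WORLD and split it into
     * row and column communicators. */
    void create_grid_comms(MPI_Comm *row_comm, MPI_Comm *col_comm)
    {
        MPI_Comm grid_comm;
        int nprocs;
        int dims[2]    = {0, 0};  /* let MPI_Dims_create pick a factorization */
        int periods[2] = {0, 0};  /* non-periodic in both dimensions */

        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 2, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid_comm);

        /* keep dimension 1 (column index varies): processes in the same row */
        int keep_cols[2] = {0, 1};
        MPI_Cart_sub(grid_comm, keep_cols, row_comm);

        /* keep dimension 0 (row index varies): processes in the same column */
        int keep_rows[2] = {1, 0};
        MPI_Cart_sub(grid_comm, keep_rows, col_comm);
    }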
Now consider processes (i, j) and (i+1, j+1) in this 2D Cartesian grid.
Because these two processes are in different rows and columns, they
don't share a row or column communicator, and so they never communicate
with each other directly. Now, assume process (i+1, j+1) fails.
Process (i, j) needs to roll back, but how will it be notified?
One solution would be to force the application to call
MPI_Barrier(MPI_COMM_WORLD) inside of its iteration loop. While this
would work, it seems costly, and it defeats the effort the application
team invested in making the code scalable by using only row and column
communicators.
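As a sketch (my illustration, assuming the error handler on
MPI_COMM_WORLD has been set to MPI_ERRORS_RETURN so that a failure
surfaces as a non-MPI_SUCCESS return code, and assuming a hypothetical
rollback() recovery routine), the loop might become:

    for (iteration = 0; iteration < numTimesteps; iteration++) {
        row_xchange();
        column_xchange();
        compute();

        /* every process synchronizes here, so a failure anywhere
         * surfaces to everyone, at the cost of a global collective */
        if (MPI_Barrier(MPI_COMM_WORLD) != MPI_SUCCESS) {
            rollback();  /* hypothetical: restore the last checkpoint */
            continue;
        }

        checkpoint(iteration);
    }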
Another solution may be to rely on daisy chaining. That is, in the next
iteration after process (i+1, j+1) dies, processes (i+1, j) and (i, j+1)
may find out, since they each share a communicator with the failed
process. Then, in the following iteration, processes (i+1, j) and
(i, j+1) could propagate this failure message to process (i, j), since
they each share a communicator with it. This would also work, but a
special error code would be needed, since the communication with
(i+1, j) and (i, j+1) may itself have succeeded just fine.
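One way to implement this propagation (a sketch of my own, assuming the
exchange routines return an MPI error code and that collectives on a
communicator containing a failed process report an error to the
survivors) is to piggyback a failure flag on the existing row/column
pattern. Reducing the flag over the row communicator and then the
column communicator spreads knowledge of a failure to every process
without ever touching MPI_COMM_WORLD:

    int err_flag = 0;  /* 1 once this process has seen or heard of a failure */
    int any_err  = 0;

    for (iteration = 0; iteration < numTimesteps; iteration++) {
        if (row_xchange()    != MPI_SUCCESS) err_flag = 1;
        if (column_xchange() != MPI_SUCCESS) err_flag = 1;

        /* spread the flag along rows, then along columns */
        MPI_Allreduce(&err_flag, &any_err, 1, MPI_INT, MPI_MAX, row_comm);
        MPI_Allreduce(&any_err, &err_flag, 1, MPI_INT, MPI_MAX, col_comm);

        if (err_flag) {
            rollback();  /* hypothetical: restore the last checkpoint */
            err_flag = 0;
            continue;
        }

        compute();
        checkpoint(iteration);
    }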
In the current standard, MPI handles this type of failure by invoking
the error handler on MPI_COMM_WORLD. This could be yet another solution.
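For completeness, here is what relying on that mechanism might look
like from the application side (a sketch; the handler body and the
failure_seen flag are illustrative, while MPI_Comm_create_errhandler
and MPI_Comm_set_errhandler are the standard MPI-2 calls):

    #include <mpi.h>

    static volatile int failure_seen = 0;

    /* Invoked by the MPI library when an error is raised on
     * MPI_COMM_WORLD; here we just record that a rollback is needed. */
    static void world_errhandler(MPI_Comm *comm, int *errcode, ...)
    {
        failure_seen = 1;
    }

    /* call once during startup, after MPI_Init */
    void install_world_errhandler(void)
    {
        MPI_Errhandler eh;
        MPI_Comm_create_errhandler(world_errhandler, &eh);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
    }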
-Adam