[Mpi3-ft] Point-to-point Communications recovery
Greg Bronevetsky
bronevetsky1 at llnl.gov
Thu Oct 23 16:43:23 CDT 2008
>This implies that layered s/w (application or middleware) would be
>responsible for regenerating any lost traffic, if this is
>needed. Perhaps it would make sense to provide the ability for
>upper layers to register a recovery function that could be called
>after repair() is done with restoring MPI internal state.
Would this function be called on the surviving processes? I'm not
sure that it is necessary. It seems that a simpler way to go is to
allow any subsequent MPI calls to block as necessary until the
appropriate repairs have been completed. This way the MPI
implementation can choose exactly when to do individual repairs and
may in fact choose to perform the repair between the time it detects
the problem and the time it informs the application.
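To make the blocking alternative concrete, here is a minimal sketch in plain Python, not real MPI. The `Comm` class, its `needs_repair` flag, and the `detect_failure()` and `_repair()` methods are all hypothetical names made up for illustration; nothing like them exists in the MPI standard. The only point is that the repair runs inside the next communication call, so no recovery callback has to be registered.

```python
class Comm:
    """Toy stand-in for an MPI communicator (hypothetical, not a real MPI API)."""

    def __init__(self):
        self.needs_repair = False  # set when a peer failure is detected
        self.repairs_done = 0      # counts how many repairs actually ran
        self.sent = []             # messages handed off after any pending repair

    def detect_failure(self):
        # The library notes the failure but is free to defer the repair.
        self.needs_repair = True

    def _repair(self):
        # Placeholder for restoring the library's internal state.
        self.repairs_done += 1
        self.needs_repair = False

    def send(self, dest, payload):
        # Any subsequent call first completes pending repairs, then proceeds.
        if self.needs_repair:
            self._repair()
        self.sent.append((dest, payload))


comm = Comm()
comm.send(1, b"before failure")
comm.detect_failure()
comm.send(1, b"after failure")  # the repair happens inside this call
```

In a real implementation the repair could just as well run earlier, between detection and the next call; the sketch only shows the latest point at which it must have completed.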
>Greg, is the piggy-back capability you are asking for intended to
>help get the application-level communications reset to the state
>just before failure, so that the application can continue?
Yes, exactly. Piggybacking + non-blocking everything makes it
possible for middleware to coordinate checkpointing and log enough
message and non-determinism information to restore any needed
communication. Having a piggybacking API is necessary to ensure good
performance. It is ok not to have all operations be non-blocking but
if that is the case, we'll need good semantics on what happens when a
process participates in a communication that involves the failed
process (I'll define "good" later since I doubt the details matter now).
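As a rough illustration of what middleware might do with piggybacked data, here is a small plain-Python sketch (again, not MPI). It assumes the piggybacked value is a checkpoint epoch number, which is only one possible choice; the `attach`, `detach`, and `Receiver` names are hypothetical. The receiver logs any message whose sender epoch is older than its own, since such a message crossed a checkpoint line and would be needed for replay.

```python
import struct

# One unsigned 32-bit epoch number in network byte order (an assumed layout).
HEADER = struct.Struct("!I")


def attach(epoch, payload):
    """Prepend the sender's checkpoint epoch to the payload."""
    return HEADER.pack(epoch) + payload


def detach(message):
    """Split a message into (sender_epoch, payload)."""
    (epoch,) = HEADER.unpack_from(message)
    return epoch, message[HEADER.size:]


class Receiver:
    def __init__(self, epoch):
        self.epoch = epoch  # receiver's current checkpoint epoch
        self.log = []       # payloads that must be saved for replay

    def receive(self, message):
        sender_epoch, payload = detach(message)
        if sender_epoch < self.epoch:
            # The sender's checkpoint records this message as sent, but ours
            # does not record it as received, so keep a copy for replay.
            self.log.append(payload)
        return payload


r = Receiver(epoch=2)
r.receive(attach(1, b"late"))     # crosses the checkpoint line, gets logged
r.receive(attach(2, b"current"))  # same epoch, nothing to log
```

Without a piggybacking API the epoch would have to travel in a separate message or be copied into a staging buffer as done here, which is where the performance concern comes from.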
Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov