[Mpi3-ft] Point-to-point Communications recovery

Greg Bronevetsky bronevetsky1 at llnl.gov
Thu Oct 23 16:43:23 CDT 2008


>This implies that layered s/w (application or middleware) would be 
>responsible for regenerating any lost traffic, if this is 
>needed.  Perhaps it would make sense to provide the ability for 
>upper layers to register a recovery function that could be called 
>after repair() is done with restoring MPI internal state.
Would this function be called on the surviving processes? I'm not 
sure that it is necessary. It seems that a simpler way to go is to 
allow any subsequent MPI calls to block as necessary until the 
appropriate repairs have been completed. This way the MPI 
implementation can choose exactly when to do individual repairs and 
may in fact choose to perform the repair between the time it detects 
the problem and the time it informs the application.
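To make the "block until repaired" idea concrete, here is a minimal sketch in plain Python rather than real MPI. The class LazyRepairComm and its methods detect_failure(), finish_repair() and send() are hypothetical names invented for this illustration; no such interface exists in MPI or in any proposal discussed here. The only point shown is that a call made after a failure waits until the library has finished its repair, without the application registering a recovery function.

```python
import threading


class LazyRepairComm:
    """Toy stand-in for a communicator (hypothetical, not an MPI API)."""

    def __init__(self):
        self._healthy = threading.Event()
        self._healthy.set()  # no repair is pending at start-up
        self.delivered = []

    def detect_failure(self):
        # The library notices a failed peer; later calls must wait.
        self._healthy.clear()

    def finish_repair(self):
        # The library has restored its internal state, at a time it chose.
        self._healthy.set()

    def send(self, payload, timeout=None):
        # Block until any pending repair has completed, then proceed.
        if not self._healthy.wait(timeout):
            raise TimeoutError("repair did not complete in time")
        self.delivered.append(payload)


comm = LazyRepairComm()
comm.send("before failure")

comm.detect_failure()
# Simulate the library completing its repair in the background.
repair = threading.Timer(0.05, comm.finish_repair)
repair.start()

comm.send("after failure", timeout=5.0)  # waits for the repair, then sends
repair.join()
print(comm.delivered)
```

The application code has the same shape before and after the failure; when the repair happens is left entirely to the implementation.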

>Greg, is the piggy-back capability you are asking for intended to 
>help get the application-level communications reset to the state 
>just before failure, so that the application can continue?

Yes, exactly. Piggybacking + non-blocking everything makes it 
possible for middleware to coordinate checkpointing and log enough 
message and non-determinism information to restore any needed 
communication. Having a piggybacking API is necessary to ensure good 
performance. It is ok not to have all operations be non-blocking but 
if that is the case, we'll need good semantics on what happens when a 
process participates in a communication that involves the failed 
process (I'll define "good" later since I doubt the details matter now).
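As an illustration of what piggybacking buys the middleware, the sketch below simulates two endpoints in plain Python. Each message carries the sender's checkpoint epoch, and the receiver logs any message that was sent before its own latest checkpoint, since such a message would be missing after a restart from that checkpoint. The class PiggybackEndpoint, the epoch counter and the logging rule are assumptions made for this example (one common scheme for coordinated checkpointing), not an MPI API and not necessarily the exact protocol intended above.

```python
from collections import deque


class PiggybackEndpoint:
    """Toy process endpoint (hypothetical, not an MPI API)."""

    def __init__(self):
        self.epoch = 0        # number of checkpoints taken so far
        self.inbox = deque()  # in-flight messages addressed to this endpoint
        self.log = []         # payloads that must be replayed after a restart

    def checkpoint(self):
        # Taking a checkpoint starts a new epoch.
        self.epoch += 1

    def send(self, dest, payload):
        # Piggyback the sender's current epoch on the application payload.
        dest.inbox.append((self.epoch, payload))

    def recv(self):
        sender_epoch, payload = self.inbox.popleft()
        if sender_epoch < self.epoch:
            # Sent before our checkpoint but received after it: the message
            # crossed the checkpoint line, so keep a copy for recovery.
            self.log.append(payload)
        return payload


a = PiggybackEndpoint()
b = PiggybackEndpoint()

a.send(b, "m1")  # sent in epoch 0
b.checkpoint()   # b moves to epoch 1 while m1 is still in flight
a.checkpoint()
a.send(b, "m2")  # sent in epoch 1

received = [b.recv(), b.recv()]
print(received, b.log)
```

Without a piggybacking interface, the middleware would have to send the epoch as a separate message or copy it into the user buffer, which is where the performance concern comes from.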

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov 