<html>
<body>
<blockquote type=cite class=cite cite=""><font face="Calibri">This
implies that layered s/w (application or middleware) would be responsible
for regenerating any lost traffic, if this is needed. Perhaps it
would make sense to provide the ability for upper layers to register a
recovery function that could be called after repair() is done with
restoring MPI internal state.<br>
</font></blockquote>Would this function be called on the surviving
processes? I'm not sure that it is necessary. It seems that a simpler way
to go is to allow any subsequent MPI calls to block as necessary until
the appropriate repairs have been completed. This way the MPI
implementation can choose exactly when to do individual repairs and may
in fact choose to perform the repair between the time it detects the
problem and the time it informs the application.<br><br>
<blockquote type=cite class=cite cite=""><font face="Calibri">Greg, is
the piggy-back capability you are asking for intended to help getting the
application level communications reset to the state just before failure,
so that the application can continue ?</font></blockquote><br>
Yes, exactly. Piggybacking + non-blocking everything makes it possible
for middleware to coordinate checkpointing and log enough message and
non-determinism information to restore any needed communication. Having a
piggybacking API is necessary to ensure good performance. It is ok not to
have all operations be non-blocking but if that is the case, we'll need
good semantics on what happens when a process participates in a
communication that involves the failed process (I'll define
"good" later since I doubt the details matter now).<br>
<x-sigsep><p></x-sigsep>
Greg Bronevetsky<br>
Post-Doctoral Researcher<br>
1028 Building 451<br>
Lawrence Livermore National Lab<br>
(925) 424-5756<br>
bronevetsky1@llnl.gov</body>
</html>