<html>

<body>

<blockquote type=cite class=cite cite=""><font face="Calibri">This

implies that layered s/w (application or middleware) would be responsible

for regenerating any lost traffic, if this is needed.  Perhaps it

would make sense to provide the ability for upper layers to register a

recovery function that could be called after repair() is done with

restoring MPI internal state.<br>

</font></blockquote>Would this function be called on the surviving

processes? I'm not sure that it is necessary. It seems that a simpler way

to go is to allow any subsequent MPI calls to block as necessary until

the appropriate repairs have been completed. This way the MPI

implementation can choose exactly when to do individual repairs and may

in fact choose to perform the repair between the time it detects the

problem and the time it informs the application.<br><br>
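As a rough illustration of the "block until repaired" idea, here is a toy Python sketch. It does not use any real MPI API: ToyComm, mark_failure, and repairs_done are hypothetical names invented for the example, and the "repair" is just a counter. The only point is that the repair runs inside the next communication call, so the caller never registers or invokes a recovery function.<br>

```python
class ToyComm:
    """Hypothetical stand-in for an MPI communicator (not a real MPI API)."""

    def __init__(self):
        self.needs_repair = False  # set when a peer failure is detected
        self.repairs_done = 0      # counts how many repairs actually ran
        self.sent = []             # record of (dest, payload) pairs "sent"

    def mark_failure(self):
        # Assumption: the library has detected a failed peer.
        # Nothing is repaired yet; the work is deferred.
        self.needs_repair = True

    def _repair_if_needed(self):
        # Lazy repair: performed at most once per detected failure,
        # at a time of the library's choosing.
        if self.needs_repair:
            self.repairs_done += 1
            self.needs_repair = False

    def send(self, dest, payload):
        # Every communication call first completes any pending repair,
        # which models "subsequent calls block as necessary".
        self._repair_if_needed()
        self.sent.append((dest, payload))


comm = ToyComm()
comm.send(1, "a")    # no failure yet, no repair
comm.mark_failure()  # failure detected between calls
comm.send(1, "b")    # repair happens inside this call
```

In this sketch the application code is identical before and after the failure; whether a real implementation could hide repairs this completely is exactly the open question.<br><br>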

<blockquote type=cite class=cite cite=""><font face="Calibri">Greg, is

the piggy-back capability you are asking for intended to help getting the

application level communications reset to the state just before failure,

so that the application can continue ?</font></blockquote><br>

Yes, exactly. Piggybacking + non-blocking everything makes it possible

for middleware to coordinate checkpointing and log enough message and

non-determinism information to restore any needed communication. Having a

piggybacking API is necessary to ensure good performance. It is ok not to

have all operations be non-blocking, but if that is the case, we'll need

good semantics on what happens when a process participates in a

communication that involves the failed process (I'll define

"good" later since I doubt the details matter now).<br>

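To make the piggybacking point concrete, here is a minimal Python sketch of one way middleware might use it, assuming each message carries the sender's checkpoint epoch. Endpoint, take_checkpoint, and late_log are hypothetical names, not a proposed API, and real middleware would need to log more (for example, non-deterministic events). The sketch only shows how a piggybacked epoch lets the receiver recognize a message that crossed a checkpoint and must be logged for replay.<br>

```python
class Endpoint:
    """Hypothetical process endpoint that piggybacks its checkpoint epoch."""

    def __init__(self):
        self.epoch = 0      # number of checkpoints this process has taken
        self.late_log = []  # payloads that must be replayed after restart

    def take_checkpoint(self):
        # Assumption: a checkpoint simply advances the local epoch.
        self.epoch += 1

    def send(self, payload):
        # Piggyback the sender's current epoch on every message.
        return (self.epoch, payload)

    def receive(self, message):
        sender_epoch, payload = message
        # A message sent before the sender's checkpoint but received after
        # the receiver's checkpoint will not be resent on restart,
        # so the receiver logs it.
        if sender_epoch < self.epoch:
            self.late_log.append(payload)
        return payload


a, b = Endpoint(), Endpoint()
m1 = a.send("x")     # sent while a is in epoch 0
b.take_checkpoint()  # b moves to epoch 1 before receiving m1
got = b.receive(m1)  # late message: logged by b
a.take_checkpoint()  # a moves to epoch 1
m2 = a.send("y")     # sent while a is in epoch 1
b.receive(m2)        # same epoch on both sides: not logged
```

Without a piggybacking API, the epoch would have to travel in a separate message or be packed into the payload by hand, which is where the performance concern comes from.<br>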
<x-sigsep><p></x-sigsep>

Greg Bronevetsky<br>

Post-Doctoral Researcher<br>

1028 Building 451<br>

Lawrence Livermore National Lab<br>

(925) 424-5756<br>

bronevetsky1@llnl.gov</body>

</html>