[Mpi3-ft] Process failure document

Hoppe, Hans-Christian hans-christian.hoppe at intel.com
Wed Nov 5 11:02:14 CST 2008

I've added some comments to the document, which you'll find appended to
this message.
I agree with Rich that getting the model for collective comms right has
high priority, so it's OK to focus on this one.
Questions I'd put to the group are:
  - Is our model symmetric? That is, if 2n processes are split down the
    middle into two groups of n processes, will each group see the same
    errors and be able to recover?
  - Suppose we have processes A and B. A fails after a while, B notices
    and starts a repair that restores processes. In which state will a
    process A' be when it joins B after the repair? Will A' need to make
    any special repair call itself?

Hans-Christian Hoppe
Principal Engineer

Intel GmbH                           Phone: +49-2232-2090-11
Hermuelheimer Strasse 8a             Fax:   +49-2232-2090-29
50321 Bruehl, Germany              



From: mpi3-ft-bounces at lists.mpi-forum.org
[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Richard Graham
Sent: Tuesday, November 4, 2008 22:34
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: [Mpi3-ft] Process failure document

I have captured a lot of what we have discussed about process
fault-tolerance, and filled in more missing gaps to help move us along
a bit faster in our discussions.  Please take a look at the document
before the call tomorrow.  I would like to pick up discussing what to do
when collective communications fail.  There are still details missing
that need to be added.  No APIs at this stage, just the "model".  I ran
this past 3 different application groups today - this seems to be along
the lines of what they are looking for, and they had some very useful

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: MPI-FT-20081105.txt
URL: <http://lists.mpi-forum.org/pipermail/mpiwg-ft/attachments/20081105/48a7e592/attachment-0001.txt>