[Mpi3-ft] Process failure document
herault.thomas at gmail.com
Wed Nov 5 05:37:36 CST 2008
On 4 Nov 08, at 22:33, Richard Graham wrote:
> I have captured a lot of what we have discussed about process fault-
> tolerance, and filled in more missing gaps to help move us along a
> bit faster in our discussions. Please take a look at the document
> before the call tomorrow. I would like to pick up discussing what
> to do when collective communications fail. There are still details
> missing that need to be added. No APIs at this stage, just the
> “model”. I ran this past 3 different application groups today –
> this seems to be along the lines of what they are looking for, and
> they had some very useful comments...
Unfortunately, I cannot attend the conference call today.
Here are a couple of comments about the attached file.
> - Failures that can be corrected by the MPI implementation
> without input from the
> application are not considered errors from MPI's perspective.
> So a transient network
> failure, or a link failure that can be routed around are not
> considered failures.
I agree with this. I think it also implies that the following
link-failure scenario cannot happen: A <-> B not OK, A <-> C OK,
C <-> B OK, and A detects a failure of B, but not C. My point is that
if the application level is able to communicate information from A to B
through C, then the library can too, and this link failure should not
be reported as a process failure. Hence, process failures can
originate from link failures only if there is a real network partition
(except for the case of asymmetric failures).
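
The routing argument above can be sketched as a reachability check. This is
only an illustration under assumptions: the names `reachable` and
`should_report_failure` and the link table are hypothetical, not part of any
MPI interface, and only symmetric links are modelled. B is reported to A as
failed only when no path of working links connects them.

```python
from collections import deque


def reachable(links, src, dst):
    """Breadth-first search over working, symmetric links."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for a, b in links:
            # Links are undirected: follow them from either endpoint.
            if a == node:
                nxt = b
            elif b == node:
                nxt = a
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def should_report_failure(links, observer, peer):
    """Hypothetical rule: report a failure only on a real partition."""
    return not reachable(links, observer, peer)


# A <-> B is down, but A <-> C and C <-> B still work: route around it.
routed = [("A", "C"), ("C", "B")]
# A <-> B and C <-> B are both down: B is partitioned away from A.
partitioned = [("A", "C")]

print(should_report_failure(routed, "A", "B"))       # False
print(should_report_failure(partitioned, "A", "B"))  # True
```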
About asymmetric failures: do we have to consider this case? Does it
About Error States and communicators:
> - A communicator that has lost a process will be defined to be in an
> error state
> - A process moves into error state, if it tries to interact with a
> failed process
There should be a concept of communicator, and communicator handle:
this is what you refer to as communicator / local part of the
communicator in the last part of the file. Local parts of the
communicators should be allowed to be in a more inconsistent state:
the communicator handle enters the error state only when the process
it belongs to tries to interact with a failed process belonging to the
About Error notification:
- "synchronous" notification should also happen if A is waiting for
the completion of some communication involving B (receive from B, or
receive from ANY_SOURCE). In the case of MPI_ANY_SOURCE: the
notification should not be guaranteed: if A receives from ANY_SOURCE,
B fails and C sends a message to A, A should either enter the error
state and notify the error, or notify the reception of a message from
C. Both behaviors should be allowed. If no message arrives while A
is in the receive, the error must eventually be notified.
- "asynchronous" notification: we still need to discuss this. For me,
the standard should allow asynchronous notification to happen only
when exiting MPI calls (but high quality implementations can trigger
this at any time), but errors about any process on any communicator
can be notified when exiting any MPI call.
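
The ANY_SOURCE case in the first item can be sketched as a set of permitted
outcomes. This is a toy model, not an MPI API: `allowed_outcomes` and its
arguments are hypothetical names, and "wait" simply stands for the receive
staying blocked.

```python
def allowed_outcomes(pending_sources, failed):
    """Outcomes permitted for a receive from ANY_SOURCE posted by A.

    pending_sources: senders that have a matching message available
    failed: processes of the communicator that are known to have failed
    """
    # Delivering any pending message is always an acceptable outcome.
    outcomes = {("message", src) for src in pending_sources}
    # Once some process has failed, notifying the error is acceptable too.
    if failed:
        outcomes.add(("error", None))
    # With no message and no failure, the receive simply keeps waiting.
    if not outcomes:
        outcomes.add(("wait", None))
    return outcomes


# B has failed while a message from C is pending: either outcome is fine.
both = allowed_outcomes(pending_sources=["C"], failed={"B"})
# B has failed and nothing is pending: only the error notification remains.
only_error = allowed_outcomes(pending_sources=[], failed={"B"})
```

In this model the "eventually" requirement corresponds to the second case:
with a failed process and no incoming message, the error is the only outcome
left.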
> Error notification must be consistent, i.e. MPI must propagate a
> single view of the system,
> which may change over time.
Is this possible if we consider asymmetric failure cases?
About recovery and Point-to-Point:
> - caller specifies the mode of recovery - restore processes,
> replace missing processes with
> MPI_PROC_NULL, or eliminate the communicator.
Eliminating the communicator should be some kind of collective operation,
which contradicts the locality of the correction. I agree that the
recovery should be a local operation. Some of the recovery mechanisms
should be collective (like eliminating the communicator, maybe restoring
the process). Others should be local (replacing with MPI_PROC_NULL).
Two processes cannot decide differently on the recovery action for a
About Communicator life cycle:
> communications can proceed while the communicator is in
> an error state (Question:
> This could require checking error state all the time,
> so may need to think of a
> better way to do this).
Maybe restrict this to communications involving the failed processes, or:
no communications can proceed while the local part of the communicator
is in an error state.
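
The two alternatives can be compared with a small sketch. The function
`may_proceed` and its parameters are hypothetical and only illustrate the
difference between the two policies; they do not correspond to any MPI call.

```python
def may_proceed(peers, failed, local_error_state, restrict_to_failed_peers):
    """Decide whether a communication may go ahead on this communicator.

    peers: processes taking part in this particular communication
    failed: processes of the communicator known to have failed
    local_error_state: True if the local part of the communicator is in error
    restrict_to_failed_peers: True selects the first policy, False the second
    """
    if restrict_to_failed_peers:
        # Policy 1: block only communications that involve a failed process.
        return not (set(peers) & set(failed))
    # Policy 2: block everything while the local part is in an error state.
    return not local_error_state


# B has failed and A's local part of the communicator is in an error state.
# A wants to send to C, which is still alive.
print(may_proceed(["A", "C"], {"B"}, True, restrict_to_failed_peers=True))   # True
print(may_proceed(["A", "C"], {"B"}, True, restrict_to_failed_peers=False))  # False
```

Under the first policy the check depends on the peers of each communication;
under the second it is a single test of the local state.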