[Mpi3-ft] Process failure document

Thomas Herault herault.thomas at gmail.com
Wed Nov 5 05:37:36 CST 2008

On 4 Nov 2008, at 22:33, Richard Graham wrote:

> I have captured a lot of what we have discussed about process
> fault-tolerance, and filled in more missing gaps to help move us along
> a bit faster in our discussions.  Please take a look at the document
> before the call tomorrow.  I would like to pick up discussing what to
> do when collective communications fail.  There are still details
> missing that need to be added.  No APIs at this stage, just the
> “model”.  I ran this past 3 different application groups today –
> this seems to be along the lines of what they are looking for, and
> they had some very useful comments...
> Rich
> <process_failure_tech.txt>


Unfortunately, I cannot attend the conference call today.

Here are a couple of comments about the attached file.

>    - Failures that can be corrected by the MPI implementation without
>      input from the application are not considered errors from MPI's
>      perspective.  So a transient network failure, or a link failure
>      that can be routed around are not considered failures.

I agree with this. I think it also implies that the following
link-failure scenario cannot happen: A <-> B is not OK, A <-> C is OK,
C <-> B is OK, and A detects a failure of B but C does not. My point
is that if the application level is able to communicate information
from A to B through C, then the library can do so too, and this link
failure should not be reported as a process failure. Hence, process
failures can originate from link failures only if there is a real
network partition (except for the case of asymmetric failures).
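
To make the A/B/C scenario concrete, here is a minimal sketch of the rule
I have in mind. It is only a toy model: the function names and the
representation of links as pairs are my own assumptions, not anything from
the document, and links are treated as symmetric (asymmetric failures are
left out).

```python
from collections import deque

def reachable(links, src, dst):
    """Return True if dst can be reached from src over working links (BFS)."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for a, b in links:
            # Links are symmetric in this model: follow them in both directions.
            if a == node and b not in seen:
                nxt = b
            elif b == node and a not in seen:
                nxt = a
            else:
                continue
            seen.add(nxt)
            queue.append(nxt)
    return False

def should_report_process_failure(links, observer, target):
    """Hypothetical rule: report a failure only if no route at all remains."""
    return not reachable(links, observer, target)

# A <-> B is down, but A <-> C and C <-> B still work: no process failure.
print(should_report_process_failure({("A", "C"), ("C", "B")}, "A", "B"))  # False
# Real partition: B is cut off from A, so the failure is reported.
print(should_report_process_failure({("A", "C")}, "A", "B"))              # True
```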

About asymmetric failures: do we have to consider this case? Does it  

About Error States and communicators:
>   - A communicator that has lost a process will be defined to be in an
>     error state
>   - A process moves into error state, if it tries to interact with a
>     failed process

There should be a concept of communicator and of communicator handle:
this is what you refer to as communicator / local part of the
communicator in the last part of the file. Local parts of
communicators should be allowed to be in a more inconsistent state:
the communicator handle enters the error state only when the process
it belongs to tries to interact with a failed process belonging to the
same communicator.
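
As an illustration of the handle idea, here is a small sketch. The class
CommHandle and its method are hypothetical names for this model only; no
API is being proposed.

```python
class CommHandle:
    """Hypothetical local view of a communicator held by one process."""

    def __init__(self, owner, members):
        self.owner = owner
        self.members = set(members)
        self.in_error = False

    def interact(self, peer, failed):
        """Model a send/receive with `peer`; `failed` is the set of dead ranks."""
        if peer not in self.members:
            raise ValueError("peer is not in this communicator")
        if peer in failed:
            self.in_error = True   # only this local handle changes state
            return "error"
        return "ok"

failed = {2}                      # rank 2 has failed
h0 = CommHandle(0, [0, 1, 2])     # rank 0's handle
h1 = CommHandle(1, [0, 1, 2])     # rank 1's handle on the same communicator
h0.interact(1, failed)            # rank 0 talks to live rank 1
h1.interact(2, failed)            # rank 1 talks to failed rank 2
print(h0.in_error, h1.in_error)   # False True
```

The two handles refer to the same communicator but are in different
states, which is the "more inconsistent" local state described above.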

About Error notification:
- "synchronous" notification should also happen if A is waiting for
the completion of some communication involving B (a receive from B, or
a receive from MPI_ANY_SOURCE). In the MPI_ANY_SOURCE case, the
notification should not be guaranteed: if A receives from
MPI_ANY_SOURCE, B fails, and C sends a message to A, then A should
either enter the error state and report the error, or report the
reception of the message from C. Both behaviors should be allowed. If
no message arrives while A is in the receive, the error must
eventually be reported.

- "asynchronous" notification: we still need to discuss this. In my
view, the standard should allow asynchronous notification to happen
only when exiting MPI calls (but high-quality implementations can
trigger this at any time), but errors about any process on any
communicator can be reported when exiting any MPI call.
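
The two notification rules above can be sketched as follows. This is a toy
model under my own assumptions: recv_any, Process, the prefer_error switch
and the "waiting" result are hypothetical names, and the model does not
call any real MPI function.

```python
def recv_any(inbox, failed_peers, prefer_error=False):
    """Model one wildcard receive on a communicator.

    inbox:        list of (source, payload) messages that have already arrived
    failed_peers: peers of this communicator known to have failed
    prefer_error: which of the two permitted outcomes the implementation picks
    """
    if inbox and not (failed_peers and prefer_error):
        return ("message", inbox.pop(0))
    if failed_peers:
        # No deliverable message (or the implementation prefers the error).
        return ("error", sorted(failed_peers))
    return ("waiting", None)   # nothing arrived, nobody failed


class Process:
    """Model of asynchronous notification delivered only at call exit."""

    def __init__(self):
        self.pending = []      # failures detected but not yet reported

    def detect(self, comm, rank):
        self.pending.append((comm, rank))

    def mpi_call(self, work):
        result = work()
        # On exit of any call, report failures on any communicator.
        reported, self.pending = self.pending, []
        return result, reported


print(recv_any([("C", "hello")], {"B"}))                     # ('message', ('C', 'hello'))
print(recv_any([("C", "hello")], {"B"}, prefer_error=True))  # ('error', ['B'])
print(recv_any([], {"B"}))                                   # ('error', ['B'])

p = Process()
p.detect("comm_x", 3)
print(p.mpi_call(lambda: "sent"))   # ('sent', [('comm_x', 3)])
print(p.mpi_call(lambda: "sent"))   # ('sent', [])
```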

>  Error notification must be consistent, i.e. MPI must propagate a
>  single view of the system, which may change over time.

Is this possible if we consider asymmetric failure cases?

About recovery and Point-to-Point:
>        - caller specifies the mode of recovery - restore processes,
>          replace missing processes with MPI_PROC_NULL, or eliminate
>          the communicator.
Eliminating the communicator should be some kind of collective
operation, which contradicts the locality of the correction. I agree
that the recovery should be a local operation. However, some of the
recovery mechanisms should be collective (like eliminating the
communicator, and maybe restoring the process). Others should be local
(replacing with MPI_PROC_NULL).

Two processes cannot decide differently on the recovery action for the
same failure.
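
A rough sketch of this consistency requirement, with the local/collective
split from the previous paragraph. The mode names and the function
agree_on_recovery are hypothetical, and how the choices would actually be
gathered for a purely local mode is exactly the open question.

```python
LOCAL_MODES = {"proc_null"}                  # replace with MPI_PROC_NULL
COLLECTIVE_MODES = {"eliminate", "restore"}  # eliminate comm., restore process

def agree_on_recovery(choices):
    """choices maps rank -> recovery mode chosen for one given failure.

    Returns (mode, is_collective) if all ranks agree, raises otherwise.
    """
    modes = set(choices.values())
    if len(modes) != 1:
        raise ValueError("inconsistent recovery choices: %s" % sorted(modes))
    mode = modes.pop()
    if mode not in LOCAL_MODES | COLLECTIVE_MODES:
        raise ValueError("unknown recovery mode: %s" % mode)
    return mode, mode in COLLECTIVE_MODES

print(agree_on_recovery({0: "proc_null", 1: "proc_null"}))  # ('proc_null', False)
print(agree_on_recovery({0: "eliminate", 1: "eliminate"}))  # ('eliminate', True)
```

Mixed choices such as {0: "proc_null", 1: "eliminate"} raise an error in
this model, since the two processes would otherwise see different
communicators after the same failure.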

About Communicator life cycle:
>  No communications can proceed while the communicator is in an error
>  state (Question: This could require checking error state all the
>  time, so may need to think of a better way to do this).

Maybe restrict this to communications involving the failed processes,
or: no communications can proceed while the local part of the
communicator is in an error state.
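
The two alternatives can be compared with a small sketch. The policy names
and the function may_communicate are hypothetical and only serve to show
where the two rules differ.

```python
def may_communicate(peer, failed, local_handle_in_error, policy):
    """Decide whether a point-to-point operation with `peer` may proceed."""
    if policy == "block_failed_peers_only":
        # Alternative 1: only operations touching a failed process are blocked.
        return peer not in failed
    if policy == "block_if_local_handle_in_error":
        # Alternative 2: everything is blocked once the local part is in error.
        return not local_handle_in_error
    raise ValueError("unknown policy: %s" % policy)

failed = {2}
# The local handle is already in error; the target is live rank 1.
print(may_communicate(1, failed, True, "block_failed_peers_only"))         # True
print(may_communicate(1, failed, True, "block_if_local_handle_in_error"))  # False
```

Neither rule needs a global check of the communicator state in this model:
the first needs knowledge of the failed peers, the second only a local flag.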

