Purpose:

This document describes how MPI will respond when a process either fails, or appears to have failed, for example due to a broken network connection.

Use case scenarios:
- Symmetric failure: one process out of N is no longer responsive.
- Asymmetric failure: communications A->B succeed, but B->A do not.

Errors:
- Error detection is external to the MPI specification and is implementation specific.
- Failures that the MPI implementation can correct without input from the application are not considered errors from MPI's perspective. So a transient network failure, or a link failure that can be routed around, is not considered a failure.
- An MPI implementation must support detecting the following errors:
  - Process failure: a process is considered failed if it is non-responsive, i.e., failed or disconnected.
  - Communication failure.

Error states:
- Error state is associated with a communicator.
- A communicator that has lost a process is defined to be in an error state.
- A process moves into an error state if it tries to interact with a failed process.
- When an MPI process is restored, it is restored in an error state.
- Each process that is in an error state must undertake an MPI repair action.
- Recovery from an error state must be done one communicator at a time.

Error notification:

By default, only calls in which the failed remote process is involved will be notified of the error, and that notification is synchronous. So if B fails, A will be notified when:
- a send to B is attempted
- a receive from B is posted
- a receive from MPI_ANY_SOURCE is posted
- a put to B is attempted
- a get from B is attempted
- a collective operation is posted
- collective I/O is attempted
- a new communicator is created in which B is a member

A process can subscribe to notification for events it would not be notified of by default. This notification will be asynchronous.

Error notification must be consistent, i.e., MPI must propagate a single view of the system, which may change over time.

Recovery process:

The overall approach is to keep recovery local to the processes directly affected by the failure, and to minimize global recovery.

Point-to-point communications:
- Recovery is local, i.e., not collective.
- When the application is notified of a failure, it must call the recovery function if it wants to continue using MPI (see the sketch after this list).
- The recovery function will have the following attributes:
  - It is called on a per-communicator basis.
  - The caller specifies the mode of recovery: restore processes, replace missing processes with MPI_PROC_NULL, or eliminate the communicator.
  - The caller specifies what to do with outstanding point-to-point traffic: preserve traffic with unaffected ranks, or remove all traffic queued on the given communicator. (There is a potential race condition in the latter case if this communicator will continue to send data - this needs more thought.)
  - The recovery function may handle recovery for as many failed processes as it is aware of at a given time. This implies that when a subsequent MPI call hits an error condition that has not yet been cleared and returns an error (this is a requirement), the subsequent call to the recovery function will not restore a process that has already been restored.
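To make the point-to-point flow concrete, here is a minimal sketch. Only MPI_Comm_set_errhandler, MPI_ERRORS_RETURN, MPI_Send and MPI_Recv are standard MPI; the recovery routine is a placeholder invented for this sketch, since the proposal above does not yet fix its name or signature.

    #include <mpi.h>

    /* Placeholder for the proposed per-communicator recovery call described
     * above.  The name and the argument-free form are invented for this
     * sketch; the real call would let the caller pick the recovery mode
     * (restore, MPI_PROC_NULL, eliminate) and the handling of outstanding
     * traffic. */
    static int recover_communicator(MPI_Comm comm)
    {
        (void)comm;   /* implementation-provided recovery would happen here */
        return MPI_SUCCESS;
    }

    int main(int argc, char **argv)
    {
        int rank, rc, token = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Errors must be returned rather than aborted on (see "Error return
         * codes" below) for recovery to be possible at all. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        if (rank == 0) {
            /* Synchronous notification: rank 0 learns of rank 1's failure
             * only because this send involves the failed process. */
            rc = MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            if (rc != MPI_SUCCESS) {
                /* Local, non-collective repair of this communicator before
                 * attempting any further communication on it. */
                recover_communicator(MPI_COMM_WORLD);
            }
        } else if (rank == 1) {
            rc = MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }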
Collective communications:
- Recovery is collective.
- The collective call will return an error code that indicates the known state of the communicator; the failure of any process in the communicator puts it into an error state.
- By default, the error code is generated based on local information, so if the collective completes successfully locally but another process has failed, MPI_SUCCESS will be returned.
- An "attribute" can be set on a per-communicator basis that requires the communicator to verify that the collective has completed successfully on a global basis. This does not require a two-phase collective, i.e., data may be written to the destination buffer before global verification has occurred, but completion cannot occur until the global check has occurred.
- MPI should "interrupt" outstanding collectives and return an error code if it detects an error. MPI should also remove all "traces" of the failed collectives.
- MPI should handle outstanding traffic associated with the failed processes as described in the point-to-point recovery section.
- MPI will provide a collective (?) function to check the state of the communicator, as a way to verify global state, something like mpi_fault_status(comm, array_of_status_codes) or MPI_Comm_validate(comm). A usage sketch appears at the end of this document.

Process recovery:

Communicator life cycle:
- A communicator starts its life with N processes and in an ACTIVE state.
- Once a process fails, the communicator enters an error state.
- While in an error state:
  - Point-to-point communications:
    - If the failed process is not involved in the communication, the local part of the communicator does NOT enter an error state, and communication continues as though no error had occurred.
    - If the failed process IS involved in the communication, the local part of the communicator IS marked as being in an ERROR state. The application must call the MPI recovery function to restore the state of the communicator to ACTIVE. No communication can proceed while the communicator is in an error state. (Question: this could require checking the error state all the time, so we may need to think of a better way to do this.)
  - Collective communications:
    - Any process failure with outstanding communications puts all parts of the communicator into an error state. A collective MPI recovery function must be called to restore the state of the communicator. This gives the implementation the ability to perform any optimization it typically does for collective operations when communicators are constructed.

Error return codes:
- The application needs to set MPI_ERRORS_RETURN in all processes if it is to take advantage of MPI's error recovery capabilities.
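To close, a minimal sketch of the collective case under the same assumptions: the validation routine below stands in for the proposed MPI_Comm_validate / mpi_fault_status call and is not standard MPI; it is stubbed out here only so the example compiles.

    #include <mpi.h>

    /* Stand-in for the proposed MPI_Comm_validate / mpi_fault_status call
     * sketched above.  The real routine would be collective and would return
     * the globally agreed state of comm. */
    static int MPIX_Comm_validate(MPI_Comm comm)
    {
        (void)comm;
        return MPI_SUCCESS;
    }

    int main(int argc, char **argv)
    {
        int value = 0, rc;

        MPI_Init(&argc, &argv);

        /* MPI_ERRORS_RETURN must be set in all processes (see "Error return
         * codes" above) so that failures are reported instead of aborting. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* By default the return code reflects only local knowledge: it can be
         * MPI_SUCCESS here even though another rank has already failed. */
        rc = MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* When a globally consistent answer is needed, every rank calls the
         * proposed validation routine on the same communicator. */
        if (rc != MPI_SUCCESS ||
            MPIX_Comm_validate(MPI_COMM_WORLD) != MPI_SUCCESS) {
            /* The communicator is in an error state; collective recovery is
             * required before it can be used again. */
        }

        MPI_Finalize();
        return 0;
    }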