Purpose:
  This document describes how MPI will respond when a process either fails, or appears to have failed, for example because of a broken network connection.

Use case scenarios:
  - Symmetric failure: one process out of N is no longer responsive.
  - Asymmetric failure: communications A->B work, but B->A do not.
  HCH> Are these all relevant scenarios?

Errors:
  - Error detection is external to the MPI specification, and is implementation specific.
  - Failures that the MPI implementation can correct without input from the application are not considered errors from MPI's perspective. So a transient network failure, or a link failure that can be routed around, is not considered a failure.
  HCH> Should list what *are* failures then - death of a process, and complete partitioning of the network, would count as failures; death of a link would not, unless the failed link is the only one available ...
  - An MPI implementation must support detecting the following errors:
    - process failure: a process is considered failed if it is non-responsive - failed or disconnected
    - communications failure
    HCH> What is a communications failure? A broken link or a timeout?

Error states:
  HCH> So we have at least two levels of error states here: communicators can have global (silent?) errors, and individual processes/communicators encounter "acute" errors.
  - An error state is associated with a communicator.
  - A communicator that has lost a process is defined to be in an error state.
  - A process moves into an error state if it tries to interact with a failed process.
  - When an MPI process is restored, it is restored in an error state.
  - Each process that is in an error state must undertake an MPI repair action.
  - Recovery from an error state must be done one communicator at a time.
  HCH> So repair needs to be done per communicator and per process?

Error notification:
  By default, only calls in which the failed remote process is involved will be notified of the error, and that notification is synchronous. So if B fails, A will be notified when:
  - a send to B is attempted
  HCH> What happens for nonblocking ops - in the isend(), or in the related wait()?
  - a receive from B is posted
  - a receive from MPI_ANY_SOURCE is posted
  - a put to B is issued
  - a get from B is issued
  - a collective operation is posted
  - collective I/O is attempted
  - a new communicator is created of which B is a member

  A process can subscribe for notification of events that it would not be notified of by default. This notification will be asynchronous.
  HCH> OK, this is similar to PVM's notify functionality.

  Error notification must be consistent, i.e. MPI must propagate a single view of the system, which may change over time.
  HCH> This can't really be true even for synchronous notification, since processes that don't interact with a failed process will not see these failures ... we need care to define what is *really* required here.

Recovery process:
  The overall approach is to keep recovery local to the processes directly affected by the failure, and to minimize global recovery.

Point-to-point communications:
  - Recovery is local, i.e., not collective.
  HCH> But error states belong to communicators, and have to be consistent across all procs -> can we always do the repair in a purely local way?
  - When the application is notified of a failure, the recovery function must be called if the application wants to continue using MPI (see the sketch at the end of this section).
  - The recovery function will have the following attributes:
    - it is called on a per-communicator basis
    - the caller specifies the mode of recovery: restore the failed processes, replace missing processes with MPI_PROC_NULL, or eliminate the communicator
    - the caller specifies what to do with outstanding point-to-point traffic: preserve traffic with unaffected ranks, or remove all traffic queued on the given communicator (we do have a potential race condition in the latter case, if this communicator continues to send data - need to think more about this)
    - the recovery function may handle recovery for as many failed processes as it is aware of at a given time. This implies that when subsequent MPI calls hit error conditions that have not yet been cleared and return an error (this is a requirement), the subsequent call to the recovery function will not restore a process that has already been restored.
  HCH> We have procs A, B, and C. A fails, B sends to A and gets an error; B repairs, then sends to C; C receives from B and sends to A -> will C see an error? Will C need to do a repair? Will C be *able* to do a repair that is the same as (or different from) B's?
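  A minimal sketch of the point-to-point flow above, assuming MPI_ERRORS_RETURN is in effect. The recovery call (MPIX_Comm_recover) and its mode constant are placeholders for the recovery function this draft has not yet named; they are not existing MPI API.

      #include <mpi.h>

      /* Placeholder for the per-communicator recovery function discussed in
       * this draft; name, signature, and mode values are assumptions only. */
      #define MPIX_RECOVER_PROC_NULL 1  /* replace failed ranks with MPI_PROC_NULL */
      int MPIX_Comm_recover(MPI_Comm comm, int mode);

      static void send_with_recovery(MPI_Comm comm, int dest, int *buf, int count)
      {
          /* MPI_ERRORS_RETURN makes failures surface as return codes
           * instead of aborting the job. */
          MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

          int rc = MPI_Send(buf, count, MPI_INT, dest, /* tag */ 0, comm);
          if (rc != MPI_SUCCESS) {
              /* The local part of the communicator is now in an error state;
               * no traffic on it may proceed until it has been repaired. */
              MPIX_Comm_recover(comm, MPIX_RECOVER_PROC_NULL);
          }
      }

  Note that only the rank that observed the error makes the recovery call, matching the "recovery is local, not collective" attribute above.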
Collective communications:
  - Recovery is collective.
  - A collective call will return an error code that indicates the known state of the communicator, with the failure of any process in the communicator putting it into an error state.
  - By default, the error code is generated based on local information, so if the collective completes successfully locally but another process has failed, MPI_SUCCESS will be returned.
  HCH> OK. But we are not globally consistent anymore!
  - An "attribute" can be set on a per-communicator basis that requires the communicator to verify that the collective has completed successfully on a global basis. This does not require a two-phase collective, i.e. data may be written to the destination buffer before global verification has occurred, but completion cannot occur until the global check has occurred.
  HCH> Which in effect means that there needs to be a sync point at the end ...
  - MPI should "interrupt" outstanding collectives and return an error code if it detects an error. MPI should also remove all "traces" of the failed collectives.
  HCH> Would that require reverting any application-visible changes to buffers??
  - MPI should handle outstanding traffic associated with the failed processes, as described in the point-to-point recovery section.
  - MPI will provide a collective (?) function to check the state of the communicator, as a way to verify the global state, something like mpi_fault_status(comm, array_of_status_codes) or MPI_Comm_validate(comm); a sketch of how such a check could wrap a collective follows below.
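  A sketch of how a per-communicator global check might wrap a collective. MPI_Comm_validate is only proposed above, and MPIX_Comm_recover_coll stands in for the collective recovery function; their prototypes here are assumptions for illustration, not existing MPI API.

      #include <mpi.h>

      /* Proposed / placeholder interfaces from this draft - assumptions only. */
      int MPI_Comm_validate(MPI_Comm comm);       /* global state check (collective) */
      int MPIX_Comm_recover_coll(MPI_Comm comm);  /* collective repair of the communicator */

      static int allreduce_checked(MPI_Comm comm, int *in, int *out, int count)
      {
          int rc = MPI_Allreduce(in, out, count, MPI_INT, MPI_SUM, comm);

          /* By default rc reflects only local knowledge: MPI_SUCCESS here does
           * not guarantee that every rank completed the collective. */
          if (rc == MPI_SUCCESS)
              rc = MPI_Comm_validate(comm);

          if (rc != MPI_SUCCESS) {
              /* Collective recovery is itself collective: every surviving rank
               * of the communicator must make the matching call. */
              MPIX_Comm_recover_coll(comm);
          }
          return rc;
      }

  The explicit validation step is effectively the sync point noted in the HCH comment above: data may already be in the destination buffers, but the caller cannot treat the operation as complete until the global check returns.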
Process recovery:
  HCH> What will a process that is being recovered need to do? Which state will a process be recovered into? The only well-defined state would be a complete restart, but then you lose all non-predefined communicators ...

Communicator life cycle:
  - A communicator starts its life with N processes, and in an ACTIVE state.
  HCH> The predefined communicators start that way; any others come out of global communicator creation routines. OK.
  - Once a process fails, the communicator enters an error state.
  HCH> In the model, yes; in an implementation, this state will be recognized "later", e.g. when the implementation needs to contact/communicate with said process ...
  - When in an error state:
    - Point-to-point communications:
      - If the failed process is not involved in the communication, the local part of the communicator will NOT enter an error state, and communications will continue as though no error had occurred.
      - If the failed process IS involved in the communication, the local part of the communicator WILL be marked as being in an ERROR state. The application must call the MPI recovery function to restore the state of the communicator to ACTIVE. No communications can proceed while the communicator is in an error state. (Question: this could require checking the error state all the time, so we may need to think of a better way to do this.)
      HCH> But that's not a new requirement - when you don't check the error state, anything can go wrong ... we might want to provide a "mode" where such an error terminates the process/app, to avoid deadlocks or unneeded resource use by unaware apps ...
    - Collective communications:
      - Any process failure with outstanding communications puts all parts of the communicator in an error state. A collective MPI recovery function must be called to restore the state of the communicator. This gives the implementation the ability to perform any optimization it typically does for collective operations when communicators are constructed.
      HCH> But didn't you state above that some procs could succeed in a collective call?

Error return codes:
  - The application needs to set MPI_ERRORS_RETURN in all processes if it is to take advantage of MPI's error recovery capabilities (see the example below).
  HCH> We might want to add something like MPI_RECOVERABLE_ERRORS_RETURN to gracefully react to errors that you can recover from, like the process failure we are discussing.
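  For completeness, a self-contained example of the MPI_ERRORS_RETURN requirement using only existing MPI calls; the branch after the broadcast marks the point where the recovery function described above would eventually be invoked.

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);

          /* The default handler (MPI_ERRORS_ARE_FATAL) aborts the job on the
           * first failure, so no recovery would ever be possible without this. */
          MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

          int rank, value = 0;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          int rc = MPI_Bcast(&value, 1, MPI_INT, /* root */ 0, MPI_COMM_WORLD);
          if (rc != MPI_SUCCESS) {
              char msg[MPI_MAX_ERROR_STRING];
              int len;
              MPI_Error_string(rc, msg, &len);
              fprintf(stderr, "rank %d: broadcast failed: %s\n", rank, msg);
              /* Per this draft, the application would call the recovery
               * function here before issuing any further MPI calls. */
          }

          MPI_Finalize();
          return 0;
      }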