Purpose:
********

This is intended to describe how MPI will respond when a process either
fails, or appears to have failed, due to a broken network connection.

Use case scenarios:
*******************

- Symmetric failure: one process out of N is no longer responsive.
- Asymmetric failure: communications A->B are ok, but B->A are not.

Notes:
- Loss of communication is one indication of process "failure" from MPI's
  perspective, even if the remote process is still alive.
- It is the responsibility of the MPI implementation to propagate a
  consistent view of the state of the parallel job to the application. So,
  in the case of the asymmetric failure, either all processes must be
  notified that A has failed, or no process failure notification is
  delivered at all. The implementation will need to decide how to handle
  the asymmetric failure, and whether it should migrate/terminate process
  A, or just continue as is. A decision to move the process needs no user
  notification.

Errors:
*******

- Error detection is external to the MPI specification, and is
  implementation specific.
- Failures that can be corrected by the MPI implementation without input
  from the application are not considered errors from MPI's perspective.
  So a transient network failure, or a link failure that can be routed
  around, is not considered a failure.
- An MPI implementation must support detecting the following errors:
  - process failure: a process is considered failed if it is
    non-responsive - failed or disconnected
  - communications failure

Error states:
*************

- Error state is associated with a communicator.
- A communicator that has lost a process is defined to be in an error
  state.
- A process moves into an error state if it tries to interact with a
  failed process.
- When an MPI process is restored, it is restored in an error state.
- Each process that is in an error state must undertake an MPI repair
  action.
- Recovery from an error state must be done one communicator at a time.

Error notification:
*******************

By default, errors are reported synchronously, and only through calls in
which the failed remote process is involved. So if B fails, A will be
notified when:
- a send to B is attempted
- a receive from B is posted
- a receive from MPI_ANY_SOURCE is posted
- a put to B is issued
- a get from B is issued
- a collective operation is posted
- collective I/O is attempted
- a new communicator is created that includes B

A process can subscribe for notification of events it would not be
notified of by default; such notification will be asynchronous. In
particular, a process can subscribe for asynchronous notification of
collective failure.
Note: how do we make sure this notification occurs only once? How do we
associate this call with previously called, or yet to be called,
collectives, to ensure consistency across the application?

Error notification must be consistent, i.e. MPI must propagate a single
view of the system, which may change over time. There may be a time lag in
this information getting out to all processes.

New error codes will need to be added to reflect the errors being handled
by the implementation.

To obtain details on error states, a user-defined callback will need to be
registered with the MPI library. Registration characteristics (a sketch of
a possible registration interface follows this list):
- registered per communicator
- registered per error class
- provides error context
- performs process-local work only
- called only on error, and only when registered
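As an illustration only: no such registration interface exists in the MPI
standard, and the MPIX_-prefixed names below (MPIX_Error_context,
MPIX_Comm_register_error_cb, MPIX_ERR_CLASS_PROC_FAIL) are invented for
this sketch. It shows the shape implied by the list above: a callback
registered per communicator and per error class, receiving an error
context, and doing process-local work only.

  /* Hypothetical illustration only: the MPIX_* names are invented for
   * this sketch and are not part of any MPI standard or implementation. */
  #include <mpi.h>
  #include <stdio.h>

  /* Assumed opaque handle carrying error details (e.g. the failed rank). */
  typedef struct MPIX_Error_context MPIX_Error_context;

  /* Assumed accessor for the error context - name is a placeholder.      */
  int MPIX_Error_context_get_rank(const MPIX_Error_context *ctx,
                                  int *failed_rank);

  /* Assumed registration call: per communicator, per error class.        */
  typedef void (*MPIX_Error_cb)(MPI_Comm comm, int error_class,
                                const MPIX_Error_context *ctx,
                                void *user_data);
  int MPIX_Comm_register_error_cb(MPI_Comm comm, int error_class,
                                  MPIX_Error_cb cb, void *user_data);

  /* Example callback: process-local work only - record which rank failed
   * so the main path can decide whether to invoke recovery later.  No MPI
   * communication and no recovery is initiated from inside the callback,
   * per the restrictions listed above.                                   */
  static void proc_fail_cb(MPI_Comm comm, int error_class,
                           const MPIX_Error_context *ctx, void *user_data)
  {
      int *failed_rank = (int *)user_data;      /* process-local state    */
      MPIX_Error_context_get_rank(ctx, failed_rank);
      fprintf(stderr, "notified: rank %d failed (class %d)\n",
              *failed_rank, error_class);
  }

  /* Registration, e.g. right after the communicator is created:
   *
   *   static int last_failed = -1;
   *   MPIX_Comm_register_error_cb(comm, MPIX_ERR_CLASS_PROC_FAIL,
   *                               proc_fail_cb, &last_failed);
   */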
The callback will be called before the MPI call returns. It is intended to
provide a way to give error information back to the application in a
thread-safe manner, while still allowing current applications to run
unchanged.

Error classes:
**************

The following are the types of errors that an application can subscribe
to:
- process failure
- network failure
- file system failure
- network degradation

The error implications are implementation specific, and depend on how the
implementation handles these errors.

Recovery process:
*****************

The overall approach is to keep recovery as local as possible to the
processes directly affected by the failure, and to minimize global
recovery.

Point-to-point communications:
- Recovery is local, i.e. not collective.
- When the application is notified of a failure, the recovery function
  must be called if the application wants to continue using MPI.
- The recovery function will have the following attributes:
  - called on a per-communicator basis (there will be an option to
    aggregate recovery across communicators)
  - the caller specifies the mode of recovery: restore processes, replace
    missing processes with MPI_PROC_NULL, or eliminate the communicator
  - the caller specifies what to do with outstanding point-to-point
    traffic: preserve traffic with unaffected ranks, or remove all traffic
    queued on the given communicator (there is a potential race condition
    in the last case, if this communicator will continue to send data -
    need to think more about this)
  - the recovery function may handle recovery for as many failed processes
    as it is aware of at a given time. This implies that when subsequent
    MPI calls hit error conditions that have not yet been cleared and
    return an error (this is a requirement), the subsequent call to the
    recovery function will not restore a process that has already been
    restored.
- The concept of an epoch does NOT exist.

Collective communications:
- Recovery is collective.
- The collective call will return an error code that indicates the known
  state of the communicator, with failure of any process in the
  communicator putting it into an error state.
- By default, the error code will be generated based on local information,
  so if the collective completes successfully locally but another process
  has failed, MPI_SUCCESS will be returned.
- An "attribute" can be set on a per-communicator basis that requires the
  communicator to verify that the collective has completed successfully on
  a global basis. This does not require a two-phase collective, i.e. data
  may be written to the destination buffer before global verification has
  occurred, but completion can't occur until the global check has
  occurred.
- MPI should "interrupt" outstanding collectives, and return an error code
  if it detects an error. MPI should also remove all "traces" of the
  failed collectives.
- MPI should handle outstanding traffic associated with the failed
  processes, as described in the point-to-point recovery section.
- MPI will provide a collective (?) function to check the state of the
  communicator, as a way to verify global state, something like
  mpi_fault_status(comm, array_of_status_codes) or MPI_Comm_validate(comm)
  (a usage sketch appears below).
- The concept of an epoch is supportable.

All repaired communicators need to be consistent after repair, that is,
the state of process A is the same in all communicators of which this
process is a member.
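As an illustration of the collective-recovery flow above: with
MPI_ERRORS_RETURN in effect, a collective returns a locally generated
error code, and the application can then fall back to the proposed global
check. MPI_Comm_validate() and MPI_Recover_collective() are the names used
in these notes and are not standard MPI; the one-argument prototypes below
are assumptions made for the sketch.

  /* Sketch only: MPI_Comm_validate() and MPI_Recover_collective() are the
   * functions proposed in these notes; the prototypes are assumed here
   * for illustration and are not standard MPI.                           */
  #include <mpi.h>

  int MPI_Comm_validate(MPI_Comm comm);      /* assumed: global state check */
  int MPI_Recover_collective(MPI_Comm comm); /* assumed: collective repair  */

  /* Perform an allreduce and drive the proposed recovery path on failure. */
  static int checked_allreduce(double *local, double *global, int count,
                               MPI_Comm comm)
  {
      /* By default the error code reflects local knowledge only: the call
       * may return MPI_SUCCESS here even though another rank has failed. */
      int rc = MPI_Allreduce(local, global, count, MPI_DOUBLE, MPI_SUM,
                             comm);

      if (rc != MPI_SUCCESS) {
          /* The communicator is now in an error state; no further traffic
           * can proceed on it until the collective repair completes.     */
          return MPI_Recover_collective(comm);
      }

      /* Optional stronger guarantee: verify global completion explicitly,
       * as an alternative to the per-communicator "global verification"
       * attribute described above.                                       */
      return MPI_Comm_validate(comm);
  }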
Process recovery:
*****************

The user can specify the following information on a per-communicator
basis:
- The list of vital ranks, i.e. the ranks without which the application
  can't continue, and whose failure the application can't recover from
  even if they are restored.
- The minimum communicator size still usable by the application.
- The replacement policy (the implementation must return an error in the
  recovery process if conflicting policies prevent consistent recovery).

These would need to be set at communicator creation time (a sketch of one
possible way to do this appears below). By default, the communicator will
use the policy set by the parent communicator. In the case of
intercommunicators, the default policy will be the one set by the group
associated with the parent communicator.

The user may specify the following information on a global basis:
- The list of vital ranks within MPI_COMM_WORLD, i.e. the ranks without
  which the application can't continue, and whose failure the application
  can't recover from even if they are restored.
- The minimum communicator size still usable by the application.
- The replacement policy.

Default policies:
- No vital ranks.
- Minimum communicator size: greater than zero.
- Replacement policy: restore failed processes.

The recovery process will be invoked by the MPI_Recover() function for
local recovery after a point-to-point communications failure, and by the
MPI_Recover_collective() function for recovery after a collective
communications failure. The recovery policy is associated with the
communicator, so it need not be specified when initiating recovery.

Based on the recovery policy, the recovery routines may:
- Restart failed processes. A single call may restart multiple processes.
  If a process was already restarted by another call to the recovery
  function (either by another process, or by a previous call to a recovery
  function), another process with the same rank will not be created.
- Assign the restarted process a rank in the communicator.
- Assign MPI_PROC_NULL to the processes that will not be restored.

Note: How should MPI be initialized? Should we have MPI_Init() called by
the application? The implementation can set flags so it knows a restart is
in progress. Do we notify the application it is being restarted? If so,
how?

Communicator life cycle:
************************

- A communicator starts its life with N processes, and in an ACTIVE state.
- Once a process fails, the communicator enters an error state.
- When in an error state:
  - Point-to-point communications:
    - If the failed process is not involved in the communications, the
      local part of the communicator will NOT enter an error state, and
      communications will continue as though no error has occurred.
    - If the failed process IS involved in the communications, the local
      part of the communicator WILL be marked as being in an ERROR state.
      The application must call the MPI recovery function to restore the
      state of the communicator to ACTIVE (a usage sketch appears below).
      No communications can proceed while the communicator is in an error
      state. (Question: this could require checking the error state all
      the time, so we may need to think of a better way to do this.)
  - Collective communications:
    - Any process failure with outstanding communications puts all parts
      of the communicator in an error state. A collective MPI recovery
      function must be called to restore the state of the communicator.
      This gives the implementation the ability to perform any
      optimization it typically does for collective operations when
      communicators are constructed.
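The notes leave open how the per-communicator policy (vital ranks, minimum
size, replacement policy) is actually expressed. Purely as an
illustration, the sketch below assumes a hypothetical
MPIX_Comm_set_recovery_policy call applied immediately after communicator
creation; the structure fields mirror the three policy items listed above,
and the enum values mirror the recovery modes and defaults described in
these notes.

  /* Hypothetical illustration: MPIX_Recovery_policy and
   * MPIX_Comm_set_recovery_policy are invented for this sketch; the notes
   * only require that the three policy items be settable at communicator
   * creation time, not this particular interface.                        */
  #include <mpi.h>

  typedef enum {
      MPIX_REPLACE_RESTORE,    /* restore failed processes (default)      */
      MPIX_REPLACE_PROC_NULL,  /* replace missing ranks with MPI_PROC_NULL*/
      MPIX_REPLACE_ELIMINATE   /* eliminate the communicator              */
  } MPIX_Replacement_policy;

  typedef struct {
      const int *vital_ranks;        /* ranks the application cannot lose */
      int        num_vital_ranks;
      int        min_size;           /* minimum usable communicator size  */
      MPIX_Replacement_policy replacement;
  } MPIX_Recovery_policy;

  int MPIX_Comm_set_recovery_policy(MPI_Comm comm,
                                    const MPIX_Recovery_policy *policy);

  /* Example: a work communicator that must keep rank 0 alive, can shrink
   * to half its original size, and restores any other failed rank.       */
  static int setup_work_comm(MPI_Comm parent, MPI_Comm *work)
  {
      static const int vital[] = { 0 };
      int size;

      int rc = MPI_Comm_dup(parent, work); /* starts with parent's policy */
      if (rc != MPI_SUCCESS)
          return rc;
      MPI_Comm_size(*work, &size);

      MPIX_Recovery_policy policy = {
          .vital_ranks = vital, .num_vital_ranks = 1,
          .min_size = size / 2,
          .replacement = MPIX_REPLACE_RESTORE
      };
      return MPIX_Comm_set_recovery_policy(*work, &policy); /* override   */
  }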
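An illustration of the point-to-point life cycle above: a send that hits a
failed peer returns an error, the local part of the communicator enters an
error state, and the application calls the proposed MPI_Recover() before
continuing. MPI_Recover() is named in these notes; the one-argument
prototype and the MPIX_ERR_PROC_FAILED error class mentioned in the
comments are assumptions for the sketch.

  /* Sketch only: MPI_Recover() is the local recovery function proposed in
   * these notes; its prototype is assumed here for illustration.         */
  #include <mpi.h>

  int MPI_Recover(MPI_Comm comm);   /* assumed: local, non-collective repair */

  static int send_with_recovery(double *buf, int count, int dest, int tag,
                                MPI_Comm comm)
  {
      /* Requires MPI_ERRORS_RETURN on comm so the failure surfaces here
       * instead of aborting the job.                                     */
      int rc = MPI_Send(buf, count, MPI_DOUBLE, dest, tag, comm);
      if (rc == MPI_SUCCESS)
          return MPI_SUCCESS;

      /* The failed peer was involved in this operation, so the local part
       * of the communicator is now in an ERROR state and must be repaired
       * before any further traffic on this communicator.                 */
      int error_class;
      MPI_Error_class(rc, &error_class);
      /* e.g. check for a hypothetical MPIX_ERR_PROC_FAILED class here    */

      rc = MPI_Recover(comm);   /* restores the communicator to ACTIVE,
                                   per the policy attached to it          */
      if (rc != MPI_SUCCESS)
          return rc;            /* recovery itself failed                 */

      /* Depending on the replacement policy, dest may now be a restored
       * process or MPI_PROC_NULL; the caller decides whether to resend.  */
      return MPI_Send(buf, count, MPI_DOUBLE, dest, tag, comm);
  }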
Enabling error notification:
****************************

- The application needs to set MPI_ERRORS_RETURN in all processes if it is
  to take advantage of MPI's error recovery capabilities (a minimal setup
  sketch appears at the end of these notes).

Layered Libraries:
******************

The intent of the approach to fault tolerance in MPI is to give the
application an opportunity to recover even if layered libraries are being
used. This includes cases where libraries, including legacy libraries, are
used that may not have support for fault tolerance built into them, or
that, if they do, may need to invoke different recovery policies. The
following items are specifically aimed at providing this support:
- Error state and error codes are associated with a communicator. The
  assumption is that libraries will isolate their communications from the
  other MPI communications in the application run by using their own
  communicator.
- Recovery is communicator based.

Supporting the use of "legacy libraries" - libraries that do not support
fault tolerance, and never will - is desired, as these are important use
case scenarios for existing applications. If a library can be restarted
without restarting the processes running it, it meets this condition.

Dynamic communicators:
**********************

Error Callbacks:
****************

For asynchronous error notification, a callback scheme needs to be
supported, so that applications can register the functions they want to
invoke when notified of an error.

The following operations are allowed within a callback routine:
- process-local work
- querying MPI for error data

Disallowed:
- MPI communications
- MPI recovery

API:
****
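A minimal sketch of the setup described under "Enabling error
notification", using only standard MPI calls: every process replaces the
default MPI_ERRORS_ARE_FATAL handler with MPI_ERRORS_RETURN, both on
MPI_COMM_WORLD and on the duplicated communicator a layered library would
use to isolate its traffic.

  /* Standard MPI only: switch error handling to MPI_ERRORS_RETURN so that
   * failures are reported through return codes (and through any
   * registered callbacks) instead of aborting the job.                   */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Comm lib_comm;

      MPI_Init(&argc, &argv);

      /* Required in all processes to use the recovery capabilities.      */
      MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

      /* A layered library isolates its traffic on its own communicator,
       * so its error state and recovery stay separate from the
       * application's communicators.                                     */
      MPI_Comm_dup(MPI_COMM_WORLD, &lib_comm);
      MPI_Comm_set_errhandler(lib_comm, MPI_ERRORS_RETURN);

      /* ... application and library communication, with return codes
       *     checked as in the earlier sketches ...                       */

      MPI_Comm_free(&lib_comm);
      MPI_Finalize();
      return 0;
  }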