Purpose:
********

This is intended to describe how MPI will respond when a process either
fails, or appears to have failed, due to a broken network connection.

Use case scenarios:
*******************

- Symmetric failure: one process out of N is no longer responsive.
- Asymmetric failure: communications A->B are ok, but B->A are not.

Notes:
- Loss of communication is one indication of process "failure" from MPI's
  perspective, even if the remote process is still alive.
- It is the responsibility of the MPI implementation to propagate a
  consistent view of the state of the parallel job to the application. So,
  in the case of the asymmetric failure, either all processes must be
  notified that A has failed, or no process failure notification is
  delivered at all. The implementation will need to decide how to handle
  the asymmetric failure, and whether it should migrate/terminate process
  A, or just continue as is. A decision to move the process needs no user
  notification.

Errors:
*******

- Error detection is external to the MPI specification, and is
  implementation specific.
- Failures that can be corrected by the MPI implementation without input
  from the application are not considered errors from MPI's perspective.
  So a transient network failure, or a link failure that can be routed
  around, is not considered a failure.
- An MPI implementation must support detecting the following errors:
  - process failure: a process is considered failed if it is
    non-responsive - failed or disconnected
  - communications failure

Error states:
*************

- Error state is associated with a communicator.
- A communicator that has lost a process is defined to be in an error
  state.
- A process moves into an error state if it tries to interact with a
  failed process.
- When an MPI process is restored, it is restored in an error state.
- Each process that is in an error state must undertake an MPI repair
  action.
- Recovery from an error state must be done one communicator at a time.

Error notification:
*******************

By default, errors are reported synchronously, and only through calls in
which the failed remote process is involved. So if B fails, A will be
notified when:
- a send to B is attempted
- a receive from B is posted
- a receive from MPI_ANY_SOURCE is posted
- a put to B is issued
- a get from B is issued
- a collective operation is posted
- collective I/O is attempted
- a new communicator is created that includes B

A process can subscribe for notification of events it would not be
notified of by default; such notification will be asynchronous. In
particular, a process can subscribe for asynchronous notification of
collective failure.
Note: how do we make sure this notification occurs only once? How do we
associate this call with previously called, or yet to be called,
collectives, to ensure consistency across the application?

Error notification must be consistent, i.e. MPI must propagate a single
view of the system, which may change over time. There may be a time lag in
this information getting out to all processes.

New error codes will need to be added to reflect the errors being handled
by the implementation.

To obtain details on error states, a user-defined callback will need to be
registered with the MPI library. Registration characteristics (a sketch of
a possible registration interface follows this list):
- registered per communicator
- registered per error class
- provides error context
- performs process-local work only
- called only on error, and only when registered
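As an illustration only: no such registration interface exists in the MPI
standard, and the MPIX_-prefixed names below (MPIX_Error_context,
MPIX_Comm_register_error_cb, MPIX_ERR_CLASS_PROC_FAIL) are invented for
this sketch. It shows the shape implied by the list above: a callback
registered per communicator and per error class, receiving an error
context, and doing process-local work only.

  /* Hypothetical illustration only: the MPIX_* names are invented for
   * this sketch and are not part of any MPI standard or implementation. */
  #include <mpi.h>
  #include <stdio.h>

  /* Assumed opaque handle carrying error details (e.g. the failed rank). */
  typedef struct MPIX_Error_context MPIX_Error_context;

  /* Assumed accessor for the error context - name is a placeholder.      */
  int MPIX_Error_context_get_rank(const MPIX_Error_context *ctx,
                                  int *failed_rank);

  /* Assumed registration call: per communicator, per error class.        */
  typedef void (*MPIX_Error_cb)(MPI_Comm comm, int error_class,
                                const MPIX_Error_context *ctx,
                                void *user_data);
  int MPIX_Comm_register_error_cb(MPI_Comm comm, int error_class,
                                  MPIX_Error_cb cb, void *user_data);

  /* Example callback: process-local work only - record which rank failed
   * so the main path can decide whether to invoke recovery later.  No MPI
   * communication and no recovery is initiated from inside the callback,
   * per the restrictions listed above.                                   */
  static void proc_fail_cb(MPI_Comm comm, int error_class,
                           const MPIX_Error_context *ctx, void *user_data)
  {
      int *failed_rank = (int *)user_data;      /* process-local state    */
      MPIX_Error_context_get_rank(ctx, failed_rank);
      fprintf(stderr, "notified: rank %d failed (class %d)\n",
              *failed_rank, error_class);
  }

  /* Registration, e.g. right after the communicator is created:
   *
   *   static int last_failed = -1;
   *   MPIX_Comm_register_error_cb(comm, MPIX_ERR_CLASS_PROC_FAIL,
   *                               proc_fail_cb, &last_failed);
   */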
The callback will be called before the MPI call returns. It is intended to
provide a way to give error information back to the application in a
thread-safe manner, while still allowing current applications to run
unchanged.

Error classes:
**************

The following are the types of errors that an application can subscribe
to:
- process failure
- network failure
- file system failure
- network degradation

The error implications are implementation specific, and depend on how the
implementation handles these errors.

Recovery process:
*****************

The overall approach is to keep recovery as local as possible to the
processes directly affected by the failure, and to minimize global
recovery.

Point-to-point communications:
- Recovery is local, i.e. not collective.
- When the application is notified of a failure, the recovery function
  must be called if the application wants to continue using MPI.
- The recovery function will have the following attributes:
  - called on a per-communicator basis (there will be an option to
    aggregate recovery across communicators)
  - the caller specifies the mode of recovery: restore processes, replace
    missing processes with MPI_PROC_NULL, or eliminate the communicator
  - the caller specifies what to do with outstanding point-to-point
    traffic: preserve traffic with unaffected ranks, or remove all traffic
    queued on the given communicator (there is a potential race condition
    in the last case, if this communicator will continue to send data -
    need to think more about this)
  - the recovery function may handle recovery for as many failed processes
    as it is aware of at a given time. This implies that when subsequent
    MPI calls hit error conditions that have not yet been cleared and
    return an error (this is a requirement), the subsequent call to the
    recovery function will not restore a process that has already been
    restored.
- The concept of an epoch does NOT exist.

Collective communications:
- Recovery is collective.
- The collective call will return an error code that indicates the known
  state of the communicator, with failure of any process in the
  communicator putting it into an error state.
- By default, the error code will be generated based on local information,
  so if the collective completes successfully locally but another process
  has failed, MPI_SUCCESS will be returned.
- An "attribute" can be set on a per-communicator basis that requires the
  communicator to verify that the collective has completed successfully on
  a global basis. This does not require a two-phase collective, i.e. data
  may be written to the destination buffer before global verification has
  occurred, but completion can't occur until the global check has
  occurred.
- MPI should "interrupt" outstanding collectives, and return an error code
  if it detects an error. MPI should also remove all "traces" of the
  failed collectives.
- MPI should handle outstanding traffic associated with the failed
  processes, as described in the point-to-point recovery section.
- MPI will provide a collective (?) function to check the state of the
  communicator, as a way to verify global state, something like
  mpi_fault_status(comm, array_of_status_codes) or MPI_Comm_validate(comm)
  (a usage sketch appears below).
- The concept of an epoch is supportable.

All repaired communicators need to be consistent after repair, that is,
the state of process A is the same in all communicators of which this
process is a member.
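As an illustration of the collective-recovery flow above: with
MPI_ERRORS_RETURN in effect, a collective returns a locally generated
error code, and the application can then fall back to the proposed global
check. MPI_Comm_validate() and MPI_Recover_collective() are the names used
in these notes and are not standard MPI; the one-argument prototypes below
are assumptions made for the sketch.

  /* Sketch only: MPI_Comm_validate() and MPI_Recover_collective() are the
   * functions proposed in these notes; the prototypes are assumed here
   * for illustration and are not standard MPI.                           */
  #include <mpi.h>

  int MPI_Comm_validate(MPI_Comm comm);      /* assumed: global state check */
  int MPI_Recover_collective(MPI_Comm comm); /* assumed: collective repair  */

  /* Perform an allreduce and drive the proposed recovery path on failure. */
  static int checked_allreduce(double *local, double *global, int count,
                               MPI_Comm comm)
  {
      /* By default the error code reflects local knowledge only: the call
       * may return MPI_SUCCESS here even though another rank has failed. */
      int rc = MPI_Allreduce(local, global, count, MPI_DOUBLE, MPI_SUM,
                             comm);

      if (rc != MPI_SUCCESS) {
          /* The communicator is now in an error state; no further traffic
           * can proceed on it until the collective repair completes.     */
          return MPI_Recover_collective(comm);
      }

      /* Optional stronger guarantee: verify global completion explicitly,
       * as an alternative to the per-communicator "global verification"
       * attribute described above.                                       */
      return MPI_Comm_validate(comm);
  }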
Process recovery:
*****************

The user can specify the following information on a per-communicator
basis:
- The list of vital ranks, i.e. the ranks without which the application
  can't continue, and whose failure the application can't recover from
  even if they are restored.
- The minimum communicator size still usable by the application.
- The replacement policy (the implementation must return an error in the
  recovery process if conflicting policies prevent consistent recovery).

These would need to be set at communicator creation time (a sketch of one
possible way to do this appears below). By default, the communicator will
use the policy set by the parent communicator. In the case of
intercommunicators, the default policy will be the one set by the group
associated with the parent communicator.

The user may specify the following information on a global basis:
- The list of vital ranks within MPI_COMM_WORLD, i.e. the ranks without
  which the application can't continue, and whose failure the application
  can't recover from even if they are restored.
- The minimum communicator size still usable by the application.
- The replacement policy.

Default policies:
- No vital ranks.
- Minimum communicator size: greater than zero.
- Replacement policy: restore failed processes.

The recovery process will be invoked by the MPI_Recover() function for
local recovery after a point-to-point communications failure, and by the
MPI_Recover_collective() function for recovery after a collective
communications failure. The recovery policy is associated with the
communicator, so it need not be specified when initiating recovery.

Based on the recovery policy, the recovery routines may:
- Restart failed processes. A single call may restart multiple processes.
  If a process was already restarted by another call to the recovery
  function (either by another process, or by a previous call to a recovery
  function), another process with the same rank will not be created.
- Assign the restarted process a rank in the communicator.
- Assign MPI_PROC_NULL to the processes that will not be restored.

Note: How should MPI be initialized? Should we have MPI_Init() called by
the application? The implementation can set flags so it knows a restart is
in progress. Do we notify the application it is being restarted? If so,
how?

Communicator life cycle:
************************

- A communicator starts its life with N processes, and in an ACTIVE state.
- Once a process fails, the communicator enters an error state.
- When in an error state:
  - Point-to-point communications:
    - If the failed process is not involved in the communications, the
      local part of the communicator will NOT enter an error state, and
      communications will continue as though no error has occurred.
    - If the failed process IS involved in the communications, the local
      part of the communicator WILL be marked as being in an ERROR state.
      The application must call the MPI recovery function to restore the
      state of the communicator to ACTIVE (a usage sketch appears below).
      No communications can proceed while the communicator is in an error
      state. (Question: this could require checking the error state all
      the time, so we may need to think of a better way to do this.)
  - Collective communications:
    - Any process failure with outstanding communications puts all parts
      of the communicator in an error state. A collective MPI recovery
      function must be called to restore the state of the communicator.
      This gives the implementation the ability to perform any
      optimization it typically does for collective operations when
      communicators are constructed.
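The notes leave open how the per-communicator policy (vital ranks, minimum
size, replacement policy) is actually expressed. Purely as an
illustration, the sketch below assumes a hypothetical
MPIX_Comm_set_recovery_policy call applied immediately after communicator
creation; the structure fields mirror the three policy items listed above,
and the enum values mirror the recovery modes and defaults described in
these notes.

  /* Hypothetical illustration: MPIX_Recovery_policy and
   * MPIX_Comm_set_recovery_policy are invented for this sketch; the notes
   * only require that the three policy items be settable at communicator
   * creation time, not this particular interface.                        */
  #include <mpi.h>

  typedef enum {
      MPIX_REPLACE_RESTORE,    /* restore failed processes (default)      */
      MPIX_REPLACE_PROC_NULL,  /* replace missing ranks with MPI_PROC_NULL*/
      MPIX_REPLACE_ELIMINATE   /* eliminate the communicator              */
  } MPIX_Replacement_policy;

  typedef struct {
      const int *vital_ranks;        /* ranks the application cannot lose */
      int        num_vital_ranks;
      int        min_size;           /* minimum usable communicator size  */
      MPIX_Replacement_policy replacement;
  } MPIX_Recovery_policy;

  int MPIX_Comm_set_recovery_policy(MPI_Comm comm,
                                    const MPIX_Recovery_policy *policy);

  /* Example: a work communicator that must keep rank 0 alive, can shrink
   * to half its original size, and restores any other failed rank.       */
  static int setup_work_comm(MPI_Comm parent, MPI_Comm *work)
  {
      static const int vital[] = { 0 };
      int size;

      int rc = MPI_Comm_dup(parent, work); /* starts with parent's policy */
      if (rc != MPI_SUCCESS)
          return rc;
      MPI_Comm_size(*work, &size);

      MPIX_Recovery_policy policy = {
          .vital_ranks = vital, .num_vital_ranks = 1,
          .min_size = size / 2,
          .replacement = MPIX_REPLACE_RESTORE
      };
      return MPIX_Comm_set_recovery_policy(*work, &policy); /* override   */
  }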
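An illustration of the point-to-point life cycle above: a send that hits a
failed peer returns an error, the local part of the communicator enters an
error state, and the application calls the proposed MPI_Recover() before
continuing. MPI_Recover() is named in these notes; the one-argument
prototype and the MPIX_ERR_PROC_FAILED error class mentioned in the
comments are assumptions for the sketch.

  /* Sketch only: MPI_Recover() is the local recovery function proposed in
   * these notes; its prototype is assumed here for illustration.         */
  #include <mpi.h>

  int MPI_Recover(MPI_Comm comm);   /* assumed: local, non-collective repair */

  static int send_with_recovery(double *buf, int count, int dest, int tag,
                                MPI_Comm comm)
  {
      /* Requires MPI_ERRORS_RETURN on comm so the failure surfaces here
       * instead of aborting the job.                                     */
      int rc = MPI_Send(buf, count, MPI_DOUBLE, dest, tag, comm);
      if (rc == MPI_SUCCESS)
          return MPI_SUCCESS;

      /* The failed peer was involved in this operation, so the local part
       * of the communicator is now in an ERROR state and must be repaired
       * before any further traffic on this communicator.                 */
      int error_class;
      MPI_Error_class(rc, &error_class);
      /* e.g. check for a hypothetical MPIX_ERR_PROC_FAILED class here    */

      rc = MPI_Recover(comm);   /* restores the communicator to ACTIVE,
                                   per the policy attached to it          */
      if (rc != MPI_SUCCESS)
          return rc;            /* recovery itself failed                 */

      /* Depending on the replacement policy, dest may now be a restored
       * process or MPI_PROC_NULL; the caller decides whether to resend.  */
      return MPI_Send(buf, count, MPI_DOUBLE, dest, tag, comm);
  }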
Enabling error notification:
****************************

- The application needs to set MPI_ERRORS_RETURN in all processes if it is
  to take advantage of MPI's error recovery capabilities (a minimal setup
  sketch appears at the end of these notes).

Layered Libraries:
******************

The intent of the approach to fault tolerance in MPI is to give the
application an opportunity to recover even if layered libraries are being
used. This includes cases where libraries, including legacy libraries, are
used that may not have support for fault tolerance built into them, or
that, if they do, may need to invoke different recovery policies. The
following items are specifically aimed at providing this support:
- Error state and error codes are associated with a communicator. The
  assumption is that libraries will isolate their communications from the
  other MPI communications in the application run by using their own
  communicator.
- Recovery is communicator based.

Supporting the use of "legacy libraries" - libraries that do not support
fault tolerance, and never will - is desired, as these are important use
case scenarios for existing applications. If a library can be restarted
without restarting the processes running it, it meets this condition.

Dynamic communicators:
**********************

Error Callbacks:
****************

For asynchronous error notification, a callback scheme needs to be
supported, so that applications can register the functions they want to
invoke when notified of an error.

The following operations are allowed within a callback routine:
- process-local work
- querying MPI for error data

Disallowed:
- MPI communications
- MPI recovery

API:
****
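A minimal sketch of the setup described under "Enabling error
notification", using only standard MPI calls: every process replaces the
default MPI_ERRORS_ARE_FATAL handler with MPI_ERRORS_RETURN, both on
MPI_COMM_WORLD and on the duplicated communicator a layered library would
use to isolate its traffic.

  /* Standard MPI only: switch error handling to MPI_ERRORS_RETURN so that
   * failures are reported through return codes (and through any
   * registered callbacks) instead of aborting the job.                   */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Comm lib_comm;

      MPI_Init(&argc, &argv);

      /* Required in all processes to use the recovery capabilities.      */
      MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

      /* A layered library isolates its traffic on its own communicator,
       * so its error state and recovery stay separate from the
       * application's communicators.                                     */
      MPI_Comm_dup(MPI_COMM_WORLD, &lib_comm);
      MPI_Comm_set_errhandler(lib_comm, MPI_ERRORS_RETURN);

      /* ... application and library communication, with return codes
       *     checked as in the earlier sketches ...                       */

      MPI_Comm_free(&lib_comm);
      MPI_Finalize();
      return 0;
  }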