[Mpi3-ft] Talking Points Notes for Plenary Session

Josh Hursey jjhursey at open-mpi.org
Thu Dec 15 15:44:40 CST 2011

As I mentioned on the teleconf today, I am putting together slides for
the plenary session to introduce the MPI Forum members to the
proposal. Attached is a talking points summary of the slides (I'm just
starting on the slide layout now).

The goal is to introduce the forum to the concepts in the proposal as
a way to scope the first reading.

If you have a chance, let me know what you think. I am planning on
sending around an initial version of the slides along with the 'first
reading ready' version of the document on Monday, Dec. 19. A new
version of the document will be available tomorrow (Friday, Dec. 16)
for initial review.


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
-------------- next part --------------
- Josh Hursey
- 12/15/2011

Target Application Behaviors:
* Algorithm-Based Fault Tolerance (ABFT) & Naturally Fault Tolerant
  - After process failure, move forward with all messages in the queue
* Rollback Recovery Fault Tolerance
  - After process failure, flush pending messages and rendezvous at a previous step
* Combinations of the above
  - There are many stages of transition between these two types of fault tolerant behavior.
  - The interfaces and semantics provided in this proposal allow for this range to be explored with MPI

Derived Goals:
* Fault notification and handling should be as local as possible
  - Opportunity to investigate more scalable and performant ABFT techniques
  - Default: Only those processes interacting with a failed process are notified/effected by it
  - Option: Receive a uniform, global notification though process failure handlers
* Failure detector
  - Proposal does not specify how it is implemented, just characteristics it must provide.
  - Guarantee that eventually every alive process will have the opportunity to know that a failure occurred.
  - Processes mistakenly identified as failed, will be forced to fail.
* Focus on defining the minimum set of MPI behavior to provide necessary flexibility
  - We do not specify access to semi-transparent services like checkpoint/restart or replication, but those servers could be built on top of this interface.

* Fail stop process failures.
  - All other types of failures remain undefined by the MPI standard at this time.
* Query Interfaces
  - Functionality to query for the current set of failed processes associated with a communication object.
* Point-to-Point
  - Isolate impact of process failure for directed communication.
* Collectives:
  - Fault-aware collectives and a fault tolerant group membership protocol to reestablish the collective group after failure.
* Communication Object Creation:
  - Uniform creation of objects across the collective group
* Process Failure Handlers:
  - A consistent, and uniform notification mechanism for process failure not tied to the current MPI operation.
* One-Sided Communication
  - Process failure invalidates the window and all outstanding operations
* I/O
  - Defines the state of MPI_File_() operations, but not the state of the file.
* Guidance & Examples
  - Suggested use cases

Model: Query Interfaces:
* Query interfaces return an MPI_Group representing the failed processes in the associated communication object.
  - MPI_Group operations can be used to reason about the failed set of processes
    - MPI_Group_translate_ranks: Determine if a specific 'rank' failed
    - MPI_Group_compare: Determine if any 'new' failures occurred since a previous point in time.
    - MPI_Group_size: Number of process failures.
    - ...
* Locally consistent list:
  - Good for applications that rely on directed communication (e.g., P2P).
  - For example, ABFT & naturally fault tolerant applications
  - The following new functions return a locally consistent MPI_Group of failed processes
    - MPI_Comm_group_failed()
    - MPI_Comm_remote_group_failed()
    - MPI_Win_get_group_failed()
    - MPI_File_get_group_failed()
  - The following returns a locally consistent MPI_Group of failed processes and re-activates MPI_ANY_SOURCE:
    - MPI_Comm_reenable_any_source()
* Globally consistent list:
  - Good for applications that:
    - Use collective operations
    - Need to know that all processes have rendezvous at a given location in the code
    - At this point can make the same decision based on a globally consistent list of failed processes
  - All of these operations are collective fault tolerant agreement/membership protocols
  - MPI_Comm_validate() / MPI_Icomm_validate()
    - Also reenables collective operations on the communicator.
  - MPI_Comm_validate_multiple() / MPI_Icomm_validate_multiple()
    - Essentially the same semantics as calling MPI_Comm_validate multiple times.
  - MPI_Win_validate() / MPI_Iwin_validate()
  - MPI_File_validate() / MPI_Ifile_validate()
    - 'sync-barrier-sync' semantics

Model: Point-to-Point:
* Directed communication:
  - The P2P operation is explicitly dependent on the specified source or destination
  - Operation is unaffected by processes failure unless the failed process is specified in the operation.
  - If the peer process is failed, the operation will complete in error.
* Wildcard communication:
  - The P2P operation is implicitly dependent on all processes in the communicator
  - If -any- process fails in the communicator, posted MPI_ANY_SOURCE operations will:
    - If blocking (MPI_Recv), complete in error.
    - If nonblocking (MPI_Irecv), return a warning, and not complete
      - User is provided the opportunity to cancel or continue waiting on the request without disturbing the matching queue.
  - While MPI_ANY_SOURCE is disabled, nonblocking wildcard receives posted before the process failure:
    - Can be matched and completed if an incoming message matches them
    - Can be canceled by the user
  - Subsequent MPI_ANY_SOURCE operations will immediately complete in error until the user explicitly reactivates them.
    - MPI_Comm_reenable_any_source() returns a locally consistent MPI_Group of failed processes and reactivates MPI_ANY_SOURCE operations.
    - This operation is how the user acknowledges the excluded processes are exempt from matching due to their failure in the communicator.
* Drain:
  - MPI_Comm_drain() / MPI_Icomm_drain()
  - A fault tolerant collective operation that completes all outstanding P2P communication at the time of the completion of the call.
    - Communication is either completed successfully, or are canceled by the library and return MPI_ERR_DRAINED.
  - Example:
     - For applications using a rollback technique, they may wish to 'flush' the communication channel due to a process failure. In this situation the application may have outstanding operations that will not complete due to rollback of program flow. Completing these messages becomes difficult for this application, and canceling each of them becomes cumbersome. This collective operation provides a cut in the communication for a specific communicator.

Model: Collectives:
* Collective operations are explicitly dependent on all processes in the communicator
  - If a failure occurs on the communicator then the communicator becomes collectively inactive
* Collectively inactive communicators:
  - Collectives must become fault-aware so that they are guaranteed to eventually complete in the presence of process failure with either success or MPI_ERR_PROC_FAIL_STOP, and not hang.
  - In the presence of new process failure, posted collective operations do not need to return the same error class at all processes. It is possible/probable that some process will return success while others the error MPI_ERR_PROC_FAIL_STOP.
  - Once collectively inactive, subsequent collective operations on the communicator will return MPI_ERR_PROC_FAIL_STOP.
* Collectively re-activating a communicator:
  - MPI_Comm_vaildate() re-activates the collectives on the communicator. Also resets the collective operation matching.
  - The user is provided a globally consistent list of failed processes that will be excluded from subsequent collective operations on the specified communicator.
  - This serves as an acknowledgment that the user agrees if they post further collectives they understand that the identified processes will not participate.
* After a process failure, the collective operations do -not- need to be called the same number of times.
  - MPI_Bcast example from the proposal

Model: Communication Object Creation:
* Communication creation operation are explicitly dependent on all processes in the input communicator.
* In the presence of process failure, the communication object(s) must be created everywhere or nowhere.
  - If a non-fatal error handler is registered on the input communicator, then the creation operation is required to synchronize.
* The input communicator must be collective active or the creation operation will complete with an error of MPI_ERR_PROC_FAIL_STOP at all processes.
  - If a new process failure occurs during the communicator creation operation, the operation is guaranteed to complete in error since the communicator will become collectively inactive.
* Similar to collective operations, after the call to MPI_Comm_validate() on the input communicator, the matching of communication object creation functions are reset.

Model: Process Failure Handlers:
* Provides the option of a consistently ordered callback triggered upon process failure of one or more processes.
* Unlike error handlers these are not tied to the current call site, but can be activated in any MPI operation.
* The following operations are provided to set/get process failure handlers:
  - MPI_Comm_set_failhandler()
  - MPI_Comm_get_failhandler()
  - MPI_Win_set_failhandler()
  - MPI_Win_get_failhandler()
  - MPI_File_set_failhandler()
  - MPI_File_get_failhandler()
  - MPI_FAILHANDLER_NULL: Default, unset failure handler object
* Callback has the same functional signature as the error handler.
* It is often the case that more than one failure handler will be activated due to a process failure
  - The ordering of process failure handler calls is consistent at all processes.
  - So collective operations can safely occur in the handlers.
* Process failure handlers are not called from within another process failure handler. If a new process failure occurs during the handler, the handler will be called again later.
  - Upon completion of a failure handler, the library determines if any additional failures happened during the call that the user was not made aware of via a globally consistent query call. If so then the failure handler will be activated again later in the calling sequence, again in a globally consistent manner.
  - Calling MPI_Comm_validate() can synchronize the failure handlers and make sure that a specific failure handler is called the same # of times with this rule.
* Failure handlers are 'set' collectively to ensure consistent calling order.
* Failure handlers are -not- inherited during communicator creation.
* Two modes of operation set globally by the user before the first registering the first failure handler.
  - MPI_Failhandler_set_mode()
  - MPI_Failhandler_get_mode()
  - All failure handlers registered are called for every set of process failures
  - The MPI library does not need to synchronize the failure set before calling the failure handler sequence. The user can choose to do this if they need to by using MPI_Comm_validate().
  - The subset of failure handlers registered to communicators that contain the failed process are added to the failure handler call sequence.
  - The MPI library must synchronize the known list of process failures before calling the first failure handler to determine a globally consistent failure handler call sequence across the processes.
* Application allowed to use any MPI operations (with the exception of MPI_Finalize) in the failure handler
  - For example, if the intention is to rollback then the application may want to drain the communicator and set a global rollback flag in their application.
  - Additionally, the failure handlers can be used to isolate much of the fault handling logic to just the failure full path. This can reduce the impact on the failure free path, though it may still need to check error codes.

Model: One-Sided Communication:
* The window is invalidated once any new process failure occurs in the window
  - All epochs are completed in error
  - Outstanding RMA operations will complete in error
  - Subsequent RMA operations will raise an error code of the class MPI_ERR_PROC_FAIL_STOP
  - Memory associated with the window is undefined
* Re-activating the window
  - The window can be re-created from MPI_Win_create(), or
  - The user can call MPI_Win_validate() to reenable RMA operations on a window that has been previously invalidated.

Model: I/O:
* After a process failure, the state of the external file must be determined by the application.
  - The application may be able to use MPI_File_read_at() and similar operations to make such determinations.
* MPI_File_validate()
  - Synchronizes the file to the disk (similar to sync-barrier-sync behavior)
  - Reenables collectives in the file handle
* Behavior of local and collective operations on file handles basically match the semantics of point-to-point and collective operations on communicators.
  - Local file operations are not effected by peer process failure
  - Collective operations are disabled when a new process failure occurs
* MPI_File_close:
  - Requires a collectively active file handle
  - If a new process failure is encountered then the file is -not- closed.
    - This allows the application to take corrective action before losing context with the file.

Model: Guidance & Examples:
* Before entering a library, call MPI_Comm_validate()
  - This way you know all processes will enter the library.
  - Advice is akin to suggesting that a termination detection protocol be used before giving control to the library.
* Before returning from a library, call MPI_Comm_validate()
  - This way you know that all processes will return together.
  - Advice is akin to suggesting that a termination detection protocol be used before returning control to the app.
* There are way around this, but requires more advanced cooperation between the application and library.
* What other examples might we want to include here?

More information about the mpiwg-ft mailing list