[Mpi3-ft] Fault Tolerance & RMA Discussion

Josh Hursey jjhursey at open-mpi.org
Thu Feb 2 14:00:55 CST 2012

We made some really good progress on today's call. Attached are some notes
that I took from the call.

At the end of the call there were a couple of items that we wanted to get a
finer understanding of. As a result, we are going to try to set up another
teleconference.

Below is a doodle poll to pick a date/time:

If you are interested in attending this teleconf, please fill out the poll
by 2 pm Eastern on Monday, Feb. 6.


On Thu, Feb 2, 2012 at 10:01 AM, Josh Hursey <jjhursey at open-mpi.org> wrote:

> Just a reminder that we are meeting today at Noon Eastern to discuss RMA
> in the context of the fault tolerance proposal.
> The Run-Through Stabilization proposal can be found attached to the ticket:
>   https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/276
> https://svn.mpi-forum.org/trac/mpi-forum-web/attachment/ticket/276/FTWG-Process-FT-Draft-2011-12-20.pdf
> We will be focusing on section 17.11 of that document. Note that this
> section does not currently explicitly account for the new RMA proposal, but
> we would like to remedy that for the next reading.
> Thanks,
> Josh
> On Wed, Jan 25, 2012 at 3:15 PM, Josh Hursey <jjhursey at open-mpi.org>wrote:
>> There was no one date/time that worked for everyone, but I chose a time
>> that worked for most of the respondents. We will meet Thursday, Feb. 2 from
>> 12-1 pm EST/New York to discuss this topic.
>> We can use the following teleconf information:
>>   US Toll Free number: 877-801-8130
>>   Toll number: 1-203-692-8690
>>   Access Code: 1044056
>> Thanks,
>> Josh
>> On Mon, Jan 23, 2012 at 4:33 PM, Josh Hursey <jjhursey at open-mpi.org>wrote:
>>> (Cross posted to both the RMA and FT MPI-3 listservs)
>>> During the FT plenary session at the Jan. MPI Forum meeting it was
>>> recommended that some of the members of the FT group and the RMA group have
>>> a meeting to hash out the precise details of the FT semantics for the RMA
>>> chapter. So I would like to facilitate such a discussion, preferably in
>>> the next week (so we have time to fine tune things before the next forum
>>> meeting).
>>> In general, we are trying to answer the question "How should RMA
>>> operations behave when a process failure occurs?" The feeling seemed to be
>>> that the current approach is ok (invalidating the window, forcing
>>> recreation/validation), but the statement that the memory exposed in the
>>> window is 'undefined' seemed excessive. The suggestion was to change the
>>> wording to something like "Only the memory associated with a window that
>>> was targeted by an operation that modified it is undefined after process
>>> failure in the group associated with the window." This led to a
>>> considerable amount of debate in the meeting, so it was suggested that we
>>> take the discussion offline.
>>> Below is a link to a doodle poll to find a good time for a teleconf. If
>>> you are interested in participating in this discussion, please fill this
>>> poll out by 2 PM Eastern on Wed. Jan 25 so we can set the date/time.
>>>    http://www.doodle.com/vd33va5h8iankega
>>> Thanks,
>>> Josh
>>> --
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
-------------- next part --------------
- Window creation semantics are good.
- Window must be invalidated upon local detection of process failure.
  - Local invalidation completes in error any outstanding epochs, no new RMA operations can be posted on the invalidated window, and outstanding RMA operations are completed in error.
- RMA operations that returned an MPI_Request object require the user to Test/Wait on those requests to complete them, even if the window was invalidated by an intervening process failure.
- RMA epoch termination operations do not need to be called once a window becomes invalidated since the invalidation operation completes these epochs. Users may call these operations, but they will return in error.
- Existing windows that are invalidated may be revalidated by calling MPI_Win_validate, after which the window can be used for RMA operations until a subsequent process failure invalidates it again.
- Alternatively, the user can call MPI_Win_free to destroy the window if they no longer need it, or wish to create a new window with a different membership.
- Once the window is invalidated, only the memory targeted by other processes during the failed epoch is undefined.

Window Creation in the presence of failure
- Current specification seems safe and correct, if somewhat heavyweight.
- Maybe we can consider loosening the restrictions later via an info argument, if needed. The challenge is that if we allow the created window to be partially valid upon creation, the semantics become more nuanced, which might limit the implementation's algorithmic choices (such as the use of distributed locks).
- Semantics:
  - MPI_SUCCESS: Window created everywhere, and is valid for window operations.
  - MPI_ERR_PROC_FAIL_STOP: Window not created everywhere; no window object is created. The user must call MPI_Comm_validate on the input communicator to collectively re-enable it before calling MPI_Win_{allocate,create,create_dynamic}.
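
The creation semantics above might look like the following sketch. Note that MPI_Comm_validate and MPI_ERR_PROC_FAIL_STOP are from the draft FT proposal, not standard MPI, so this is illustrative pseudocode rather than compilable code; the retry loop and argument details are assumptions:

```c
/* Sketch only: MPI_Comm_validate and MPI_ERR_PROC_FAIL_STOP are from
 * the draft FT proposal and are not part of standard MPI. */
MPI_Win win;
int rc;
do {
    rc = MPI_Win_create(base, size, disp_unit, MPI_INFO_NULL, comm, &win);
    if (rc == MPI_ERR_PROC_FAIL_STOP) {
        /* The window was not created anywhere (no window object exists).
         * Collectively re-enable the communicator, then retry creation. */
        MPI_Comm_validate(comm /* , proposal-specific output args */);
    }
} while (rc != MPI_SUCCESS);
/* On MPI_SUCCESS the window exists everywhere and is valid for RMA. */
```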

State of the window after a new process failure
- p564 line 42-43: Change "invalidated once any new process failure occurs in the ..." to something -like- "invalidated once any new process failure is locally detected in the ...". Needs some further wordsmithing, but the idea is that the window is locally invalidated once the local process is notified of the failure (eventually every process will invalidate its window too). So it is possible that, for a period of time, a live process that does not yet know of the process failure might successfully complete RMA operations until it is notified of the failure.
- The distributed locks implementation option becomes complicated if we do not invalidate the entire window when a new process failure occurs.
- What should we advise users to do if they called MPI_Win_lock_all and sometime later a process failure invalidates the window? Should they be required to call MPI_Win_unlock_all before destroying/validating the window?
  - Some options below:
    - Require the user to call MPI_Win_unlock_all and complete outstanding requests (maybe inside a process failure handler callback).
    - All epochs are completed in error when a window is invalidated, so attempting to complete the epoch with MPI_Win_unlock_all will return an error.
    - Validate creates a logically new window.
  - I think we decided that, since all epochs are completed in error when a window is invalidated, a call to MPI_Win_unlock_all or another epoch operation will return an error: there are no outstanding epochs, and no new epochs can be started on an invalidated window.
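
Under the semantics just decided, a passive-target sequence might behave as in this sketch (draft FT semantics; MPI_Win_validate is a proposal call, error handling is abbreviated, and the point at which the failure is detected is assumed for illustration):

```c
/* Sketch only: illustrates the draft FT semantics, not standard MPI. */
MPI_Win_lock_all(0, win);
MPI_Put(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
/* A process failure is detected locally: the window is invalidated and
 * the open access epoch is completed in error at that moment. */
rc = MPI_Win_unlock_all(win);  /* returns an error: the epoch is already
                                * closed, no epochs exist on this window */
/* No further epoch cleanup is required; the user may go straight to: */
MPI_Win_free(&win);            /* or MPI_Win_validate(win) to recover */
```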
- Considering MPI_Win_free: the intention of the FT proposal was to allow the user to call this operation once a process failure occurs, without requiring additional cleanup work on the window beforehand. In that sense we would not want to force the user to call MPI_Win_unlock_all if the window is invalidated and the epoch is already complete.
- For RMA operations with requests (e.g., Rput, Rget, ...) the user -is- required to call test/wait to complete the request, even if the window is invalidated.
  - This is symmetric with the semantics for nonblocking two-sided and collective operations, which require the user to call test/wait even in the presence of process failure.
  - The user can call the test/wait after destroying or validating the window.
  - Test/Wait tell the user if the operation associated with the request succeeded or not.
  - Suggested advice to users: "Even though the open epoch completed in error, it is possible that some of the operations completed successfully during the epoch."
    - Question of whether this situation is meaningful to the user if -all- memory in the window is undefined.
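
The request-completion requirement might look like the following sketch (draft FT semantics; whether the request reports success after a failure depends on when the operation actually completed, and the error-checking style is an assumption):

```c
/* Sketch only: request-based RMA completion under the draft FT
 * semantics. */
MPI_Request req;
MPI_Rput(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win, &req);
/* Even if a process failure invalidates 'win', the request must still
 * be completed with Test/Wait. This call is legal before or after the
 * window is validated or freed. */
rc = MPI_Wait(&req, MPI_STATUS_IGNORE);
if (rc == MPI_SUCCESS) {
    /* The local buffer may be reused, but the remote update may or may
     * not have been applied (analogous to a buffered MPI_Send). */
} else {
    /* The operation failed; the targeted remote memory is undefined. */
}
```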

State of the memory associated with the window after a new process failure
- Currently we say that -all- of the memory associated with the window after a new process failure is -undefined-.
- It is possible that an application could expose all of the process memory in the window. So we would like something more precise.
- Suggested "Only the memory blocks associated with the window during creation and targeted by any process during the failed epoch are undefined."
  - Note that 'is undefined' does not imply any responsibility on the MPI implementation to identify these regions (which might be difficult to do). The user must reason about these undefined regions above MPI.
  - One-sided operations are meant to support a model that encourages continuous, fine-grained access to remote memory. In this case a failing node results in a lot of collateral damage.
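
Since "is undefined" places no obligation on the implementation to identify the damaged regions, the application has to track the regions it targets itself, above MPI. A minimal sketch of such bookkeeping; the structure, names, and bound are all hypothetical:

```c
/* Hypothetical application-level bookkeeping: record each remote region
 * this process targets during an epoch so that, after an invalidation,
 * the application knows which blocks to treat as undefined. */
typedef struct { int target_rank; MPI_Aint disp; int count; } touched_t;
touched_t touched[MAX_OPS];   /* MAX_OPS: application-chosen bound */
int ntouched = 0;

touched[ntouched++] = (touched_t){ target, disp, n };
MPI_Put(buf, n, MPI_DOUBLE, target, disp, n, MPI_DOUBLE, win);
/* After a failure invalidates the window, only the regions recorded in
 * 'touched' (gathered across all surviving processes) need to be
 * re-initialized; MPI itself will not identify them. */
```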

Meaning of RMA operations across a process failure:
- Consider two live processes interacting when a third process fails, invalidating the window
  - The epoch is completed in error, but what does that mean in relation to operations posted during the epoch?
  - Get/Put/Accumulate/get_accumulate/fetch_and_op/compare_and_swap:
    - Since the epoch is completed in error and there are no other handles to these operations, the memory targeted by these operations is undefined.
  - Rput/Rget/Raccumulate/Rget_accumulate
    - Rget: if the request returns MPI_SUCCESS, the buffer is known to be valid.
    - Rput: if the request returns MPI_SUCCESS, the user knows that the buffer can be reused, but cannot be sure the value reached the other side (kinda like MPI_Send - it could have been buffered).
    - Rget_accumulate: if the request returns MPI_SUCCESS, the output buffer is valid and the input buffer is available for reuse, but it is unknown whether the remote side was updated.
  - Other operation combinations we should consider?
  - Do any of these require additional advice to users?

- If the window is valid, then MPI_Win_validate is a process synchronization primitive - a strong barrier, in a sense. It has no implications for outstanding RMA operations or epochs.
- When the window is invalidated, this essentially:
  (1) destroys the old window context,
  (2) validates the group participating in the window, and
  (3) creates a new window context and associates it with the old handle (the input 'win').
- Should the input 'win' really be 'inout'?
- Recreating the window is critical for collective window allocation operations.
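
Putting the recovery path together, a sketch of the intended usage of MPI_Win_validate (a draft-proposal call, not standard MPI; the `window_invalidated` check stands in for whatever FT error class the proposal would return):

```c
/* Sketch only: MPI_Win_validate is from the draft FT proposal. */
rc = MPI_Put(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
if (window_invalidated) {   /* hypothetical check for an FT error class */
    /* Collective over the surviving group: destroys the old window
     * context, validates the participating group, and binds a fresh
     * context to the same 'win' handle - which is why 'win' behaves
     * like an inout argument, and why recreation works even for
     * collectively allocated windows (MPI_Win_allocate). */
    MPI_Win_validate(win);
    /* 'win' may now be used for new epochs and RMA operations. */
}
```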
