[mpiwg-rma] FENCE local completion requirements

Balaji, Pavan balaji at anl.gov
Tue Sep 30 08:20:47 CDT 2014


On Sep 30, 2014, at 7:59 AM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:

>> Now to my next point — barrier synchronization only means that other processes have entered WIN_FENCE, not returned from WIN_FENCE.  This means that they might not have completed all operations at the target.
>> 
>> If we change my initial example to the following:
>> 
>> MPI_WIN_FENCE(win1)
>> MPI_PUT(win1)
>> MPI_WIN_FENCE(win1)
>> 
>> MPI_WIN_LOCK(win2)
>> MPI_GET(win2)
>> MPI_WIN_UNLOCK(win2)
>> 
>> .. the GET might not see the data written by PUT, since the WIN_FENCE only guarantees local completion, and the barrier semantics of the second FENCE only ensure that the target has “called” FENCE, not “returned” from FENCE.
> 
> Yes, that would need explicit synchronization before the lock, and it is covered in the standard: pg 448, ln 11-20.

Right.  And so would the following single-window code:

MPI_WIN_FENCE(win)
MPI_PUT(win)
MPI_WIN_FENCE(win)

MPI_WIN_LOCK(win)
MPI_GET(win)
MPI_WIN_UNLOCK(win)

especially if I gave a MODE_NOCHECK argument to WIN_LOCK.

It seems unnecessary to have two synchronizations, one in the implementation and one in the user code.  That’s issue #1. In my proposal only one synchronization is needed.

Issue #2 is that the standard says “RMA operations on win started by a process after the fence call returns will access their target window only after MPI_WIN_FENCE has been called by the target process.”  That’s a bogus statement as it doesn’t say anything about the state of the target window.  The target window is in a defined state only after WIN_FENCE returns, not after it has been called (e.g., if the MPI implementation queues up all the RMA operations and issues them during WIN_FENCE).

I believe the intention of the statement was to say that “RMA operations on win started by a process after the fence call returns will access the target window only after all operations to that target in the previous epoch have completed.”

Issue #3 is whether FENCE needs remote completion.  Note that your statement in one of the previous emails, that the previous text ensures that you don’t need to send an ACK back from the target to the origin, might not be valid on most systems.  For example, it doesn’t work on any system that has hardware RDMA, since the target doesn’t know of the operation.  It also does not work on systems that use active messages, unless the MPI implementation wants to queue up all of the potentially large number of RMA operations till the next FENCE, since it doesn’t know whether the target process has called/completed FENCE yet.  My point is that the “optimization” of not requiring remote completion might not be very useful in practice, though from the standard’s perspective I’d be OK with allowing this model.

My proposal is to clean this up with the following two changes --

1. Update the text in the standard to say: “A call to MPI_WIN_FENCE that ends an epoch entails a barrier synchronization that ensures that all operations issued in the epoch have completed both locally and at the target.  A call to MPI_WIN_FENCE that is known not to end an epoch (in particular, a call with assert equal to MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.”

2. Add a new assert that brings back the local-completion-only semantics similar to MPI-3.

  — Pavan

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji


