[mpiwg-rma] MPI RMA status summary

Rajeev Thakur thakur at mcs.anl.gov
Tue Sep 30 10:05:47 CDT 2014


There are two separate issues that are being confused: memory barriers and interprocess synchronization. Jeff's note and Bill's a, b, c choices are about memory barriers. Rolf's ticket #456 is about interprocess synchronization, i.e., whether the interprocess synchronization semantics of fence/PSCW for RMA operations also apply to shared-memory loads/stores.
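
For concreteness, a minimal sketch of that distinction on a shared-memory window in the unified model (the node communicator, variable names, and the barrier-based notification are illustrative, not taken from this thread): MPI_Win_sync supplies only the local memory barrier, while the process-to-process ordering comes from a separate call such as MPI_Barrier.

/* Sketch: plain store by rank 0, plain load by rank 1, on a shared-memory
 * window.  MPI_Win_sync provides the memory barriers; MPI_Barrier provides
 * the interprocess synchronization.  Run with >= 2 processes on one node. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm shmcomm;
    MPI_Win  win;
    int     *base, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &shmcomm);
    MPI_Comm_rank(shmcomm, &rank);

    MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                            shmcomm, &base, &win);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    if (rank == 0)
        base[0] = 42;            /* plain store into the shared window */

    MPI_Win_sync(win);           /* memory barrier: publish the store   */
    MPI_Barrier(shmcomm);        /* interprocess sync: "writer is done"  */
    MPI_Win_sync(win);           /* memory barrier on the reading side   */

    if (rank == 1) {
        int *peer, disp_unit;
        MPI_Aint size;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &peer);
        printf("rank 1 loaded %d\n", peer[0]);   /* plain load */
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&shmcomm);
    MPI_Finalize();
    return 0;
}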

Rajeev


On Sep 30, 2014, at 9:35 AM, Rolf Rabenseifner <rabenseifner at hlrs.de> wrote:

> There are different ways to look at it.
> Remote memory accesses such as remote loads and stores can
> be interpreted as RMA by one reader and as non-RMA by another.
> 
> Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, 
> Ron Brightwell, William Gropp, Vivek Kale, Rajeev Thakur
> in 
> "MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared Memory" 
> at EuroMPI 2012
> obviously treated them as RMA, because otherwise their store+Fence+load would
> have been wrong.
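
A rough sketch of the store+Fence+load pattern in question, written as a fragment that reuses a window created with MPI_Win_allocate_shared as in the sketch above (the helper name is illustrative); under reading b), the middle fence itself supplies the memory barrier that makes the plain store visible to the plain load.

/* store + Fence + load: the pattern from the EuroMPI 2012 paper as Rolf
 * describes it.  Under b), MPI_Win_fence both synchronizes the processes
 * and orders the plain store and load. */
void store_fence_load(MPI_Win win, int *base, int rank)
{
    MPI_Win_fence(0, win);
    if (rank == 0)
        base[0] = 1;             /* plain store instead of MPI_Put */
    MPI_Win_fence(0, win);       /* b): acts as the memory barrier as well */
    if (rank == 1) {
        int *peer, disp_unit;
        MPI_Aint size;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &peer);
        printf("loaded %d\n", peer[0]);   /* plain load instead of MPI_Get */
    }
    MPI_Win_fence(0, win);
}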
> 
> The major reason for going with b) and not with a) is simple:
> 
> If somebody does not want shared memory
> synchronization semantics for an existing MPI_Win
> synchronization routine (e.g., MPI_Win_fence), then
> he/she need not use that routine together with
> shared memory windows.
> If he/she wants to stay with MPI_Put/Get,
> then he/she need not use MPI_Win_allocate_shared.
> It is allowed to use MPI_Win_allocate on a shared memory node,
> and MPI can internally apply the same optimizations.
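
A small sketch of that alternative (illustrative fragment, assuming rank from the usual MPI_Comm_rank call): only MPI_Put/MPI_Get touch the window, so no shared-memory load/store semantics are required, and an implementation may still use node-local shared memory internally.

/* Staying with MPI_Put/MPI_Get on a window from MPI_Win_allocate. */
int *buf, one = 1;
MPI_Win win2;
MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                 MPI_COMM_WORLD, &buf, &win2);
MPI_Win_fence(0, win2);
if (rank == 0)
    MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win2);   /* target rank 1 */
MPI_Win_fence(0, win2);          /* completes the Put at origin and target */
MPI_Win_free(&win2);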
> 
> Therefore shared memory synchronization semantics
> for MPI_Win_fence, MPI_Win_post/start/complete/wait, 
> MPI_Win_lock/unlock, ...
> are never a drawback.
> 
> But it is a clear advantage, because process-to-process
> synchronization is combined with local memory synchronization
> which may be implemented faster than if separated into
> different routines.
> 
> Rolf 
> 
> 
> ----- Original Message -----
>> From: "Jeff Hammond" <jeff.science at gmail.com>
>> To: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
>> Sent: Tuesday, September 30, 2014 4:09:02 PM
>> Subject: Re: [mpiwg-rma] MPI RMA status summary
>> 
>> Option A is what the standard says today, outside of a sloppy offhand
>> remark in parentheses. Please see my note.
>> 
>> Jeff
>> 
>> Sent from my iPhone
>> 
>>> On Sep 30, 2014, at 7:05 AM, Rolf Rabenseifner
>>> <rabenseifner at hlrs.de> wrote:
>>> 
>>> I strongly agree with your statement:
>>>> These piecemeal changes are one of the sources of our problems.
>>> 
>>> I only wanted to say clearly that I would never vote for your a),
>>> because it is not backward compatible with what is already used.
>>> And with b) I have the problem that b1) is clear to me (see #456),
>>> but the Win_flush semantics for load/store are unclear to me.
>>> 
>>> Of course, a complete solution is needed, not just parts of one.
>>> #456 is an attempt at such a complete solution.
>>> 
>>> Rolf
>>> 
>>> ----- Original Message -----
>>>> From: "William Gropp" <wgropp at illinois.edu>
>>>> To: "MPI WG Remote Memory Access working group"
>>>> <mpiwg-rma at lists.mpi-forum.org>
>>>> Sent: Tuesday, September 30, 2014 3:19:06 PM
>>>> Subject: Re: [mpiwg-rma] MPI RMA status summary
>>>> 
>>>> I disagree with this approach.  The most important thing to do is
>>>> to
>>>> figure out the correct definitions and semantics.  Once we agree
>>>> on
>>>> that, we can determine what can be handled as an errata and what
>>>> will require an update to the chapter and an update to the MPI
>>>> standard.  These piecemeal changes are one of the sources of our
>>>> problems.
>>>> 
>>>> Bill
>>>> 
>>>> On Sep 30, 2014, at 7:38 AM, Rolf Rabenseifner
>>>> <rabenseifner at hlrs.de>
>>>> wrote:
>>>> 
>>>>>> Vote for 1 of the following:
>>>>>> 
>>>>>> a) Only Win_sync provides memory barrier semantics to shared
>>>>>> memory
>>>>>> windows
>>>>>> b) All RMA completion/sync routines (e.g., MPI_Win_lock,
>>>>>> MPI_Win_fence, MPI_Win_flush) provide memory barrier semantics
>>>>>> c) Some as yet undetermined blend of a and b, which might
>>>>>> include
>>>>>> additional asserts
>>>>>> d) This topic needs further discussion
>>>>> 
>>>>> Because we only have to clarify MPI-3.0 (this is an errata issue)
>>>>> and
>>>>> - obviously the MPI Forum and the readers expected that
>>>>> MPI_Win_fence
>>>>> (and therefore also the other MPI-2 synchronizations
>>>>> MPI_Win_post/start/complete/wait and MPI_Win_lock/unlock)
>>>>> works if MPI_Get/Put are substituted by shared memory load/store
>>>>> (see the many Forum members as authors of the EuroMPI paper)
>>>>> - and the Forum decided that also MPI_Win_sync acts as if
>>>>> a memory barrier is inside,
>>>>> for me,
>>>>> - a) cannot be chosen because an erratum cannot remove
>>>>>   a given functionality
>>>>> - and b) is automatically given, see reasons above. Therefore
>>>>> #456.
>>>>> 
>>>>> The only open question for me is about the meaning of
>>>>> MPI_Win_flush.
>>>>> Therefore MPI_Win_flush is still missing in #456.
>>>>> 
>>>>> Therefore, for me, the major choices seem to be
>>>>> b1) MPI-2 synchronizations + MPI_Win_sync
>>>>> b2) MPI-2 synchronizations + MPI_Win_sync + MPI_Win_flush
>>>>> 
>>>>> For this vote, I want to see a clear proposal
>>>>> about the meaning of MPI_Win_flush together with
>>>>> shared memory load/store, hopefully with the notation
>>>>> used in #456.
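
For concreteness, the kind of fragment whose meaning is open might look like the following (illustrative; "target" is an assumed rank and the window is a shared-memory window as in the sketches above); the question is whether MPI_Win_flush also orders the plain store.

/* Plain store into the target's part of a shared-memory window, followed
 * by MPI_Win_flush.  Whether the flush acts as a memory barrier for the
 * store, or only completes RMA operations, is the open question. */
int *target_ptr, disp_unit;
MPI_Aint size;
MPI_Win_shared_query(win, target, &size, &disp_unit, &target_ptr);

MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
target_ptr[0] = 42;              /* plain store instead of MPI_Put */
MPI_Win_flush(target, win);      /* memory barrier here, or not?   */
MPI_Win_unlock(target, win);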
>>>>> 
>>>>> Best regards
>>>>> Rolf
>>>>> 
>>>>> ----- Original Message -----
>>>>>> From: "William Gropp" <wgropp at illinois.edu>
>>>>>> To: "MPI WG Remote Memory Access working group"
>>>>>> <mpiwg-rma at lists.mpi-forum.org>
>>>>>> Sent: Monday, September 29, 2014 11:39:51 PM
>>>>>> Subject: Re: [mpiwg-rma] MPI RMA status summary
>>>>>> 
>>>>>> 
>>>>>> Thanks, Jeff.
>>>>>> 
>>>>>> 
>>>>>> I agree that I don’t want load/store to be considered RMA
>>>>>> operations.
>>>>>> But the issue of the memory consistency on RMA synchronization
>>>>>> and
>>>>>> completion operations to a shared memory window is complex.  In
>>>>>> some
>>>>>> ways, the most consistent with RMA in other situations is the
>>>>>> case
>>>>>> of MPI_Win_lock to your own process; the easiest extension for
>>>>>> the
>>>>>> user is to have reasonably strong memory barrier semantics at
>>>>>> all
>>>>>> sync/completion operations (thus including Fence).  As you note,
>>>>>> this has costs.  At the other extreme, we could say that only
>>>>>> Win_sync provides these memory barrier semantics.  And we could
>>>>>> pick
>>>>>> a more complex blend (yes for some, no for others).
>>>>>> 
>>>>>> 
>>>>>> One of the first questions is whether we want only Win_sync,
>>>>>> all
>>>>>> completion/sync RMA routines, or some subset to provide memory
>>>>>> barrier semantics for shared memory windows (this would include
>>>>>> RMA
>>>>>> windows that claimed to be shared memory, since there is a
>>>>>> proposal
>>>>>> to extend that property to other RMA windows).  It would be good
>>>>>> to
>>>>>> make progress on this question, so I propose a straw vote of
>>>>>> this
>>>>>> group by email.  Vote for 1 of the following:
>>>>>> 
>>>>>> 
>>>>>> a) Only Win_sync provides memory barrier semantics to shared
>>>>>> memory
>>>>>> windows
>>>>>> b) All RMA completion/sync routines (e.g., MPI_Win_lock,
>>>>>> MPI_Win_fence, MPI_Win_flush) provide memory barrier semantics
>>>>>> c) Some as yet undetermined blend of a and b, which might
>>>>>> include
>>>>>> additional asserts
>>>>>> d) This topic needs further discussion
>>>>>> 
>>>>>> 
>>>>>> Note that I’ve left off what “memory barrier semantics” means.
>>>>>> That
>>>>>> will need to be precisely defined for the standard, but I
>>>>>> believe
>>>>>> we
>>>>>> know what we intend for this.  We specifically are not defining
>>>>>> what
>>>>>> happens with non-MPI code.  Also note that this is separate from
>>>>>> whether the RMA sync routines appear to be blocking when applied
>>>>>> to
>>>>>> a shared memory window; we can do a separate straw vote on that.
>>>>>> 
>>>>>> 
>>>>>> Bill
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sep 29, 2014, at 3:49 PM, Jeff Hammond <jeff.science at gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Sep 29, 2014 at 9:16 AM, Rolf Rabenseifner
>>>>>> <rabenseifner at hlrs.de> wrote:
>>>>>> 
>>>>>> 
>>>>>> Only about the issues in #456 (shared memory synchronization):
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> For the ones requiring discussion, assign someone to organize a
>>>>>> position and discussion.  We can schedule telecons to go over
>>>>>> those
>>>>>> issues.  The first item in the list is certainly in this class.
>>>>>> 
>>>>>> Who can organize telecons on #456?
>>>>>> Would it be possible to organize an RMA meeting at SC?
>>>>>> 
>>>>>> I will be there Monday through part of Thursday but am usually
>>>>>> triple-booked from 8 AM to midnight.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> The position expressed by the solution in #456 is based on the idea
>>>>>> that the MPI RMA synchronization routines should have the same
>>>>>> outcome when RMA PUT and GET calls are substituted by stores and
>>>>>> loads.
>>>>>> 
>>>>>> The outcome for the flush routines is still not defined.
>>>>>> 
>>>>>> It is interesting, because the standard is actually conflicting
>>>>>> on
>>>>>> whether Flush affects load-store.  I find this incredibly
>>>>>> frustrating.
>>>>>> 
>>>>>> Page 450:
>>>>>> 
>>>>>> "Locally completes at the origin all outstanding RMA operations
>>>>>> initiated by the calling process to the target process specified
>>>>>> by
>>>>>> rank on the specified window. For example, after this routine
>>>>>> completes, the user may reuse any buffers provided to put, get,
>>>>>> or
>>>>>> accumulate operations."
>>>>>> 
>>>>>> I do not think "RMA operations" includes load-store.
>>>>>> 
>>>>>> Page 410:
>>>>>> 
>>>>>> "The consistency of load/store accesses from/to the shared
>>>>>> memory
>>>>>> as
>>>>>> observed by the user program depends on the architecture. A
>>>>>> consistent
>>>>>> view can be created in the unified memory model (see Section
>>>>>> 11.4)
>>>>>> by
>>>>>> utilizing the window synchronization functions (see Section
>>>>>> 11.5)
>>>>>> or
>>>>>> explicitly completing outstanding store accesses (e.g., by
>>>>>> calling
>>>>>> MPI_WIN_FLUSH)."
>>>>>> 
>>>>>> Here it is unambiguously implied that MPI_WIN_FLUSH affects
>>>>>> load-stores.
>>>>>> 
>>>>>> My preference is to fix the statement on 410 since it is less
>>>>>> canonical than the one on 450, and because I do not want to have
>>>>>> a
>>>>>> memory barrier in every call to WIN_FLUSH.
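
Under that preferred reading, the division of labor would roughly be the following (illustrative fragment; val, target, base, and win are assumed from context): MPI_Win_flush completes RMA operations only, and MPI_Win_sync is what orders plain stores.

/* Flush completes Put/Get/Accumulate; plain stores are ordered by
 * MPI_Win_sync, so flush need not imply a full memory barrier. */
MPI_Put(&val, 1, MPI_INT, target, 0, 1, MPI_INT, win);
MPI_Win_flush(target, win);      /* completes the Put; no barrier implied */

base[0] = val;                   /* plain store into the shared window */
MPI_Win_sync(win);               /* memory barrier ordering the store   */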
>>>>>> 
>>>>>> Jeff
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I would prefer that the discussion be organized by someone inside
>>>>>> the RMA subgroup that proposed the changes for MPI-3.1,
>>>>>> rather than by me.
>>>>>> I tried to bring all the input together and hope that #456
>>>>>> is now in a state where it is consistent with itself and with the
>>>>>> expectations expressed by the group that published the
>>>>>> EuroMPI paper on the first usage of this shared memory interface.
>>>>>> 
>>>>>> The ticket is (with the help of recent C11 standardization)
>>>>>> well on its way to also being consistent with compiler
>>>>>> optimizations - in other words, the C standardization body has
>>>>>> learnt from the pthreads problems. Fortran is still an open
>>>>>> question to me,
>>>>>> i.e., I do not know the status, see
>>>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/456#comment:13
>>>>>> 
>>>>>> Best regards
>>>>>> Rolf
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>> 
>>>>>> 
>>>>>> From: "William Gropp" <wgropp at illinois.edu>
>>>>>> To: "MPI WG Remote Memory Access working group"
>>>>>> <mpiwg-rma at lists.mpi-forum.org>
>>>>>> Sent: Thursday, September 25, 2014 4:19:14 PM
>>>>>> Subject: [mpiwg-rma] MPI RMA status summary
>>>>>> 
>>>>>> I looked through all of the tickets and wrote a summary of the
>>>>>> open
>>>>>> issues, which I’ve attached.  I propose the following:
>>>>>> 
>>>>>> Determine which of these issues can be resolved by email.  A
>>>>>> significant number can probably be closed with no further
>>>>>> action.
>>>>>> 
>>>>>> For those requiring rework, determine if there is still interest
>>>>>> in
>>>>>> them, and if not, close them as well.
>>>>>> 
>>>>>> For the ones requiring discussion, assign someone to organize a
>>>>>> position and discussion.  We can schedule telecons to go over
>>>>>> those
>>>>>> issues.  The first item in the list is certainly in this class.
>>>>>> 
>>>>>> Comments?
>>>>>> 
>>>>>> Bill
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Jeff Hammond
>>>>>> jeff.science at gmail.com
>>>>>> http://jeffhammond.github.io/
>>>>> 
>>>> 
>>> 
> 
> -- 
> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
> Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)
> _______________________________________________
> mpiwg-rma mailing list
> mpiwg-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma



