[mpiwg-rma] MPI RMA status summary

William Gropp wgropp at illinois.edu
Tue Sep 30 10:02:52 CDT 2014


The current straw vote question is only about memory synchronization.  We are not discussing whether the RMA routines provide interprocess synchronization when applied to shared memory windows.  That is a separate question that we will discuss later.

Bill

On Sep 30, 2014, at 9:35 AM, Rolf Rabenseifner <rabenseifner at hlrs.de> wrote:

> There are different ways to look at it.
> Remote memory access like remote load and store can 
> be interpreted as RMA by one reader and non-RMA by another one.
> 
> Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, 
> Ron Brightwell, William Gropp, Vivek Kale, Rajeev Thakur
> in 
> "MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared Memory" 
> at EuroMPI 2012
> obviously treated them as RMA, because otherwise their store+Fence+load would
> have been wrong.
> 
> The major reason for going with b) and not with a) is simple:
> 
> If somebody does not want shared memory
> synchronization semantics for an existing MPI_Win
> synchronization routine (e.g. MPI_Win_fence), then
> he/she need not use that routine together with
> shared memory windows.
> If he/she wants to stay with MPI_Put/Get,
> then he/she need not use MPI_Win_allocate_shared.
> It is allowed to use MPI_Win_allocate on a shared memory node,
> and MPI can internally do the same optimizations.
> 
> Therefore, shared memory synchronization semantics
> for MPI_Win_fence, MPI_Win_post/start/complete/wait,
> MPI_Win_lock/unlock, ...
> are never a drawback.
> 
> They are in fact a clear advantage, because process-to-process
> synchronization is combined with local memory synchronization,
> which may be implemented faster than if separated into
> different routines.
> 
> Rolf 
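
For illustration, a minimal sketch (not taken from the EuroMPI paper) of the store + Fence + load pattern referred to above; it assumes the unified memory model and interpretation b), i.e. that MPI_Win_fence orders direct load/store accesses to a shared memory window just as it orders MPI_Put/MPI_Get:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Communicator of processes that can share memory */
        MPI_Comm shmcomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &shmcomm);

        int rank, size;
        MPI_Comm_rank(shmcomm, &rank);
        MPI_Comm_size(shmcomm, &size);

        int *base;
        MPI_Win win;
        MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                                shmcomm, &base, &win);

        MPI_Win_fence(0, win);
        *base = rank;              /* direct store instead of MPI_Put        */
        MPI_Win_fence(0, win);     /* under b), this makes the store visible */

        /* Direct load from the right neighbor's window segment */
        int *peer;
        MPI_Aint seg;
        int disp;
        MPI_Win_shared_query(win, (rank + 1) % size, &seg, &disp, &peer);
        printf("rank %d sees %d\n", rank, *peer);

        MPI_Win_fence(0, win);
        MPI_Win_free(&win);
        MPI_Comm_free(&shmcomm);
        MPI_Finalize();
        return 0;
    }
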
> 
> 
> ----- Original Message -----
>> From: "Jeff Hammond" <jeff.science at gmail.com>
>> To: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
>> Sent: Tuesday, September 30, 2014 4:09:02 PM
>> Subject: Re: [mpiwg-rma] MPI RMA status summary
>> 
>> Option A is what the standard says today, outside of a sloppy offhand
>> remark in parentheses. Please see my note.
>> 
>> Jeff
>> 
>> Sent from my iPhone
>> 
>>> On Sep 30, 2014, at 7:05 AM, Rolf Rabenseifner
>>> <rabenseifner at hlrs.de> wrote:
>>> 
>>> I strongly agree with your statement:
>>>> These piecemeal changes are one of the sources of our problems.
>>> 
>>> I only wanted to say strongly that I would never vote for your a),
>>> because it is not backward compatible with what is already used.
>>> And with b) I have the problem that b1) is clear to me (see #456),
>>> but the Win_flush semantics for load/store are unclear to me.
>>> 
>>> Of course, a total solution is needed, not just parts of it.
>>> #456 is such an attempt at a complete solution.
>>> 
>>> Rolf
>>> 
>>> ----- Original Message -----
>>>> From: "William Gropp" <wgropp at illinois.edu>
>>>> To: "MPI WG Remote Memory Access working group"
>>>> <mpiwg-rma at lists.mpi-forum.org>
>>>> Sent: Tuesday, September 30, 2014 3:19:06 PM
>>>> Subject: Re: [mpiwg-rma] MPI RMA status summary
>>>> 
>>>> I disagree with this approach.  The most important thing to do is
>>>> to
>>>> figure out the correct definitions and semantics.  Once we agree
>>>> on
>>>> that, we can determine what can be handled as an errata and what
>>>> will require an update to the chapter and an update to the MPI
>>>> standard.  These piecemeal changes are one of the sources of our
>>>> problems.
>>>> 
>>>> Bill
>>>> 
>>>> On Sep 30, 2014, at 7:38 AM, Rolf Rabenseifner
>>>> <rabenseifner at hlrs.de>
>>>> wrote:
>>>> 
>>>>>> Vote for 1 of the following:
>>>>>> 
>>>>>> a) Only Win_sync provides memory barrier semantics to shared
>>>>>> memory
>>>>>> windows
>>>>>> b) All RMA completion/sync routines (e.g., MPI_Win_lock,
>>>>>> MPI_Win_fence, MPI_Win_flush) provide memory barrier semantics
>>>>>> c) Some as yet undetermined blend of a and b, which might
>>>>>> include
>>>>>> additional asserts
>>>>>> d) This topic needs further discussion
>>>>> 
>>>>> Because we only have to clarify MPI-3.0 (this is an errata issue)
>>>>> and
>>>>> - obviously the MPI Forum and the readers expected that
>>>>> MPI_Win_fence
>>>>> (and therefore also the other MPI-2 synchronizations
>>>>> MPI_Win_post/start/complete/wait and MPI_Win_lock/unlock)
>>>>> works if MPI_Get/Put are substituted by shared memory load/store
>>>>> (see the many Forum members among the authors of the EuroMPI paper)
>>>>> - and the Forum decided that MPI_Win_sync also acts as if
>>>>> it contained a memory barrier,
>>>>> for me,
>>>>> - a) cannot be chosen because an erratum cannot remove
>>>>>   existing functionality
>>>>> - and b) follows automatically, see the reasons above. Therefore
>>>>> #456.
>>>>> 
>>>>> The only open question for me is the meaning of
>>>>> MPI_Win_flush.
>>>>> That is why MPI_Win_flush is still missing in #456.
>>>>> 
>>>>> Therefore, for me the major choices seem to be
>>>>> b1) MPI-2 synchronizations + MPI_Win_sync
>>>>> b2) MPI-2 synchronizations + MPI_Win_sync + MPI_Win_flush
>>>>> 
>>>>> For this vote, I would like to see a clear proposal
>>>>> for the meaning of MPI_Win_flush together with
>>>>> shared memory load/store, hopefully with the notation
>>>>> used in #456.
>>>>> 
>>>>> Best regards
>>>>> Rolf
>>>>> 
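
For concreteness, a hedged sketch (not a proposal) of the kind of code whose meaning is open in the b1)/b2) choice above; 'win' is assumed to come from MPI_Win_allocate_shared, 'my_flag' to point into its local segment, and 'peer_rank' to be another process on the node:

    #include <mpi.h>

    void publish(MPI_Win win, volatile int *my_flag, int peer_rank)
    {
        MPI_Win_lock_all(0, win);

        *my_flag = 1;                  /* direct store, not MPI_Put */

        /* b2) would require this call to act as a memory barrier so that
         * the store above becomes visible; under b1) only MPI_Win_sync
         * (or another synchronization call) would give that guarantee. */
        MPI_Win_flush(peer_rank, win);

        MPI_Win_unlock_all(win);
    }
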
>>>>> ----- Original Message -----
>>>>>> From: "William Gropp" <wgropp at illinois.edu>
>>>>>> To: "MPI WG Remote Memory Access working group"
>>>>>> <mpiwg-rma at lists.mpi-forum.org>
>>>>>> Sent: Monday, September 29, 2014 11:39:51 PM
>>>>>> Subject: Re: [mpiwg-rma] MPI RMA status summary
>>>>>> 
>>>>>> 
>>>>>> Thanks, Jeff.
>>>>>> 
>>>>>> 
>>>>>> I agree that I don’t want load/store to be considered RMA
>>>>>> operations.
>>>>>> But the issue of memory consistency for RMA synchronization and
>>>>>> completion operations applied to a shared memory window is complex.
>>>>>> In some ways, the option most consistent with RMA in other
>>>>>> situations is the case of MPI_Win_lock to your own process; the
>>>>>> easiest extension for the user is to have reasonably strong memory
>>>>>> barrier semantics at all sync/completion operations (thus including
>>>>>> Fence).  As you note, this has costs.  At the other extreme, we
>>>>>> could say that only Win_sync provides these memory barrier
>>>>>> semantics.  And we could pick a more complex blend (yes for some,
>>>>>> no for others).
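
For concreteness, a hedged sketch of the "MPI_Win_lock to your own process" case mentioned above; 'win', 'my_base', and 'me' are assumed names, and the comment describes the interpretation under discussion, not settled semantics:

    #include <mpi.h>

    /* 'win' is a shared memory window, 'my_base' points into the caller's
     * own segment of it, and 'me' is the caller's rank in the window's
     * communicator. */
    void update_own_segment(MPI_Win win, int *my_base, int me, int new_value)
    {
        /* Locking one's own rank opens an access epoch; the unlock
         * completes it and, under the stronger interpretation, would also
         * make the preceding direct store visible to the other processes
         * on the node. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, me, 0, win);
        my_base[0] = new_value;          /* direct store into the window */
        MPI_Win_unlock(me, win);
    }
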
>>>>>> 
>>>>>> 
>>>>>> One of the first questions is whether we want only Win_sync, all
>>>>>> completion/sync RMA routines, or some subset to provide memory
>>>>>> barrier semantics for shared memory windows (this would include RMA
>>>>>> windows that claimed to be shared memory, since there is a proposal
>>>>>> to extend that property to other RMA windows).  It would be good to
>>>>>> make progress on this question, so I propose a straw vote of this
>>>>>> group by email.  Vote for 1 of the following:
>>>>>> 
>>>>>> 
>>>>>> a) Only Win_sync provides memory barrier semantics to shared
>>>>>> memory
>>>>>> windows
>>>>>> b) All RMA completion/sync routines (e.g., MPI_Win_lock,
>>>>>> MPI_Win_fence, MPI_Win_flush) provide memory barrier semantics
>>>>>> c) Some as yet undetermined blend of a and b, which might
>>>>>> include
>>>>>> additional asserts
>>>>>> d) This topic needs further discussion
>>>>>> 
>>>>>> 
>>>>>> Note that I’ve left off what “memory barrier semantics” means.
>>>>>> That will need to be precisely defined for the standard, but I
>>>>>> believe we know what we intend for this.  We specifically are not
>>>>>> defining what happens with non-MPI code.  Also note that this is
>>>>>> separate from whether the RMA sync routines appear to be blocking
>>>>>> when applied to a shared memory window; we can do a separate straw
>>>>>> vote on that.
>>>>>> 
>>>>>> 
>>>>>> Bill
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Sep 29, 2014, at 3:49 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Sep 29, 2014 at 9:16 AM, Rolf Rabenseifner <rabenseifner at hlrs.de> wrote:
>>>>>> 
>>>>>> 
>>>>>> Only about the issues on #456 (shared memory synchronization):
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> For the ones requiring discussion, assign someone to organize a
>>>>>> position and discussion.  We can schedule telecons to go over
>>>>>> those
>>>>>> issues.  The first item in the list is certainly in this class.
>>>>>> 
>>>>>> Who can organize telecons on #456?
>>>>>> Would it be possible to organize an RMA meeting at SC?
>>>>>> 
>>>>>> I will be there Monday through part of Thursday but am usually
>>>>>> triple-booked from 8 AM to midnight.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> The position expressed by the solution #456 is based on the idea
>>>>>> that the MPI RMA synchronization routines should have the same
>>>>>> outcome when RMA PUT and GET calls are substituted by stores and
>>>>>> loads.
>>>>>> 
>>>>>> The outcome for the flush routines is still not defined.
>>>>>> 
>>>>>> It is interesting, because the standard actually contradicts itself
>>>>>> on whether Flush affects load-store.  I find this incredibly
>>>>>> frustrating.
>>>>>> 
>>>>>> Page 450:
>>>>>> 
>>>>>> "Locally completes at the origin all outstanding RMA operations
>>>>>> initiated by the calling process to the target process specified
>>>>>> by
>>>>>> rank on the specified window. For example, after this routine
>>>>>> completes, the user may reuse any buffers provided to put, get,
>>>>>> or
>>>>>> accumulate operations."
>>>>>> 
>>>>>> I do not think "RMA operations" includes load-store.
>>>>>> 
>>>>>> Page 410:
>>>>>> 
>>>>>> "The consistency of load/store accesses from/to the shared
>>>>>> memory
>>>>>> as
>>>>>> observed by the user program depends on the architecture. A
>>>>>> consistent
>>>>>> view can be created in the unified memory model (see Section
>>>>>> 11.4)
>>>>>> by
>>>>>> utilizing the window synchronization functions (see Section
>>>>>> 11.5)
>>>>>> or
>>>>>> explicitly completing outstanding store accesses (e.g., by
>>>>>> calling
>>>>>> MPI_WIN_FLUSH)."
>>>>>> 
>>>>>> Here it is unambiguously implied that MPI_WIN_FLUSH affects
>>>>>> load-stores.
>>>>>> 
>>>>>> My preference is to fix the statement on page 410, since it is less
>>>>>> canonical than the one on page 450, and because I do not want to
>>>>>> have a memory barrier in every call to WIN_FLUSH.
>>>>>> 
>>>>>> Jeff
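
A hedged sketch of the Win_sync-based alternative this preference implies (keep WIN_FLUSH free of memory barriers and rely on MPI_WIN_SYNC plus a separate process synchronization instead); all names are assumptions:

    #include <mpi.h>

    /* 'win' is a shared memory window, 'shmcomm' its communicator,
     * 'my_val' points into my segment, 'peer_val' into a peer's segment
     * (obtained earlier with MPI_Win_shared_query). */
    int exchange(MPI_Win win, MPI_Comm shmcomm, int *my_val, int *peer_val)
    {
        MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

        *my_val = 42;                 /* direct store                      */
        MPI_Win_sync(win);            /* order my store (memory barrier)   */
        MPI_Barrier(shmcomm);         /* process-to-process synchronization */
        MPI_Win_sync(win);            /* order the subsequent load         */
        int seen = *peer_val;         /* direct load of the peer's store   */

        MPI_Win_unlock_all(win);
        return seen;
    }
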
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I would prefer that the discussion be organized by someone inside
>>>>>> the RMA subgroup that proposed the changes for MPI-3.1,
>>>>>> rather than me being the organizer.
>>>>>> I tried to bring all the input together and hope that #456
>>>>>> is now in a state where it is consistent with itself and with the
>>>>>> expectations expressed by the group that published the
>>>>>> paper at EuroMPI on the first usage of this shared memory interface.
>>>>>> 
>>>>>> The ticket is (with the help of recent C11 standardization)
>>>>>> well on its way to also being consistent with compiler
>>>>>> optimizations;
>>>>>> in other words, the C standardization body has learned from the
>>>>>> pthreads problems. Fortran is still an open question to me,
>>>>>> i.e., I do not know its status, see
>>>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/456#comment:13
>>>>>> 
>>>>>> Best regards
>>>>>> Rolf
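
Presumably this refers to the memory model and fences added in C11; a minimal illustration (names are assumptions) of the kind of language-level ordering that is now available:

    #include <stdatomic.h>

    /* With C11, the language defines fences and atomics, so the required
     * ordering of shared-memory communication can be stated in terms the
     * compiler must respect, instead of the informal pthreads-era
     * assumptions. */
    void c11_style_publish(int *shared_slot, atomic_int *flag)
    {
        *shared_slot = 42;                              /* plain store        */
        atomic_thread_fence(memory_order_release);      /* compiler+CPU fence */
        atomic_store_explicit(flag, 1, memory_order_relaxed);
    }
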
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ----- Original Message -----
>>>>>> 
>>>>>> 
>>>>>> From: "William Gropp" <wgropp at illinois.edu>
>>>>>> To: "MPI WG Remote Memory Access working group"
>>>>>> <mpiwg-rma at lists.mpi-forum.org>
>>>>>> Sent: Thursday, September 25, 2014 4:19:14 PM
>>>>>> Subject: [mpiwg-rma] MPI RMA status summary
>>>>>> 
>>>>>> I looked through all of the tickets and wrote a summary of the
>>>>>> open
>>>>>> issues, which I’ve attached.  I propose the following:
>>>>>> 
>>>>>> Determine which of these issues can be resolved by email.  A
>>>>>> significant number can probably be closed with no further
>>>>>> action.
>>>>>> 
>>>>>> For those requiring rework, determine if there is still interest
>>>>>> in
>>>>>> them, and if not, close them as well.
>>>>>> 
>>>>>> For the ones requiring discussion, assign someone to organize a
>>>>>> position and discussion.  We can schedule telecons to go over
>>>>>> those
>>>>>> issues.  The first item in the list is certainly in this class.
>>>>>> 
>>>>>> Comments?
>>>>>> 
>>>>>> Bill
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Jeff Hammond
>>>>>> jeff.science at gmail.com
>>>>>> http://jeffhammond.github.io/
>>>>> 
>>>> 
>>> 
> 
> -- 
> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
> Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)
> _______________________________________________
> mpiwg-rma mailing list
> mpiwg-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma



