[mpiwg-rma] MPI RMA status summary

Jeff Hammond jeff.science at gmail.com
Tue Sep 30 11:04:19 CDT 2014


If we require a memory barrier in Flush_local, then any implementation
of SHMEM over MPI-3 is going to suck and we have failed in one of the
goals of MPI-3 RMA.

See https://github.com/jeffhammond/oshmpi/blob/master/src/shmem-internals.c#L517
for why.
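
As a rough sketch of what I mean (simplified, with made-up names; see the
link above for the actual OSHMPI code), a blocking SHMEM put over MPI-3
RMA looks something like this:

  #include <stddef.h>
  #include <mpi.h>

  extern MPI_Win  shmem_win;   /* hypothetical: symmetric-heap window       */
  extern void    *shmem_base;  /* hypothetical: local base of that heap     */

  /* assumes MPI_Win_lock_all(0, shmem_win) was called at startup */
  void shmem_putmem(void *target, const void *source, size_t len, int pe)
  {
      MPI_Aint disp = (MPI_Aint)((char *)target - (char *)shmem_base);
      MPI_Put(source, (int)len, MPI_BYTE, pe, disp, (int)len, MPI_BYTE,
              shmem_win);
      /* local completion only: the source buffer may be reused after this */
      MPI_Win_flush_local(pe, shmem_win);
  }

If Flush_local has to imply a full memory barrier, every fine-grained put
like this pays for a barrier it does not need.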

Jeff

On Tue, Sep 30, 2014 at 7:35 AM, Rolf Rabenseifner <rabenseifner at hlrs.de> wrote:
> There are different ways to look at it.
> Remote memory accesses such as remote loads and stores can
> be interpreted as RMA by one reader and as non-RMA by another.
>
> Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett,
> Ron Brightwell, William Gropp, Vivek Kale, Rajeev Thakur
> in
>  "MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory"
>  at EuroMPI 2012
> obviously treated them as RMA, because otherwise their store+Fence+load would
> have been wrong.
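>
> As a minimal sketch (my notation, not taken from the paper), the pattern
> in question on a shared memory window is roughly
>
>   shm[0] = myvalue;         /* plain store instead of MPI_Put        */
>   MPI_Win_fence(0, win);    /* must also act as a memory barrier     */
>   recv = nbr[0];            /* plain load instead of MPI_Get         */
>
> where shm is the local part of the window from MPI_Win_allocate_shared
> and nbr points to a neighbor's part obtained via MPI_Win_shared_query.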
>
> The major reason for going with b) and not with a) is simple:
>
> If somebody does not want shared memory
> synchronization semantics for an existing MPI_Win
> synchronization routine (e.g. MPI_Win_fence), then
> he/she need not use that routine together with
> shared memory windows.
> If he/she wants to stay with MPI_Put/Get,
> then he/she need not use MPI_Win_allocate_shared.
> It is allowed to use MPI_Win_allocate on a shared memory node,
> and MPI internally can do the same optimizations.
>
> Therefore, shared memory synchronization semantics
> for MPI_Win_fence, MPI_Win_post/start/complete/wait,
> MPI_Win_lock/unlock, ...
> are never a drawback.
>
> But they are a clear advantage, because process-to-process
> synchronization is combined with local memory synchronization,
> which may be implemented faster than if separated into
> different routines.
>
> Rolf
>
>
> ----- Original Message -----
>> From: "Jeff Hammond" <jeff.science at gmail.com>
>> To: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
>> Sent: Tuesday, September 30, 2014 4:09:02 PM
>> Subject: Re: [mpiwg-rma] MPI RMA status summary
>>
>> Option A is what the standard says today outside of a sloppy offhand
>> remark in parentheses. See my note please.
>>
>> Jeff
>>
>> Sent from my iPhone
>>
>> > On Sep 30, 2014, at 7:05 AM, Rolf Rabenseifner
>> > <rabenseifner at hlrs.de> wrote:
>> >
>> > I strongly agree with your statement:
>> >> These piecemeal changes are one of the sources of our problems.
>> >
>> > I only wanted to say strongly that I would never vote for your a),
>> > because it is not backward compatible with what is already used.
>> > And with b) I have the problem that b1) is clear to me (see #456),
>> > but the Win_flush semantics for load/store are unclear to me.
>> >
>> > Of course, a total solution is needed and not parts of it.
>> > #456 is such a trial for a complete solution.
>> >
>> > Rolf
>> >
>> > ----- Original Message -----
>> >> From: "William Gropp" <wgropp at illinois.edu>
>> >> To: "MPI WG Remote Memory Access working group"
>> >> <mpiwg-rma at lists.mpi-forum.org>
>> >> Sent: Tuesday, September 30, 2014 3:19:06 PM
>> >> Subject: Re: [mpiwg-rma] MPI RMA status summary
>> >>
>> >> I disagree with this approach.  The most important thing to do is
>> >> to
>> >> figure out the correct definitions and semantics.  Once we agree
>> >> on
>> >> that, we can determine what can be handled as an errata and what
>> >> will require an update to the chapter and an update to the MPI
>> >> standard.  These piecemeal changes are one of the sources of our
>> >> problems.
>> >>
>> >> Bill
>> >>
>> >> On Sep 30, 2014, at 7:38 AM, Rolf Rabenseifner
>> >> <rabenseifner at hlrs.de>
>> >> wrote:
>> >>
>> >>>> Vote for 1 of the following:
>> >>>>
>> >>>> a) Only Win_sync provides memory barrier semantics to shared
>> >>>> memory
>> >>>> windows
>> >>>> b) All RMA completion/sync routines (e.g., MPI_Win_lock,
>> >>>> MPI_Win_fence, MPI_Win_flush) provide memory barrier semantics
>> >>>> c) Some as yet undetermined blend of a and b, which might
>> >>>> include
>> >>>> additional asserts
>> >>>> d) This topic needs further discussion
>> >>>
>> >>> Because we only have to clarify MPI-3.0 (this is an errata issue)
>> >>> and
>> >>> - obviously the MPI Forum and the readers expected that
>> >>> MPI_Win_fence
>> >>> (and therefore also the other MPI-2 synchronizations
>> >>> MPI_Win_post/start/complete/wait and MPI_Win_lock/unlock)
>> >>> works if MPI_Get/Put are substituted by shared memory load/store
>> >>> (see the many Forum members among the authors of the EuroMPI paper)
>> >>> - and the Forum decided that MPI_Win_sync also acts as if
>> >>> a memory barrier were inside,
>> >>> for me,
>> >>> - a) cannot be chosen because an erratum cannot remove
>> >>>    a given functionality
>> >>> - and b) is automatically given, see reasons above. Therefore
>> >>> #456.
>> >>>
>> >>> The only open question for me is about the meaning of
>> >>> MPI_Win_flush.
>> >>> Therefore MPI_Win_flush is still missing in #456.
>> >>>
>> >>> Therefore, for me, the major choices seem to be
>> >>> b1) MPI-2 synchronizations + MPI_Win_sync
>> >>> b2) MPI-2 synchronizations + MPI_Win_sync + MPI_Win_flush
>> >>>
>> >>> For this vote, I want to see a clear proposal
>> >>> about the meaning of MPI_Win_flush together with
>> >>> shared memory load/store, hopefully with the notation
>> >>> used in #456.
>> >>>
>> >>> Best regards
>> >>> Rolf
>> >>>
>> >>> ----- Original Message -----
>> >>>> From: "William Gropp" <wgropp at illinois.edu>
>> >>>> To: "MPI WG Remote Memory Access working group"
>> >>>> <mpiwg-rma at lists.mpi-forum.org>
>> >>>> Sent: Monday, September 29, 2014 11:39:51 PM
>> >>>> Subject: Re: [mpiwg-rma] MPI RMA status summary
>> >>>>
>> >>>>
>> >>>> Thanks, Jeff.
>> >>>>
>> >>>>
>> >>>> I agree that I don’t want load/store to be considered RMA
>> >>>> operations.
>> >>>> But the issue of the memory consistency on RMA synchronization
>> >>>> and
>> >>>> completion operations to a shared memory window is complex.  In
>> >>>> some
>> >>>> ways, the most consistent with RMA in other situations is the
>> >>>> case
>> >>>> of MPI_Win_lock to your own process; the easiest extension for
>> >>>> the
>> >>>> user is to have reasonably strong memory barrier semantics at
>> >>>> all
>> >>>> sync/completion operations (thus including Fence).  As you note,
>> >>>> this has costs.  At the other extreme, we could say that only
>> >>>> Win_sync provides these memory barrier semantics.  And we could
>> >>>> pick
>> >>>> a more complex blend (yes for some, no for others).
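>> >>>>
>> >>>> (For concreteness, a minimal sketch of the lock-to-self case I
>> >>>> mean, with made-up variable names, on a shared memory window:
>> >>>>
>> >>>>   MPI_Win_lock(MPI_LOCK_EXCLUSIVE, myrank, 0, win);
>> >>>>   shm[0] = value;     /* plain store to my part of the window */
>> >>>>   MPI_Win_unlock(myrank, win);
>> >>>>
>> >>>> where the lock/unlock pair around the store would provide the
>> >>>> memory ordering, analogous to how it completes RMA operations.)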
>> >>>>
>> >>>>
>> >>>> One of the first questions is whether we want only Win_sync,
>> >>>> all
>> >>>> completion/sync RMA routines, or some subset to provide memory
>> >>>> barrier semantics for shared memory windows (this would include
>> >>>> RMA
>> >>>> windows that claimed to be shared memory, since there is a
>> >>>> proposal
>> >>>> to extend that property to other RMA windows).  It would be good
>> >>>> to
>> >>>> make progress on this question, so I propose a straw vote of
>> >>>> this
>> >>>> group by email.  Vote for 1 of the following:
>> >>>>
>> >>>>
>> >>>> a) Only Win_sync provides memory barrier semantics to shared
>> >>>> memory
>> >>>> windows
>> >>>> b) All RMA completion/sync routines (e.g., MPI_Win_lock,
>> >>>> MPI_Win_fence, MPI_Win_flush) provide memory barrier semantics
>> >>>> c) Some as yet undetermined blend of a and b, which might
>> >>>> include
>> >>>> additional asserts
>> >>>> d) This topic needs further discussion
>> >>>>
>> >>>>
>> >>>> Note that I’ve left off what “memory barrier semantics” means.
>> >>>> That
>> >>>> will need to be precisely defined for the standard, but I
>> >>>> believe
>> >>>> we
>> >>>> know what we intend for this.  We specifically are not defining
>> >>>> what
>> >>>> happens with non-MPI code.  Also note that this is separate from
>> >>>> whether the RMA sync routines appear to be blocking when applied
>> >>>> to
>> >>>> a shared memory window; we can do a separate straw vote on that.
>> >>>>
>> >>>>
>> >>>> Bill
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Sep 29, 2014, at 3:49 PM, Jeff Hammond
>> >>>> <jeff.science at gmail.com> wrote:
>> >>>>
>> >>>>
>> >>>> On Mon, Sep 29, 2014 at 9:16 AM, Rolf Rabenseifner
>> >>>> <rabenseifner at hlrs.de> wrote:
>> >>>>
>> >>>>
>> >>>> Only about the issues on #456 (shared memory synchronization):
>> >>>>
>> >>>>
>> >>>>
>> >>>> For the ones requiring discussion, assign someone to organize a
>> >>>> position and discussion.  We can schedule telecons to go over
>> >>>> those
>> >>>> issues.  The first item in the list is certainly in this class.
>> >>>>
>> >>>> Who can organize telecons on #456?
>> >>>> Would it be possible to organize an RMA meeting at SC?
>> >>>>
>> >>>> I will be there Monday through part of Thursday but am usually
>> >>>> triple-booked from 8 AM to midnight.
>> >>>>
>> >>>>
>> >>>>
>> >>>> The position expressed by the solution #456 is based on the idea
>> >>>> that the MPI RMA synchronization routines should have the same
>> >>>> outcome when RMA PUT and GET calls are substituted by stores and
>> >>>> loads.
>> >>>>
>> >>>> The outcome for the flush routines is still not defined.
>> >>>>
>> >>>> It is interesting, because the standard actually contradicts itself
>> >>>> on whether Flush affects load-store.  I find this incredibly
>> >>>> frustrating.
>> >>>>
>> >>>> Page 450:
>> >>>>
>> >>>> "Locally completes at the origin all outstanding RMA operations
>> >>>> initiated by the calling process to the target process specified
>> >>>> by
>> >>>> rank on the specified window. For example, after this routine
>> >>>> completes, the user may reuse any buffers provided to put, get,
>> >>>> or
>> >>>> accumulate operations."
>> >>>>
>> >>>> I do not think "RMA operations" includes load-store.
>> >>>>
>> >>>> Page 410:
>> >>>>
>> >>>> "The consistency of load/store accesses from/to the shared
>> >>>> memory
>> >>>> as
>> >>>> observed by the user program depends on the architecture. A
>> >>>> consistent
>> >>>> view can be created in the unified memory model (see Section
>> >>>> 11.4)
>> >>>> by
>> >>>> utilizing the window synchronization functions (see Section
>> >>>> 11.5)
>> >>>> or
>> >>>> explicitly completing outstanding store accesses (e.g., by
>> >>>> calling
>> >>>> MPI_WIN_FLUSH)."
>> >>>>
>> >>>> Here it is unambiguously implied that MPI_WIN_FLUSH affects
>> >>>> load-stores.
>> >>>>
>> >>>> My preference is to fix the statement on 410 since it is less
>> >>>> canonical than the one on 450, and because I do not want to have
>> >>>> a
>> >>>> memory barrier in every call to WIN_FLUSH.
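>> >>>>
>> >>>> To make the question concrete (my own sketch, not from the
>> >>>> standard; assume a passive-target epoch is open via
>> >>>> MPI_Win_lock_all, and shm is this process's part of the
>> >>>> shared-memory window):
>> >>>>
>> >>>>   shm[0] = 42;                /* plain store, not MPI_Put       */
>> >>>>   MPI_Win_flush(myrank, win); /* page 410 implies this makes    */
>> >>>>                               /* the store visible; page 450    */
>> >>>>                               /* covers only RMA operations     */
>> >>>>
>> >>>> Under the page-450 reading, a memory barrier (e.g. MPI_WIN_SYNC)
>> >>>> is still needed before a neighbor's load is guaranteed to see 42;
>> >>>> under the page-410 reading, the flush alone suffices.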
>> >>>>
>> >>>> Jeff
>> >>>>
>> >>>>
>> >>>>
>> >>>> I would prefer the organizer of the discussion to come from
>> >>>> within the RMA subgroup that proposed the changes for MPI-3.1,
>> >>>> rather than me being the organizer.
>> >>>> I tried to bring all the input together and hope that #456
>> >>>> is now in a state where it is consistent with itself and with the
>> >>>> expectations expressed by the group that published the
>> >>>> paper at EuroMPI on the first usage of this shared memory interface.
>> >>>>
>> >>>> The ticket is (with the help of the recent C11
>> >>>> standardization)
>> >>>> well on its way to also being consistent with compiler
>> >>>> optimizations -
>> >>>> in other words, the C standardization body has learned from the
>> >>>> pthreads problems. Fortran is still an open question to me,
>> >>>> i.e., I do not know the status, see
>> >>>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/456#comment:13
>> >>>>
>> >>>> Best regards
>> >>>> Rolf
>> >>>>
>> >>>>
>> >>>>
>> >>>> ----- Original Message -----
>> >>>>
>> >>>>
>> >>>> From: "William Gropp" <wgropp at illinois.edu>
>> >>>> To: "MPI WG Remote Memory Access working group"
>> >>>> <mpiwg-rma at lists.mpi-forum.org>
>> >>>> Sent: Thursday, September 25, 2014 4:19:14 PM
>> >>>> Subject: [mpiwg-rma] MPI RMA status summary
>> >>>>
>> >>>> I looked through all of the tickets and wrote a summary of the
>> >>>> open
>> >>>> issues, which I’ve attached.  I propose the following:
>> >>>>
>> >>>> Determine which of these issues can be resolved by email.  A
>> >>>> significant number can probably be closed with no further
>> >>>> action.
>> >>>>
>> >>>> For those requiring rework, determine if there is still interest
>> >>>> in
>> >>>> them, and if not, close them as well.
>> >>>>
>> >>>> For the ones requiring discussion, assign someone to organize a
>> >>>> position and discussion.  We can schedule telecons to go over
>> >>>> those
>> >>>> issues.  The first item in the list is certainly in this class.
>> >>>>
>> >>>> Comments?
>> >>>>
>> >>>> Bill
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>
>> >>
>> >
>
> --
> Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
> High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
> University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
> Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
> Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/


