[mpiwg-rma] MPI RMA status summary

Rolf Rabenseifner rabenseifner at hlrs.de
Tue Sep 30 09:35:53 CDT 2014


There are different ways to look at it.
Remote memory accesses such as remote load and store can
be interpreted as RMA by one reader and as non-RMA by another.

Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, 
Ron Brightwell, William Gropp, Vivek Kale, and Rajeev Thakur,
in 
 "MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory" 
 at EuroMPI 2012,
obviously treated them as RMA, because otherwise their store+Fence+load
pattern would have been wrong.
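
As a concrete illustration of that store+Fence+load usage, here is a
minimal sketch (not taken from the paper itself; the node-local
communicator and the neighbor exchange are illustrative assumptions),
written for reading b), in which MPI_Win_fence also acts as a local
memory barrier for the shared memory window:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* communicator containing only the processes of one shared memory node */
      MPI_Comm shmcomm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &shmcomm);

      int rank, size;
      MPI_Comm_rank(shmcomm, &rank);
      MPI_Comm_size(shmcomm, &size);

      int *mem;
      MPI_Win win;
      MPI_Win_allocate_shared(sizeof(int), sizeof(int), MPI_INFO_NULL,
                              shmcomm, &mem, &win);

      MPI_Win_fence(0, win);        /* open the epoch */
      *mem = rank;                  /* store into my own window segment */
      MPI_Win_fence(0, win);        /* under b): completes the store and
                                       synchronizes the processes */

      /* load the neighbor's value directly through shared memory */
      int peer = (rank + 1) % size;
      MPI_Aint peer_size;
      int peer_disp_unit;
      int *peer_mem;
      MPI_Win_shared_query(win, peer, &peer_size, &peer_disp_unit,
                           &peer_mem);
      printf("rank %d read %d from rank %d\n", rank, *peer_mem, peer);

      MPI_Win_fence(0, win);
      MPI_Win_free(&win);
      MPI_Comm_free(&shmcomm);
      MPI_Finalize();
      return 0;
  }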

The major reason for going with b) and not with a) is simple:

If somebody does not want shared memory
synchronization semantics for an existing MPI_Win
synchronization routine (e.g. MPI_Win_fence), then
he/she need not use that routine together with
shared memory windows.
If he/she wants to stay with MPI_Put/MPI_Get,
then he/she need not use MPI_Win_allocate_shared.
It is allowed to use MPI_Win_allocate on a shared memory node,
and MPI can internally apply the same optimizations.
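
For comparison, here is a minimal sketch of that Put/Get alternative on
a window created by MPI_Win_allocate, so that no shared memory
load/store semantics are relied upon (the neighbor exchange is again
only illustrative):

  #include <mpi.h>

  void put_get_alternative(MPI_Comm comm)
  {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      int *base;
      MPI_Win win;
      /* MPI_Win_allocate may still place the memory in a shared
         segment internally when all processes are on one node. */
      MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, comm,
                       &base, &win);

      *base = -1;
      MPI_Win_fence(0, win);

      int value = rank;
      int peer = (rank + 1) % size;
      MPI_Put(&value, 1, MPI_INT, peer, 0, 1, MPI_INT, win);

      MPI_Win_fence(0, win);    /* completes the MPI_Put; *base now
                                   holds the left neighbor's rank */
      MPI_Win_free(&win);
  }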

Therefore, shared memory synchronization semantics
for MPI_Win_fence, MPI_Win_post/start/complete/wait,
MPI_Win_lock/unlock, etc.
are never a drawback.

They are, in fact, a clear advantage, because process-to-process
synchronization is combined with local memory synchronization,
which may be implemented faster than if the two were separated
into different routines.
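
A sketch of what that combination means in code (assuming a shared
memory window win; mine points into my own segment and peer into a
neighbor's segment, e.g. obtained via MPI_Win_shared_query; all names
are illustrative):

  #include <mpi.h>

  /* Separated: local memory synchronization (MPI_Win_sync) and
     process-to-process synchronization (MPI_Barrier) in different
     routines; assumes the window is inside a passive-target epoch,
     e.g. opened with MPI_Win_lock_all. */
  int exchange_separated(volatile int *mine, volatile int *peer,
                         MPI_Win win, MPI_Comm shmcomm)
  {
      *mine = 42;               /* store */
      MPI_Win_sync(win);        /* writer-side memory barrier */
      MPI_Barrier(shmcomm);     /* process-to-process synchronization */
      MPI_Win_sync(win);        /* reader-side memory barrier */
      return *peer;             /* load */
  }

  /* Combined: one routine provides both, as argued for with b). */
  int exchange_combined(volatile int *mine, volatile int *peer,
                        MPI_Win win)
  {
      *mine = 42;               /* store */
      MPI_Win_fence(0, win);    /* synchronizes processes and memory */
      return *peer;             /* load */
  }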

Rolf 


----- Original Message -----
> From: "Jeff Hammond" <jeff.science at gmail.com>
> To: "MPI WG Remote Memory Access working group" <mpiwg-rma at lists.mpi-forum.org>
> Sent: Tuesday, September 30, 2014 4:09:02 PM
> Subject: Re: [mpiwg-rma] MPI RMA status summary
> 
> Option A is what the standard says today outside of a sloppy offhand
> remark in parentheses. See my note please.
> 
> Jeff
> 
> Sent from my iPhone
> 
> > On Sep 30, 2014, at 7:05 AM, Rolf Rabenseifner
> > <rabenseifner at hlrs.de> wrote:
> > 
> > I strongly agree with your statement:
> >> These piecemeal changes are one of the sources of our problems.
> > 
> > I only wanted to strongly say, that I would never vote for your a),
> > because it is not backward compatible to what is already used.
> > And with b) I've the problem that b1) is clear to me (see #456),
> > but the Win_flush semantics for load/store is unclear to me.
> > 
> > Of course, a total solution is needed and not parts of it.
> > #456 is such a trial for a complete solution.
> > 
> > Rolf
> > 
> > ----- Original Message -----
> >> From: "William Gropp" <wgropp at illinois.edu>
> >> To: "MPI WG Remote Memory Access working group"
> >> <mpiwg-rma at lists.mpi-forum.org>
> >> Sent: Tuesday, September 30, 2014 3:19:06 PM
> >> Subject: Re: [mpiwg-rma] MPI RMA status summary
> >> 
> >> I disagree with this approach.  The most important thing to do is
> >> to
> >> figure out the correct definitions and semantics.  Once we agree
> >> on
> >> that, we can determine what can be handled as an errata and what
> >> will require an update to the chapter and an update to the MPI
> >> standard.  These piecemeal changes are one of the sources of our
> >> problems.
> >> 
> >> Bill
> >> 
> >> On Sep 30, 2014, at 7:38 AM, Rolf Rabenseifner
> >> <rabenseifner at hlrs.de>
> >> wrote:
> >> 
> >>>> Vote for 1 of the following:
> >>>> 
> >>>> a) Only Win_sync provides memory barrier semantics to shared
> >>>> memory
> >>>> windows
> >>>> b) All RMA completion/sync routines (e.g., MPI_Win_lock,
> >>>> MPI_Win_fence, MPI_Win_flush) provide memory barrier semantics
> >>>> c) Some as yet undetermined blend of a and b, which might
> >>>> include
> >>>> additional asserts
> >>>> d) This topic needs further discussion
> >>> 
> >>> Because we only have to clarify MPI-3.0 (this is an errata issue)
> >>> and
> >>> - obviously the MPI Forum and the readers expected that
> >>> MPI_Win_fence
> >>> (and therefore also the other MPI-2 synchronizations
> >>> MPI_Win_post/start/complete/wait and MPI_Win_lock/unlock)
> >>> works if MPI_Get/Put are substituted by shared memory load/store
> >>> (see the many Forum members as authors of the EuroMPI paper)
> >>> - and the Forum decided that also MPI_Win_sync acts as if
> >>> a memory barrier is inside,
> >>> for me,
> >>> - a) cannot be chosen because an erratum cannot remove
> >>>    a given functionality
> >>> - and b) is automatically given, see reasons above. Therefore
> >>> #456.
> >>> 
> >>> The only open question for me is about the meaning of
> >>> MPI_Win_flush.
> >>> Therefore MPI_Win_flush is still missing in #456.
> >>> 
> >>> Therefore, for me, the major choices seem to be
> >>> b1) MPI-2 synchronizations + MPI_Win_sync
> >>> b2) MPI-2 synchronizations + MPI_Win_sync + MPI_Win_flush
> >>> 
> >>> For this vote, I clearly want to see a clear proposal
> >>> about the meaning of MPI_Win_flush together with
> >>> shared memory load/store, hopefully with the notation
> >>> used in #456.
> >>> 
> >>> Best regards
> >>> Rolf
> >>> 
> >>> ----- Original Message -----
> >>>> From: "William Gropp" <wgropp at illinois.edu>
> >>>> To: "MPI WG Remote Memory Access working group"
> >>>> <mpiwg-rma at lists.mpi-forum.org>
> >>>> Sent: Monday, September 29, 2014 11:39:51 PM
> >>>> Subject: Re: [mpiwg-rma] MPI RMA status summary
> >>>> 
> >>>> 
> >>>> Thanks, Jeff.
> >>>> 
> >>>> 
> >>>> I agree that I don’t want load/store to be considered RMA
> >>>> operations.
> >>>> But the issue of the memory consistency on RMA synchronization
> >>>> and
> >>>> completion operations to a shared memory window is complex.  In
> >>>> some
> >>>> ways, the most consistent with RMA in other situations is the
> >>>> case
> >>>> of MPI_Win_lock to your own process; the easiest extension for
> >>>> the
> >>>> user is to have reasonably strong memory barrier semantics at
> >>>> all
> >>>> sync/completion operations (thus including Fence).  As you note,
> >>>> this has costs.  At the other extreme, we could say that only
> >>>> Win_sync provides these memory barrier semantics.  And we could
> >>>> pick
> >>>> a more complex blend (yes for some, no for others).
> >>>> 
> >>>> 
> >>>> One of the first questions is whether we want to only Win_sync,
> >>>> all
> >>>> completion/sync RMA routines, or some subset to provide memory
> >>>> barrier semantics for shared memory windows (this would include
> >>>> RMA
> >>>> windows that claimed to be shared memory, since there is a
> >>>> proposal
> >>>> to extend that property to other RMA windows).  It would be good
> >>>> to
> >>>> make progress on this question, so I propose a straw vote of
> >>>> this
> >>>> group by email.  Vote for 1 of the following:
> >>>> 
> >>>> 
> >>>> a) Only Win_sync provides memory barrier semantics to shared
> >>>> memory
> >>>> windows
> >>>> b) All RMA completion/sync routines (e.g., MPI_Win_lock,
> >>>> MPI_Win_fence, MPI_Win_flush) provide memory barrier semantics
> >>>> c) Some as yet undetermined blend of a and b, which might
> >>>> include
> >>>> additional asserts
> >>>> d) This topic needs further discussion
> >>>> 
> >>>> 
> >>>> Note that I’ve left off what “memory barrier semantics” means.
> >>>> That
> >>>> will need to be precisely defined for the standard, but I
> >>>> believe
> >>>> we
> >>>> know what we intend for this.  We specifically are not defining
> >>>> what
> >>>> happens with non-MPI code.  Also note that this is separate from
> >>>> whether the RMA sync routines appear to be blocking when applied
> >>>> to
> >>>> a shared memory window; we can do a separate straw vote on that.
> >>>> 
> >>>> 
> >>>> Bill
> >>>> 
> >>>> 
> >>>> 
> >>>> On Sep 29, 2014, at 3:49 PM, Jeff Hammond
> >>>> <jeff.science at gmail.com> wrote:
> >>>> 
> >>>> 
> >>>> 
> >>>> On Mon, Sep 29, 2014 at 9:16 AM, Rolf Rabenseifner
> >>>> <rabenseifner at hlrs.de> wrote:
> >>>> 
> >>>> 
> >>>> Only about the issues on #456 (shared memory syncronization):
> >>>> 
> >>>> 
> >>>> 
> >>>> For the ones requiring discussion, assign someone to organize a
> >>>> position and discussion.  We can schedule telecons to go over
> >>>> those
> >>>> issues.  The first item in the list is certainly in this class.
> >>>> 
> >>>> Who can organize telecons on #456.
> >>>> Would it be possible to organize a RMA meeting at SC?
> >>>> 
> >>>> I will be there Monday through part of Thursday but am usually
> >>>> triple-booked from 8 AM to midnight.
> >>>> 
> >>>> 
> >>>> 
> >>>> The position expressed by the solution #456 is based on the idea
> >>>> that the MPI RMA synchronization routines should have the same
> >>>> outcome when RMA PUT and GET calls are substituted by stores and
> >>>> loads.
> >>>> 
> >>>> The outcome for the flush routines is still not defined.
> >>>> 
> >>>> It is interesting, because the standard is actually conflicting
> >>>> on
> >>>> whether Flush affects load-store.  I find this incredibly
> >>>> frustrating.
> >>>> 
> >>>> Page 450:
> >>>> 
> >>>> "Locally completes at the origin all outstanding RMA operations
> >>>> initiated by the calling process to the target process specified
> >>>> by
> >>>> rank on the specified window. For example, after this routine
> >>>> completes, the user may reuse any buffers provided to put, get,
> >>>> or
> >>>> accumulate operations."
> >>>> 
> >>>> I do not think "RMA operations" includes load-store.
> >>>> 
> >>>> Page 410:
> >>>> 
> >>>> "The consistency of load/store accesses from/to the shared
> >>>> memory
> >>>> as
> >>>> observed by the user program depends on the architecture. A
> >>>> consistent
> >>>> view can be created in the unified memory model (see Section
> >>>> 11.4)
> >>>> by
> >>>> utilizing the window synchronization functions (see Section
> >>>> 11.5)
> >>>> or
> >>>> explicitly completing outstanding store accesses (e.g., by
> >>>> calling
> >>>> MPI_WIN_FLUSH)."
> >>>> 
> >>>> Here it is unambiguously implied that MPI_WIN_FLUSH affects
> >>>> load-stores.
> >>>> 
> >>>> My preference is to fix the statement on 410 since it is less
> >>>> canonical than the one on 450, and because I do not want to have
> >>>> a
> >>>> memory barrier in every call to WIN_FLUSH.
> >>>> 
> >>>> Jeff
> >>>> 
> >>>> 
> >>>> 
> >>>> I would prefer that someone inside the RMA subgroup that
> >>>> proposed the changes for MPI-3.1 organize the discussion,
> >>>> rather than me being the organizer.
> >>>> I tried to bring all the input together and hope that #456
> >>>> is now in a state where it is consistent in itself and with the
> >>>> expectations expressed by the group that published the
> >>>> paper at EuroMPI on first usage of this shared memory interface.
> >>>> 
> >>>> The ticket is (with the help of the recent C11
> >>>> standardization)
> >>>> well on its way to also being consistent with compiler
> >>>> optimizations -
> >>>> in other words, the C standardization body has learned from the
> >>>> pthreads problems. Fortran is still an open question to me,
> >>>> i.e., I do not know the status, see
> >>>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/456#comment:13
> >>>> 
> >>>> Best regards
> >>>> Rolf
> >>>> 
> >>>> 
> >>>> 
> >>>> ----- Original Message -----
> >>>> 
> >>>> 
> >>>> From: "William Gropp" <wgropp at illinois.edu>
> >>>> To: "MPI WG Remote Memory Access working group"
> >>>> <mpiwg-rma at lists.mpi-forum.org>
> >>>> Sent: Thursday, September 25, 2014 4:19:14 PM
> >>>> Subject: [mpiwg-rma] MPI RMA status summary
> >>>> 
> >>>> I looked through all of the tickets and wrote a summary of the
> >>>> open
> >>>> issues, which I’ve attached.  I propose the following:
> >>>> 
> >>>> Determine which of these issues can be resolved by email.  A
> >>>> significant number can probably be closed with no further
> >>>> action.
> >>>> 
> >>>> For those requiring rework, determine if there is still interest
> >>>> in
> >>>> them, and if not, close them as well.
> >>>> 
> >>>> For the ones requiring discussion, assign someone to organize a
> >>>> position and discussion.  We can schedule telecons to go over
> >>>> those
> >>>> issues.  The first item in the list is certainly in this class.
> >>>> 
> >>>> Comments?
> >>>> 
> >>>> Bill
> >>>> 
> >>>> --
> >>>> Jeff Hammond
> >>>> jeff.science at gmail.com
> >>>> http://jeffhammond.github.io/

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)


