[Mpi3-rma] RMA synchronization optimization [was: Updated MPI-3 RMA proposal 1]

Wed Jun 23 08:01:38 CDT 2010

Right, the amount of code to maintain does increase, especially if nothing
is deprecated. My concern is for the performance of the "common use" cases,
which I think are those in which only one synchronization mode is used
(is this not true? are there any "real" codes using this?).

The performance problem is not specifically in fence, but in the overall
synchronization handling. The high-level problem is that a program is
allowed to switch between synchronization modes at will and without
informing the implementation. Because of this, the implementation has to be
prepared for these sync mode changes and has to keep extra state
information. Particularly difficult is the interaction between fence and
the other modes, because at the time a fence is called one has no idea what
will follow: it might be RMA calls, LOCK, POST, or START, or even a lock
from a remote origin. The implementation has to be prepared for RMA calls,
but also has to be ready to switch gears and go into PSCW mode or
LOCK/UNLOCK mode. It also seems to be the case (based on existing tests in
the MPICH releases, at least) that being the target of a passive RMA epoch
(LOCK/UNLOCK) can happen anytime, anywhere, and has to be handled. It is
not at all clear to me as an implementer, and apparently also to some
application writers, just what is legal and what is not. This also creates
a surprisingly large number of "corner cases" and race conditions, all of
which must be handled by keeping more and different state information. This
all adds up to very complex and inefficient code, at least in my
experience/opinion.
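
As an illustration of the bookkeeping described above, the following is a
minimal sketch in C of per-window synchronization state. It is a toy model
under stated assumptions, not code from MPICH or any other implementation:
the names win_state, win_init, win_handle, the MODE_* and EV_* constants,
and the "checks" counter (a crude stand-in for overhead) are all
hypothetical. It shows only two things: that after a fence the general
path cannot know what follows, and that a window restricted to one
declared mode could get by with a single comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all names here are hypothetical, not from any MPI library. */

typedef enum { MODE_ANY = 0, MODE_FENCE, MODE_PSCW, MODE_LOCK } sync_mode;

typedef enum {
    EV_FENCE, EV_RMA_OP, EV_POST, EV_START, EV_LOCK, EV_REMOTE_LOCK
} win_event;

typedef struct {
    sync_mode declared;   /* MODE_ANY means the program may mix modes freely */
    sync_mode current;    /* mode of the most recent synchronization call */
    bool fence_pending;   /* a fence was seen; what follows is still unknown */
    int checks;           /* number of state checks made (crude cost proxy) */
} win_state;

win_state win_init(sync_mode declared)
{
    win_state w = { declared, declared, false, 0 };
    return w;
}

/* Map an event to the synchronization mode it belongs to.
 * RMA calls (put/get/accumulate) belong to whichever epoch is open. */
static sync_mode mode_of(win_event ev)
{
    switch (ev) {
    case EV_FENCE:       return MODE_FENCE;
    case EV_POST:
    case EV_START:       return MODE_PSCW;
    case EV_LOCK:
    case EV_REMOTE_LOCK: return MODE_LOCK;
    case EV_RMA_OP:      break;
    }
    return MODE_ANY;
}

/* Process one event on the window. Returns false if the event is not
 * allowed under the declared mode. */
bool win_handle(win_state *w, win_event ev)
{
    assert(w != NULL);
    sync_mode m = mode_of(ev);

    if (w->declared != MODE_ANY) {
        /* Single declared mode: one comparison, no "switch gears" logic. */
        w->checks += 1;
        return m == MODE_ANY || m == w->declared;
    }

    /* General case: first ask whether a fence is still outstanding. */
    w->checks += 1;
    if (w->fence_pending && (m == MODE_PSCW || m == MODE_LOCK)) {
        /* The last fence turned out to end the fence phase. */
        w->fence_pending = false;
        w->checks += 1;
    }
    if (m != MODE_ANY && m != w->current) {
        /* The program changed modes without telling us. */
        w->current = m;
        w->checks += 1;
    }
    if (ev == EV_FENCE)
        w->fence_pending = true;
    return true;
}
```

Feeding this model a fence / put / fence / lock / get sequence on a
window created with win_init(MODE_ANY) takes the general path on every
call, whereas a window created with win_init(MODE_LOCK) only ever takes
the one-comparison path and rejects a POST outright. A real
implementation tracks far more than this (remote lock queues, outstanding
operations, PSCW groups), so the sketch understates the difference.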

Hope this makes my point clearer?

thanks,

_______________________________________________
Douglas Miller                  BlueGene Messaging Development
IBM Corp., Rochester, MN USA                     Bldg 030-2 A410
dougmill at us.ibm.com               Douglas Miller/Rochester/IBM

From:    Pavan Balaji <balaji at mcs.anl.gov>
Sent by: mpi3-rma-bounces at lists.mpi-forum.org
To:      "MPI 3.0 Remote Memory Access working group"
         <mpi3-rma at lists.mpi-forum.org>
Date:    06/22/2010 01:49 PM
Subject: Re: [Mpi3-rma] RMA synchronization optimization
         [was: Updated MPI-3 RMA proposal 1]
Please respond to: "MPI 3.0 Remote Memory Access working group"
         <mpi3-rma at lists.mpi-forum.org>

Doug,

This discussion was brought up at the last meeting, but it wasn't clear
to any of us why the current definition of fence was losing performance.

My understanding here is that this is a performance issue, and not a
code complexity issue.

The code complexity issue is not really a good argument, since even if
we added asserts to specify only one synchronization method, you'll
still need to support the most general case where there are no asserts.
So, you'll need to maintain the messy code anyway.

On the other hand, if this is a performance issue, we'll need more
explanation on that.

Thanks,

  -- Pavan

On 06/21/2010 10:11 AM, Douglas Miller wrote:
> Yes, I think what you are saying is what I was trying to say in option #2.
> I'd prefer an explicit function to change "sync mode" rather than try to
> overload fence. But I guess I can see the logic in using fence for this
> purpose as well. Part of the problem I had with MPI2 RMA was the ambiguity
> of fence, though.
>
> Perhaps I was reading more into the proposal, but I thought things like
> "lockall" and/or "alllockall" could conceptually replace fence. I would
> assume that if a platform had hardware-assist for MPI_Win_fence that it
> also could be used for lockall/alllockall, but maybe that is a stretch.
>
>
>

> From:    William Gropp <wgropp at illinois.edu>
> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> To:      "MPI 3.0 Remote Memory Access working group"
>          <mpi3-rma at lists.mpi-forum.org>
> Date:    06/21/2010 08:49 AM
> Subject: Re: [Mpi3-rma] RMA synchronization optimization
>          [was: Updated MPI-3 RMA proposal 1]
> Please respond to: "MPI 3.0 Remote Memory Access working group"
>          <mpi3-rma at lists.mpi-forum.org>
>
> I believe that the original motivation for permitting the mixed sync
> model was for applications that did something like this:
>
> # Initialize a global data area
> fence
> various put or accumulate updates
> fence
>
> # passive-target access of the area
> various lock/get/unlock accesses
>
> Another option would be to require an explicit and collective change
> to the sync mode - as there is only one passive target mode, and the
> "scalable sync" mode in practice involves all processes, this would be
> possible.  An info (already (mis)used for the no_locks property) could
> be used with win_create to specify that all changes in sync mode would
> be signaled with a routine (either win_fence with an assert about sync
> mode changing or a new win_sync_mode routine).
>
> Would something like that address the implementation issues that you
> see (remembering that some systems provided special hardware for a
> fast win_fence, and a single sync model probably isn't sufficient)?
>
> Bill
>
>
> On Jun 21, 2010, at 7:38 AM, Douglas Miller wrote:
>
>> At the risk of prolonging an already difficult-to-follow set of e-mail
>> threads, I have to reiterate my concerns for implementation efficiency.
>>
>> The impediment I see to creating efficient implementations of MPI RMA is
>> that the synchronization primitives provide too much freedom. Because a
>> program may switch back and forth between different synchronization
>> methods on the same window, an implementation must keep track of more
>> state information and handle complex corner cases, all of which precludes
>> an optimized implementation. It's been my (admittedly limited) experience
>> that the synchronization adds a significant overhead, and that overhead
>> is largely due to the handling of state and special cases involving mixed
>> synchronization methods. I have two high-level suggestions, listed in the
>> order I prefer them:
>>
>> 1. Leave MPI2 One-Sided as-is (and hope to deprecate it someday), and
>> create a new and separate RMA scheme, intended to replace the old one,
>> which uses a single synchronization method (say, the *lock* methods being
>> proposed). I prefer this path because it gives us the flexibility to
>> design exactly what we want without being tied to the previous, possibly
>> flawed, design. There is, admittedly, extra work involved with having two
>> sets of APIs, but I think there is some room for re-use and common code,
>> and I feel the extra work is worth the benefit.
>>
>> 2. Augment the MPI2 One-Sided specification with the ability for the user
>> to specify a single synchronization method to be used exclusively on a
>> given window. This could be done by adding Win_create/allocate functions
>> that take an "assert" which specifies the synchronization method to be
>> used, and/or a way to specify "eras" of epochs that will use a single
>> synchronization method - for example, a program can declare at some point
>> that a given window will use only lock/unlock until the next declaration
>> call (specifying another synchronization method, or "all"). At least with
>> such capabilities, an implementation could allow programs to be more
>> efficient if they choose to take the optimization of using a single
>> synchronization method. I know that Win_create does not currently have
>> asserts, but there is a way to add a new function for creating windows
>> that does have asserts (and ensure Win_allocate also has asserts) and then
>> define that the current Win_create is equivalent to Win_create_assert
>> (e.g.) with "asserts" set to zero. Depending on the asserts defined, of
>> course, that should allow the existing Win_create to maintain backward
>> compatibility with MPI2.
>>
>> thanks,
>>
>> From:    William Gropp <wgropp at illinois.edu>
>> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
>> To:      "MPI 3.0 Remote Memory Access working group"
>>          <mpi3-rma at lists.mpi-forum.org>
>> Date:    06/21/2010 12:05 AM
>> Subject: Re: [Mpi3-rma] Updated MPI-3 RMA proposal 1
>> Please respond to: "MPI 3.0 Remote Memory Access working group"
>>          <mpi3-rma at lists.mpi-forum.org>
>>
>> I agree with Rajeev.  And I think we strayed somewhat from the
>> original plan.
>>
>> The goal for the MPI RMA was to enlarge the set of applications
>> that could be efficiently implemented with a one-sided model.  The
>> current model *is* a good one for *some* applications; the complaints
>> about it are often because it doesn't fit some other application.  The
>> RMA extensions for MPI-3, in my mind, needed to address *some*, not
>> all, of the important application areas that the current model does
>> not handle.  It is interesting to consider whether a reasonable
>> functional implementation of, say, UPC, could be implemented with it,
>> but I do not see the MPI RMA as supplying the universal implementation
>> layer for other programming models.
>>
>> Making MPI RMA suitable for implementing all other parallel
>> programming models will require more than I think we want to do - look
>> at the short word and aligned move routines in GASNET as an example.
>> You don't need these functionally, but you may need them for
>> performance.
>>
>> That's why we asked for *application* use cases - those would drive
>> the design.  We have a few but we haven't focused on them as much as I
>> think we should.  What I wanted in proposal 1 was a set of operations,
>> each of which (a) was consistent with the others and (b) was clearly
>> driven by some *application* need (where an application in this case
>> is *not* implementing another programming model).  This very well may
>> have required some options to deal with things like selecting
>> different ordering and overlapping update semantics, though that could
>> be very coarse grained.
>>
>> Note that in this interpretation, proposal 1 is not a "bare minimum";
>> rather, it is a consensus collection of consistent extensions that
>> enlarge the space of applications that can be efficiently coded using
>> MPI-3 one-sided.  It will leave some useful features out and some
>> programming models should focus on interoperability with MPI rather
>> than having MPI-3 RMA provide the specific features that they need.
>> It is fine if there is something important that can't be done with
>> MPI-3 RMA, as long as there are other important things that can be
>> done with it.
>>
>> Bill
>>
>> On Jun 20, 2010, at 6:03 PM, Rajeev Thakur wrote:
>>
>>> Are you referring to Accumulate_get :-)? Maybe it should be in
>>> Proposal 2.
>>>
>>> Maybe we also need a "journal of development" as in MPI-2 :-).
>>>
>>> But, seriously, we need to present a united front at least in
>>> proposal 1. Otherwise the Forum will have no confidence in us.
>>>
>>> Rajeev
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: mpi3-rma-bounces at lists.mpi-forum.org
>>>> [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of
>>>> Pavan Balaji
>>>> Sent: Sunday, June 20, 2010 5:57 PM
>>>> To: MPI 3.0 Remote Memory Access working group
>>>> Subject: Re: [Mpi3-rma] Updated MPI-3 RMA proposal 1
>>>>
>>>>
>>>> On 06/20/2010 05:48 PM, Rajeev Thakur wrote:
>>>>> Proposal 1: This is what the RMA experts agree is the bare minimum
>>>>> needed to fix what is considered broken in MPI-2 RMA.
>>>> I don't agree that whatever is there in proposal 1 is the
>>>> "bare minimum". Maybe this policy should be reworded as:
>>>> *all* members of the working group should agree that this is needed.
>>>>
>>>> This makes both proposal 1 and proposal 2 contain random
>>>> pieces of unrelated features, though.
>>>>
>>>> -- Pavan
>>>>
>>>> --
>>>> Pavan Balaji
>>>> http://www.mcs.anl.gov/~balaji
>>>> _______________________________________________
>>>> mpi3-rma mailing list
>>>> mpi3-rma at lists.mpi-forum.org
>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>>>>
>> William Gropp
>> Deputy Director for Research
>> Institute for Advanced Computing Applications and Technologies
>> Paul and Cynthia Saylor Professor of Computer Science
>> University of Illinois Urbana-Champaign
>
