[Mpi3-rma] Use cases for RMA

Manojkumar Krishnan manoj at pnl.gov
Wed Mar 3 17:45:48 CST 2010


Jeff,

Implicitly I refer to ARMCI when talking about GA, since the memory 
management and communication are all hidden in ARMCI. The GA layer is 
only about data management.

> I don't think it makes sense to compare MPI-3 to GA.  We should be
> talking about ARMCI versus MPI-3.  Part of GA is the one-sided
> transport layer and the other part is data management.

>  Sharing
> pointers on a node is something a library layer such as GA could do on
> top of MPI-3, especially since, unlike ARMCI, MPI RMA collective
> registration attaches to existing segments. 

I am not referring to sharing pointers. It is about how the memory is 
allocated (e.g. as shared memory).
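
For illustration, here is a minimal sketch of the ARMCI-style pattern
(hypothetical segment name; seg_bytes assumed declared; error handling
omitted): the library allocates a POSIX shared memory segment itself
and then registers it with MPI through the existing MPI_Win_create:

  #include <mpi.h>
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  /* every process on the node maps the same shm segment, then
     exposes its own portion for RMA from other nodes */
  int fd = shm_open("/armci_seg", O_CREAT | O_RDWR, 0600);
  ftruncate(fd, seg_bytes);
  void *base = mmap(NULL, seg_bytes, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
  MPI_Win win;
  MPI_Win_create(base, seg_bytes, 1, MPI_INFO_NULL,
                 MPI_COMM_WORLD, &win);

The memory comes from the library, not from MPI_Alloc_mem; MPI only
needs to accept and register it.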

> Hence the ability to do
> NIC bypass for on-node RMA or direct access to pointers is already
> possible in the context of MPI RMA, albeit with more restrictions on
> the validity of these operations and via a higher-level library on top
> of MPI.  Furthering my point, direct access is not something that
> ARMCI provides - both according to the ARMCI documentation and my
> reading of the GA source - so it does not make sense to talk about
> adding this to MPI-3 since the stated goal for a long time has been to
> have GA run on top of MPI-3 rather than MPI-3 replace GA.

I am thinking more along the lines of ARMCI (rather than GA) on top of 
MPI-3, as that is the easiest thing to do (for the GA/ARMCI developers).

> 
> Pavan and I debated having locks be optional when not necessary.  I
> do not find this appealing.  It means that code is non-portable, that
> is, I have to #ifdef locks in or out on (non-)cache-coherent systems.
> It is far better for implementations to optimize locks to null
> operations when unnecessary, i.e. on cache-coherent systems with RDMA.

(Based on Bill's earlier message) I agree with this. However, it 
should not come at the cost of giving up truly one-sided semantics
(or implementations).
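
In other words, if locks were made optional in the interface, portable
code would end up looking something like this sketch (CACHE_COHERENT
is a hypothetical configure-time macro; buf, n, target, disp, and win
assumed declared):

  #ifndef CACHE_COHERENT
  MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
  #endif
  MPI_Put(buf, n, MPI_DOUBLE, target, disp, n, MPI_DOUBLE, win);
  #ifndef CACHE_COHERENT
  MPI_Win_unlock(target, win);
  #endif

Far better, as you say, to write the lock/unlock unconditionally and
let a cache-coherent implementation reduce them to null operations, as
long as the put still completes without any action by the target.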

Thanks,
-Manoj.

> 
> Best,
> 
> Jeff
> 
> On Wed, Mar 3, 2010 at 3:40 PM, Manojkumar Krishnan <manoj at pnl.gov> wrote:
> >
> > Bill,
> >
> > Please scroll for my comments.
> >
> > On Wed, 3 Mar 2010, William Gropp wrote:
> >
> >> Thanks.  I've added some comments inline.
> >>
> >> On Mar 3, 2010, at 1:23 PM, Manojkumar Krishnan wrote:
> >>
> >> >
> >> > Bill,
> >> >
> >> > Here are some of the MPI RMA requirements for Global Arrays (GA) and
> >> > ARMCI. ARMCI is GA's runtime system.  GA/ARMCI exploits native
> >> > network communication interfaces and system resources (such as shared
> >> > memory) to achieve the best possible performance for remote memory
> >> > access/one-sided communication. GA/ARMCI relies *heavily* on optimized
> >> > contiguous and non-contiguous RMA operations (get/put/acc).
> >> >
> >> > For GA/ARMCI and its applications, below are some specific examples of
> >> > operations that are hard to achieve in MPI-2 RMA.
> >> >
> >> > 1. Memory Allocation: (This might be an implementation issue) The
> >> > user or library implementors should be able to allocate memory
> >> > (e.g. shared memory) and register it with MPI. This is useful in
> >> > the case of Global Arrays/ARMCI, which use RMA across nodes and
> >> > shared memory within nodes. ARMCI allocates a shared memory
> >> > segment and pins/registers it with the network.
> >>
> >> MPI already provides MPI_Alloc_mem, though this is not necessarily
> >> shared memory usable by other processes.  But the MPI implementation
> >> could allocate this from a shared memory pool and perform one-sided
> >> operations using load-store.  Do you need something different?
> >>
> >
> > The thing is, this involves an additional copy via RMA_Get. Is it
> > possible for other processes within the same SMP node to access the
> > memory directly (rather than doing get/put using load-store)?
> >
> > For example, something like this: in GA, you get *direct* access (a
> > pointer) to memory regions allocated by processes within the SMP node.
> > If procs 0 and 1 are in the same node, proc 1 can access proc 0's
> > region without an explicit get/put (i.e. proc 1 gets a pointer to this
> > region).
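> >
> > A sketch of what I mean (standard ARMCI_Malloc interface; nproc,
> > peer, and the same-node check assumed):
> >
> >   void *ptrs[nproc];
> >   ARMCI_Malloc(ptrs, seg_bytes);   /* collective allocation */
> >   /* if 'peer' is on my SMP node, its segment is already mapped
> >      into my address space, so I can load/store it directly */
> >   double *remote = (double *) ptrs[peer];
> >   remote[0] = 3.14;                /* no get/put call involved */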
> >> >
> >> > 2. Locks: Should be made optional to keep the RMA programming model
> >> > simple. If the user does not require concurrency, then locks are
> >> > unnecessary. Requiring locks by default might introduce unnecessary
> >> > bugs if programming is not careful (and at the scale of extreme-scale
> >> > systems, such bugs are hard to debug).
> >> >
> >>
> >> You need something to provide for local and remote completion.  Do you
> >> mean to have blocking RMA operations that (in MPI terms) essentially
> >> act as lock-RMA_operation-unlock?
> >
> > Not really. I meant performing RMA operations without locks if locks
> > are not needed. Instead of making them the default, we can make locks
> > optional.
> >
> > With respect to completion and synchronization, here is an example of how
> > it is done in ARMCI.
> >
> > ARMCI_Put: Ensures local completion, i.e. the source buffer is safe
> > to reuse.
> >
> > ARMCI_NbPut: Non-blocking put. The user has to call ARMCI_Wait()
> > before reusing the source buffer.
> >
> > ARMCI_Put(dst_proc)+ARMCI_Fence(dst_proc): Ensures local and remote
> > completion.
> >
> > ARMCI_AllFence: This is collective. It ensures all outstanding RMA
> > operations are completed on the remote side. It is mostly called
> > together with a barrier.
> >
> > ARMCI_Put_Notify/Wait: This is like lightweight message passing. After
> > a (large) put, the sender can send a notify message, and the
> > destination process can wait on it. This is useful when send/recv is
> > in rendezvous mode and the send essentially blocks on a slow receiver.
> > In the case of Put/Put_notify, the sender can issue two separate
> > R(D)MAs with no need to handshake with the receiver, so it is
> > relatively less synchronous. On the other hand, if the sender is slow,
> > then the receiver has to wait!
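> >
> > As a sketch, the typical non-blocking pattern (standard ARMCI calls;
> > src, dst, bytes, and dst_proc assumed declared):
> >
> >   armci_hdl_t hdl;
> >   ARMCI_INIT_HANDLE(&hdl);
> >   ARMCI_NbPut(src, dst, bytes, dst_proc, &hdl);
> >   /* ... overlap computation with communication ... */
> >   ARMCI_Wait(&hdl);        /* local completion: src reusable */
> >   ARMCI_Fence(dst_proc);   /* remote completion at dst_proc  */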
> >
> > Please note that if such synchronization/completion is required
> > between source and destination, then the user should avoid RMA (and
> > use Send/Recv instead), as this is truly two-sided communication.
> >
> >>
> >> > 3. Support for non-overlapping concurrent operations in a window.
> >>
> >> MPI RMA already does this, in all of the active target modes, and in
> >> the passive target mode with the shared lock mode.  What do you need
> >> that isn't supported?
> >
> > In the passive target mode, does this require some synchronization
> > between the source and destination processes if they are performing
> > non-overlapping operations in the same window? If so, then this might
> > not be a truly one-sided operation. Please correct me if I am wrong.
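> >
> > For concreteness, the pattern I have in mind is plain passive-target
> > access to disjoint regions (my_disp assumed non-overlapping across
> > origin processes):
> >
> >   MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
> >   MPI_Put(buf, n, MPI_DOUBLE, 0, my_disp, n, MPI_DOUBLE, win);
> >   MPI_Win_unlock(0, win);
> >
> > The question is whether proc 0 has to do anything at all for these
> > concurrent updates to complete.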
> >
> >>
> >> >
> >> > 4. RMW Operation - Useful for implementing dynamic load balancing
> >> > algorithms (e.g. task queues/work stealing, group/global counters,
> >> > etc).
> >> >
> >>
> >> Yes, this is a major gap.  Is fetch-and-increment enough, or do you
> >> need compare-and-swap, or a general RMI?
> >
> > For ARMCI, we need both: fetch-and-increment and compare-and-swap. A
> > general RMI is certainly useful, as we have plans to extend the capability
> > of ARMCI to perform general RMI.
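> >
> > For example, a dynamic load-balancing counter using ARMCI's existing
> > RMW (sketch; counter_ptrs assumed to come from a collective
> > ARMCI_Malloc, with the counter hosted on proc 0):
> >
> >   long next;
> >   /* atomically fetch the shared counter on proc 0 and add 1 */
> >   ARMCI_Rmw(ARMCI_FETCH_AND_ADD_LONG, &next, counter_ptrs[0], 1, 0);
> >   /* 'next' is now this process's unique task index */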
> >
> > --
> > Thanks,
> > -Manoj.
> > ---------------------------------------------------------------
> > Manojkumar Krishnan
> > High Performance Computing Group
> > Pacific Northwest National Laboratory
> > Ph: (509) 372-4206   Fax: (509) 372-4720
> > http://hpc.pnl.gov/people/manoj
> > ---------------------------------------------------------------
> >
> >
> >>
> >> Bill
> >>
> >> > The above are feature requirements rather than performance issues,
> >> > which
> >> > are implementation specific.
> >> >
> >> > If you have any questions, I would be happy to explain the above in
> >> > detail.
> >> >
> >> > Thanks,
> >> > -Manoj.
> >> > ---------------------------------------------------------------
> >> > Manojkumar Krishnan
> >> > High Performance Computing Group
> >> > Pacific Northwest National Laboratory
> >> > Ph: (509) 372-4206   Fax: (509) 372-4720
> >> > http://hpc.pnl.gov/people/manoj
> >> > ---------------------------------------------------------------
> >> >
> >> > On Wed, 3 Mar 2010, William Gropp wrote:
> >> >
> >> >> I went through the mail that was sent out in response to our request
> >> >> for use cases, and I must say it was underwhelming.  I've included a
> >> >> short summary below; based on this, we aren't looking at the correct
> >> >> needs.  I don't think that these are representative of *all* of the
> >> >> needs of RMA, but without good use cases, I don't see how we can
> >> >> justify any but the most limited extensions/changes to the current
> >> >> design.  Please (a) let me know if I overlooked something and (b)
> >> >> send
> >> >> me (and the list) additional use cases.  For example, we haven't
> >> >> included any of the issues needed to implement PGAS languages, nor
> >> >> have we addressed the existing SHMEM codes.  Do we simply say that a
> >> >> high-quality implementation will permit interoperation with whatever
> >> >> OpenSHMEM is?  And what do we do about the RMI issue raised by one
> >> >> of the two use cases that we have?
> >> >>
> >> >> Basically, I received two detailed notes for the area of Quantum
> >> >> Chemistry.  In brief:
> >> >>
> >> >> MPI-2 RMA is already adequate for most parts, as long as the
> >> >> implementation makes progress (as it is required to) on passive
> >> >> updates.  Requires get or accumulate; rarely requires put.
> >> >> Dynamic load balancing (a related but separate issue) needs some sort
> >> >> of RMW.  (Thanks to Jeff Hammond)
> >> >>
> >> >> More complex algorithms and data structures appear to require remote
> >> >> method invocation (RMI).  Sparse tree updates provide one example.
> >> >> (Thanks to Robert Harrison)
> >> >>
> >> >> Bill
> >> >>
> >> >>
> >> >> William Gropp
> >> >> Deputy Director for Research
> >> >> Institute for Advanced Computing Applications and Technologies
> >> >> Paul and Cynthia Saylor Professor of Computer Science
> >> >> University of Illinois Urbana-Champaign
> >> >>
> >> >>
> >> >>
> >> >>
> >>
> >> William Gropp
> >> Deputy Director for Research
> >> Institute for Advanced Computing Applications and Technologies
> >> Paul and Cynthia Saylor Professor of Computer Science
> >> University of Illinois Urbana-Champaign
> >>
> >>
> >>
> >>
> 
> 
> 
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> jhammond at mcs.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> 