[Mpi3-rma] Use cases for RMA
manoj at pnl.gov
Wed Mar 3 15:40:24 CST 2010
Please scroll for my comments.
On Wed, 3 Mar 2010, William Gropp wrote:
> Thanks. I've added some comments inline.
> On Mar 3, 2010, at 1:23 PM, Manojkumar Krishnan wrote:
> > Bill,
> > Here are some of the MPI RMA requirements for Global Arrays (GA) and
> > ARMCI. ARMCI is GA's runtime system. GA/ARMCI exploits native
> > network communication interfaces and system resources (such as shared
> > memory) to achieve the best possible performance of the remote memory
> > access/one-sided communication. GA/ARMCI relies *heavily* on optimized
> > contiguous and non-contiguous RMA operations (get/put/acc).
> > For GA/ARMCI and its applications, below are some specific examples of
> > operations that are hard to achieve in MPI-2 RMA.
> > 1. Memory Allocation: (This might be an implementation issue) The
> > user or
> > library implementors should be able to allocate memory (e.g. shared
> > memory), and register with MPI. This is useful in case of Global
> > Arrays/ARMCI, which use RMA across nodes and shared memory within
> > nodes.
> > ARMCI allocates shared memory segment, and pins/registers with the
> > network.
> MPI already provides MPI_Alloc_mem , though this is not necessarily
> shared memory usable by other processes. But the MPI implementation
> could allocate this from a shared memory pool and perform one-sided
> operations using load-store. Do you need something different?
The thing is, this involves an additional copy via RMA_Get. Is it
possible for other processes within the same SMP node to access the memory
directly (rather than copying it with get/put or load-store)?
For example: in GA, you get *direct* access (a pointer) to memory
regions allocated by processes within the same SMP node. If
procs 0 and 1 are on the same node, proc 1 can access proc 0's region
without an explicit get/put (i.e., proc 1 gets a pointer to this region).
> > 2. Locks: Should be made optional to keep the RMA programming model
> > simple. If the user does not require concurrency, then locks are
> > unnecessary. Requiring locks by default might introduce
> > unnecessary bugs if not carefully programmed (and at the level of
> > extreme-scale systems, these are hard to debug).
> You need something to provide for local and remote completion. Do you
> mean to have blocking RMA operations that (in MPI terms) essentially
> act as lock-RMA_operation-unlock?
Not really. I meant performing RMA operations without locks when locks
are not needed: instead of making them the default, locks could be optional.
With respect to completion and synchronization, here is an example of how
it is done in ARMCI.
ARMCI_Put: Ensures local completion; the source buffer is safe to reuse.
ARMCI_NbPut: Non-blocking put. The user has to call wait() before reusing
the source buffer.
ARMCI_Put(dst_proc) + ARMCI_Fence(dst_proc): Ensures local and remote
completion.
ARMCI_Allfence: This is collective. Ensures all outstanding RMA
operations are completed on the remote side. Mostly called with a barrier.
ARMCI_Put_Notify/Wait: This is like lightweight message passing. After a
(large) put, the sender can send a notify message, and the destination
process can wait on it. This is useful when send/recv would run in rendezvous
mode and the send essentially blocks on a slow receiver. With
put/put_notify, the sender can issue two separate R(D)MAs with no need to
handshake with the receiver, so it is relatively less synchronous. On the
other hand, if the sender is slow, then the receiver has to wait!
Please note that if such synchronization/completion is required
between source and destination, the user should avoid RMA (and use
Send/Recv instead), as this is truly two-sided communication.
> > 3. Support for non-overlapping concurrent operations in a window.
> MPI RMA already does this, in all of the active target modes, and in
> the passive target mode with the shared lock mode. What do you need
> that isn't supported?
In the passive target mode, does this require synchronization between
the source and destination processes if they are performing non-overlapping
operations in the same window? If so, then this might not be a truly
one-sided operation. Please correct me if I am wrong.
> > 4. RMW Operation - Useful for implementing dynamic load balancing
> > algorithms (e.g. task queues/work stealing, group/global counters,
> > etc).
> Yes, this is a major gap. Is fetch-and-increment enough, or do you
> need compare-and-swap, or a general RMI?
For ARMCI, we need both: fetch-and-increment and compare-and-swap. A
general RMI is certainly useful, as we have plans to extend the capability
of ARMCI to perform general RMI.
High Performance Computing Group
Pacific Northwest National Laboratory
Ph: (509) 372-4206 Fax: (509) 372-4720
> > The above are feature requirements rather than performance issues,
> > which
> > are implementation specific.
> > If you have any questions, I would be happy to explain the above in
> > detail.
> > Thanks,
> > -Manoj.
> > ---------------------------------------------------------------
> > Manojkumar Krishnan
> > High Performance Computing Group
> > Pacific Northwest National Laboratory
> > Ph: (509) 372-4206 Fax: (509) 372-4720
> > http://hpc.pnl.gov/people/manoj
> > ---------------------------------------------------------------
> > On Wed, 3 Mar 2010, William Gropp wrote:
> >> I went through the mail that was sent out in response to our request
> >> for use cases, and I must say it was underwhelming. I've included a
> >> short summary below; based on this, we aren't looking at the correct
> >> needs. I don't think that these are representative of *all* of the
> >> needs of RMA, but without good use cases, I don't see how we can
> >> justify any but the most limited extensions/changes to the current
> >> design. Please (a) let me know if I overlooked something and (b)
> >> send
> >> me (and the list) additional use cases. For example, we haven't
> >> included any of the issues needed to implement PGAS languages, nor
> >> have we addressed the existing SHMEM codes. Do we simply say that a
> >> high-quality implementation will permit interoperation with whatever
> >> OpenSHMEM is? And what do we do about the RMI issue that one of the
> >> two use cases that we have raises?
> >> Basically, I received two detailed notes for the area of Quantum
> >> Chemistry. In brief:
> >> MPI-2 RMA is already adequate for most parts, as long as the
> >> implementation makes progress (as it is required to) on passive
> >> updates. Requires get or accumulate; rarely requires put.
> >> Dynamic load balancing (a related but separate issue) needs some sort
> >> of RMW. (Thanks to Jeff Hammond)
> >> More complex algorithms and data structures appear to require remote
> >> method invocation (RMI). Sparse tree updates provide one example.
> >> (Thanks to Robert Harrison)
> >> Bill
> >> William Gropp
> >> Deputy Director for Research
> >> Institute for Advanced Computing Applications and Technologies
> >> Paul and Cynthia Saylor Professor of Computer Science
> >> University of Illinois Urbana-Champaign
> >> _______________________________________________
> >> mpi3-rma mailing list
> >> mpi3-rma at lists.mpi-forum.org
> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma