[Mpi3-rma] Use cases for RMA

Jeff Hammond jeff.science at gmail.com
Wed Mar 3 16:17:31 CST 2010


I don't think it makes sense to compare MPI-3 to GA.  We should be
talking about ARMCI versus MPI-3.  Part of GA is the one-sided
transport layer and the other part is data management.  Sharing
pointers on a node is something a library layer such as GA could do on
top of MPI-3, especially since MPI RMA collective registration, unlike
ARMCI's, attaches to existing segments.  Hence NIC bypass for on-node
RMA, or direct access to pointers, is already possible in the context
of MPI RMA, albeit with more restrictions on the validity of these
operations and via a higher-level library on top of MPI.  Furthering
my point, direct access is not something that ARMCI provides - both
according to the ARMCI documentation and my reading of the GA source -
so it does not make sense to talk about adding this to MPI-3, since
the stated goal for a long time has been to have GA run on top of
MPI-3 rather than have MPI-3 replace GA.

Pavan and I debated making locks optional when they are not necessary.
I do not find this appealing.  It means that code becomes
non-portable: I would have to #ifdef locks in or out depending on
whether the target system is cache-coherent.  It is far better for
implementations to optimize locks into null operations when they are
unnecessary, i.e. on cache-coherent systems with RDMA.
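
To illustrate the portability problem, here is a hypothetical sketch
of what user code would look like if locks were merely optional.  The
unlocked branch is not valid MPI-2 - it assumes the relaxed semantics
under discussion - and CACHE_COHERENT_RDMA is a made-up macro:

    #include <mpi.h>

    void update(double *buf, int n, int target, MPI_Aint disp,
                MPI_Win win)
    {
    #if defined(CACHE_COHERENT_RDMA)
        /* relaxed semantics: no access epoch required */
        MPI_Put(buf, n, MPI_DOUBLE, target, disp, n, MPI_DOUBLE, win);
    #else
        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(buf, n, MPI_DOUBLE, target, disp, n, MPI_DOUBLE, win);
        MPI_Win_unlock(target, win);
    #endif
    }

Every such #ifdef is a portability hazard; a lock that the
implementation turns into a null operation on coherent hardware gives
the same performance with a single code path.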

Best,

Jeff

On Wed, Mar 3, 2010 at 3:40 PM, Manojkumar Krishnan <manoj at pnl.gov> wrote:
>
> Bill,
>
> Please scroll for my comments.
>
> On Wed, 3 Mar 2010, William Gropp wrote:
>
>> Thanks.  I've added some comments inline.
>>
>> On Mar 3, 2010, at 1:23 PM, Manojkumar Krishnan wrote:
>>
>> >
>> > Bill,
>> >
>> > Here are some of the MPI RMA requirements for Global Arrays (GA) and
>> > ARMCI. ARMCI is GA's runtime system.  GA/ARMCI exploits native
>> > network communication interfaces and system resources (such as shared
>> > memory) to achieve the best possible performance for remote memory
>> > access/one-sided communication. GA/ARMCI relies *heavily* on optimized
>> > contiguous and non-contiguous RMA operations (get/put/acc).
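>> >
>> > As one illustration, a strided (non-contiguous) put of a 2-D patch
>> > looks roughly like this (a sketch; cols/rows and the leading
>> > dimensions src_ld/dst_ld are assumed variables):
>> >
>> >     int count[2] = { cols * (int) sizeof(double), rows };
>> >     int sstride  = src_ld * (int) sizeof(double);
>> >     int dstride  = dst_ld * (int) sizeof(double);
>> >     /* count[0] bytes per row, count[1] rows, 1 stride level */
>> >     ARMCI_PutS(src, &sstride, dst, &dstride, count, 1, proc);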
>> >
>> > For GA/ARMCI and its applications, below are some specific examples
>> > of operations that are hard to achieve in MPI-2 RMA.
>> >
>> > 1. Memory Allocation: (This might be an implementation issue.) The
>> > user or library implementors should be able to allocate memory
>> > (e.g. shared memory) and register it with MPI. This is useful in
>> > the case of Global Arrays/ARMCI, which use RMA across nodes and
>> > shared memory within nodes. ARMCI allocates a shared memory segment
>> > and pins/registers it with the network.
>>
>> MPI already provides MPI_Alloc_mem, though this is not necessarily
>> shared memory usable by other processes.  But the MPI implementation
>> could allocate this from a shared memory pool and perform one-sided
>> operations using load-store.  Do you need something different?
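>>
>> For concreteness, a minimal MPI-2 sketch of that pattern
>> (illustrative only):
>>
>>     int n = 1024;                     /* elements in the window */
>>     double *base;
>>     MPI_Win win;
>>     MPI_Alloc_mem(n * sizeof(double), MPI_INFO_NULL, &base);
>>     MPI_Win_create(base, n * sizeof(double), sizeof(double),
>>                    MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>>     /* ... RMA access/exposure epochs ... */
>>     MPI_Win_free(&win);
>>     MPI_Free_mem(base);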
>>
>
> The thing is, this involves an additional copy via RMA_Get. Is it
> possible for other processes within the same SMP node to access the
> memory directly (rather than doing get/put using load-store)?
>
> For example, something like this: in GA, you get *direct* access (a
> pointer) to memory regions allocated by processes within the same SMP
> node. If procs 0 and 1 are on the same node, proc 1 can access proc
> 0's region without explicit get/put (i.e. proc 1 gets a pointer to
> this region).
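>
> In ARMCI terms, the allocation side looks roughly like this (a
> sketch; same_node() and the me/peer/nprocs variables are assumed
> helpers, not part of the ARMCI API):
>
>     void *ptrs[nprocs];
>     ARMCI_Malloc(ptrs, bytes);        /* collective allocation */
>     if (same_node(me, peer)) {
>         /* segments on one SMP node live in shared memory, so the
>            peer's base pointer can be dereferenced directly */
>         double *remote = (double *) ptrs[peer];
>         double x = remote[0];         /* direct load, no get/put */
>     }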
>> >
>> > 2. Locks: Should be made optional to keep the RMA programming model
>> > simple. If the user does not require concurrency, then locks are
>> > unnecessary. Enforcing locks by default might introduce unnecessary
>> > bugs if programs are not carefully written (and at extreme scale
>> > these are hard to debug).
>> >
>>
>> You need something to provide for local and remote completion.  Do you
>> mean to have blocking RMA operations that (in MPI terms) essentially
>> act as lock-RMA_operation-unlock?
>
> Not really. I meant performing RMA operations without locks when locks
> are not needed. Instead of making them the default, we could make
> locks optional.
>
> With respect to completion and synchronization, here is an example of how
> it is done in ARMCI.
>
> ARMCI_Put: Ensures local completion; the source buffer is safe to
> reuse.
>
> ARMCI_NbPut: Non-blocking put. The user has to call wait() before
> reusing the source buffer.
>
> ARMCI_Put(dst_proc)+ARMCI_Fence(dst_proc): Ensures local and remote
> completion.
>
> ARMCI_Allfence: This is collective. Ensures all outstanding RMA
> operations are completed on the remote side. Mostly called together
> with a barrier.
>
> ARMCI_Put_Notify/Wait: This is like light-weight message passing.
> After a (large) put, the sender can send a notify message, and the
> destination process can wait on it. This is useful if send/recv is in
> rendezvous mode and the send essentially blocks for a slow receiver.
> In the case of Put/Put_notify, the sender can issue two separate
> R(D)MAs and does not need to handshake with the receiver, so it is
> relatively less synchronous. On the other hand, if the sender is
> slow, then the receiver has to wait!
>
> Please note that if such synchronization/completion is required
> between source and destination, then the user should avoid using RMA
> (and use Send/Recv instead), as this is truly two-sided communication.
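>
> Put together, a typical sequence looks like this (a sketch based on
> the semantics above):
>
>     armci_hdl_t hdl;
>     ARMCI_INIT_HANDLE(&hdl);
>     ARMCI_NbPut(src, dst, bytes, proc, &hdl); /* returns immediately */
>     ARMCI_Wait(&hdl);                   /* src is safe to reuse      */
>     ARMCI_Put(src2, dst2, bytes, proc); /* locally complete          */
>     ARMCI_Fence(proc);                  /* now remotely complete too */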
>
>>
>> > 3. Support for non-overlapping concurrent operations in a window.
>>
>> MPI RMA already does this, in all of the active target modes, and in
>> the passive target mode with the shared lock mode.  What do you need
>> that isn't supported?
>
> In passive target mode, does this require some synchronization between
> the source and destination processes if they are performing
> non-overlapping operations in the same window? If so, then this might
> not be a truly one-sided operation. Please correct me if I am wrong.
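>
> For reference, the MPI-2 passive-target pattern under discussion is
> roughly this (sketch):
>
>     /* The target makes no matching call, but the origin must still
>        bracket its RMA in a lock/unlock access epoch. */
>     MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
>     MPI_Put(buf, n, MPI_DOUBLE, target, disp, n, MPI_DOUBLE, win);
>     MPI_Win_unlock(target, win);  /* put is complete at the target */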
>
>>
>> >
>> > 4. RMW Operation - Useful for implementing dynamic load balancing
>> > algorithms (e.g. task queues/work stealing, group/global counters,
>> > etc).
>> >
>>
>> Yes, this is a major gap.  Is fetch-and-increment enough, or do you
>> need compare-and-swap, or a general RMI?
>
> For ARMCI, we need both fetch-and-increment and compare-and-swap. A
> general RMI is certainly useful, as we have plans to extend the
> capabilities of ARMCI to perform general RMI.
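>
> For example, a shared task counter for dynamic load balancing can be
> built on the existing ARMCI RMW call (a sketch; remote_counter would
> come from a collective allocation such as ARMCI_Malloc):
>
>     int next_task;
>     /* atomically fetch the remote counter into next_task, then add 1 */
>     ARMCI_Rmw(ARMCI_FETCH_AND_ADD, &next_task, remote_counter, 1, proc);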
>
> --
> Thanks,
> -Manoj.
> ---------------------------------------------------------------
> Manojkumar Krishnan
> High Performance Computing Group
> Pacific Northwest National Laboratory
> Ph: (509) 372-4206   Fax: (509) 372-4720
> http://hpc.pnl.gov/people/manoj
> ---------------------------------------------------------------
>
>
>>
>> Bill
>>
>> > The above are feature requirements rather than performance issues,
>> > which are implementation specific.
>> >
>> > If you have any questions, I would be happy to explain the above in
>> > detail.
>> >
>> > Thanks,
>> > -Manoj.
>> > ---------------------------------------------------------------
>> > Manojkumar Krishnan
>> > High Performance Computing Group
>> > Pacific Northwest National Laboratory
>> > Ph: (509) 372-4206   Fax: (509) 372-4720
>> > http://hpc.pnl.gov/people/manoj
>> > ---------------------------------------------------------------
>> >
>> > On Wed, 3 Mar 2010, William Gropp wrote:
>> >
>> >> I went through the mail that was sent out in response to our request
>> >> for use cases, and I must say it was underwhelming.  I've included a
>> >> short summary below; based on this, we aren't looking at the correct
>> >> needs.  I don't think that these are representative of *all* of the
>> >> needs of RMA, but without good use cases, I don't see how we can
>> >> justify any but the most limited extensions/changes to the current
>> >> design.  Please (a) let me know if I overlooked something and (b)
>> >> send
>> >> me (and the list) additional use cases.  For example, we haven't
>> >> included any of the issues needed to implement PGAS languages, nor
>> >> have we addressed the existing SHMEM codes.  Do we simply say that a
>> >> high-quality implementation will permit interoperation with whatever
>> >> OpenSHMEM is?  And what do we do about the RMI issue that one of the
>> >> two use cases that we have raises?
>> >>
>> >> Basically, I received two detailed notes for the area of Quantum
>> >> Chemistry.  In brief:
>> >>
>> >> MPI-2 RMA is already adequate for most parts, as long as the
>> >> implementation makes progress (as it is required to) on passive
>> >> updates.  Requires get or accumulate; rarely requires put.
>> >> Dynamic load balancing (a related but separate issue) needs some sort
>> >> of RMW.  (Thanks to Jeff Hammond)
>> >>
>> >> More complex algorithms and data structures appear to require remote
>> >> method invocation (RMI).  Sparse tree updates provide one example.
>> >> (Thanks to Robert Harrison)
>> >>
>> >> Bill
>> >>
>> >>
>> >> William Gropp
>> >> Deputy Director for Research
>> >> Institute for Advanced Computing Applications and Technologies
>> >> Paul and Cynthia Saylor Professor of Computer Science
>> >> University of Illinois Urbana-Champaign



-- 
Jeff Hammond
Argonne Leadership Computing Facility
jhammond at mcs.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond



