[Mpi3-rma] Use cases for RMA

Wed Mar 3 16:10:18 CST 2010

Thanks.  More comments inline.

On Mar 3, 2010, at 3:40 PM, Manojkumar Krishnan wrote:

>
> Bill,
>
> Please scroll for my comments.
>
> On Wed, 3 Mar 2010, William Gropp wrote:
>
>> Thanks.  I've added some comments inline.
>>
>> On Mar 3, 2010, at 1:23 PM, Manojkumar Krishnan wrote:
>>
>>>
>>> Bill,
>>>
>>> Here are some of the MPI RMA requirements for Global Arrays (GA) and
>>> ARMCI. ARMCI is GA's runtime system.  GA/ARMCI exploits native
>>> network communication interfaces and system resources (such as shared
>>> memory) to achieve the best possible performance of the remote memory
>>> access/one-sided communication. GA/ARMCI relies *heavily* on optimized
>>> contiguous and non-contiguous RMA operations (get/put/acc).
>>>
>>> For GA/ARMCI and its applications, below are some specific examples of
>>> operations that are hard to achieve in MPI-2 RMA.
>>>
>>> 1. Memory Allocation: (This might be an implementation issue.) The user
>>> or library implementors should be able to allocate memory (e.g. shared
>>> memory) and register it with MPI. This is useful in the case of Global
>>> Arrays/ARMCI, which use RMA across nodes and shared memory within nodes.
>>> ARMCI allocates a shared memory segment and pins/registers it with the
>>> network.
>>
>> MPI already provides MPI_Alloc_mem, though this is not necessarily
>> shared memory usable by other processes.  But the MPI implementation
>> could allocate this from a shared memory pool and perform one-sided
>> operations using load-store.  Do you need something different?
>>
>
> The thing is, this involves an additional copy via RMA_Get. Is it
> possible for other processes within the same SMP node to access the
> memory directly (rather than get/put using load-store)?
>
> For example, something like this: in GA, you get *direct* access
> (a pointer) to memory regions allocated by processes within the SMP node.
> If proc 0 and 1 are on the same node, proc 1 can access proc 0's region
> without explicit get/put (i.e. proc 1 gets a pointer to this region).

What we may need is a way to provide shared memory (provided by some  
other service) to an MPI Window, and arrange the access/completion for  
it.  This is something we should look at.
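
To make the request concrete, here is a minimal sketch of direct
load/store access between two processes on one node. It uses plain POSIX
mmap() and fork() rather than ARMCI or MPI; the helper name
shared_region_demo and the mechanism are assumptions for illustration
only, not a description of how GA/ARMCI is implemented.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child ("proc 1") stores into a region owned by
 * the parent ("proc 0") through a plain pointer; no get/put call is made.
 * Returns the value the parent reads back, or -1 on error. */
int shared_region_demo(int value)
{
    /* Anonymous shared mapping: visible to both processes after fork(). */
    int *region = mmap(NULL, sizeof *region, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;
    *region = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(region, sizeof *region);
        return -1;
    }
    if (pid == 0) {
        *region = value;   /* "proc 1": direct store, no copy */
        _exit(0);
    }

    /* "proc 0": waiting for the child is the only synchronization here. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(region, sizeof *region);
        return -1;
    }
    int seen = *region;    /* direct load of what the child wrote */
    munmap(region, sizeof *region);
    return seen;
}
```

The open question for MPI is the part this sketch leaves out: how a
window would describe such a region and what defines access and
completion for it.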

>
>>>
>>> 2. Locks: Should be made optional to keep the RMA programming model
>>> simple. If the user does not require concurrency, then locks are
>>> unnecessary. Enforcing the use of locks by default might introduce
>>> unnecessary bugs if not carefully programmed (esp. at the level of
>>> extreme scale systems, where they are hard to debug).
>>>
>>
>> You need something to provide for local and remote completion.  Do you
>> mean to have blocking RMA operations that (in MPI terms) essentially
>> act as lock-RMA_operation-unlock?
>
> Not really, I meant performing RMA operations without locks, if locks
> are not needed. Instead of making them the default, we can have locks as
> optional.

I still don't understand what the problem is with "locks" (note that
the name is poorly chosen; it isn't really a lock in the usual
sense).  What lock/unlock does is simply indicate "begin RMA access at
target" and "end RMA access at target", with the option of permitting
exclusive access to the target window.  Below, I've put what I
think are the equivalent MPI RMA calls:

>
> With respect to completion and synchronization, here is an example of how
> it is done in ARMCI.
>
> ARMCI_Put: Ensures local completion and source buffer is safe to reuse

Lock/Put/Unlock.  Permits concurrent non-overlapping destination
updates if shared mode is selected.

>
> ARMCI_NbPut: Non-blocking put. User has to call wait() to reuse source
> buffer.

Lock/Put.  Unlock is like Wait.
>
> ARMCI_Put(dst_proc)+ARMCI_Fence(dst_proc): Ensures local and remote
> completion.

Also Lock/Put/Unlock

>
> ARMCI_Allfence. This is collective. Ensures all outstanding RMA
> operations are completed on the remote side. Mostly called with Barrier.

Win_fence  (active target sync)

>
> ARMCI_Put_Notify/Wait. This is like lightweight message passing. After a
> (large) put, the sender can send a notify message, and the destination
> process can wait on it. This is useful if send/recv is in rendezvous
> mode and the send essentially blocks for a slow receiver. In the case of
> Put/Put_notify, the sender can issue 2 separate R(D)MAs with no need to
> handshake with the receiver. Thus it is relatively less synchronous. On
> the other hand, if the sender is slow, then the receiver has to wait!
>
> Please note that, if such a synchronization/completion is required
> between source and destination, then the user should avoid using RMA (and
> use Send/Recv), as this is truly 2-sided communication.
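
The put-then-notify ordering described above can be modelled on a single
node. This is only a sketch under stated assumptions: it uses fork(), a
shared mmap() region and C11 atomics in place of a network, and the names
mailbox and put_notify_demo are hypothetical; it is not ARMCI's interface
or implementation.

```c
#define _DEFAULT_SOURCE
#include <stdatomic.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { PAYLOAD_MAX = 64 };

typedef struct {
    atomic_int notified;        /* stays 0 until the notify arrives */
    char payload[PAYLOAD_MAX];  /* destination buffer of the "put" */
} mailbox;

/* Hypothetical demo: the child is the sender, the parent the receiver.
 * Returns 0 if the receiver sees the whole payload once notified,
 * -1 on error or mismatch. The receiver spins, so a sender that never
 * notifies would leave it waiting. */
int put_notify_demo(const char *msg)
{
    size_t len = strlen(msg);
    if (len >= PAYLOAD_MAX)
        return -1;

    mailbox *box = mmap(NULL, sizeof *box, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (box == MAP_FAILED)
        return -1;
    atomic_init(&box->notified, 0);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(box, sizeof *box);
        return -1;
    }
    if (pid == 0) {
        memcpy(box->payload, msg, len + 1);            /* the put */
        atomic_store_explicit(&box->notified, 1,
                              memory_order_release);   /* the notify */
        _exit(0);
    }

    /* The wait: the acquire load pairs with the sender's release store,
     * so the payload is complete once the flag is seen. */
    while (atomic_load_explicit(&box->notified, memory_order_acquire) == 0)
        ;

    int ok = (strcmp(box->payload, msg) == 0) ? 0 : -1;
    waitpid(pid, NULL, 0);
    munmap(box, sizeof *box);
    return ok;
}
```

The sender never waits for the receiver; only the receiver blocks, which
matches the "if the sender is slow, then the receiver has to wait" remark.
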
>
>>
>>> 3. Support for non-overlapping concurrent operations in a window.
>>
>> MPI RMA already does this, in all of the active target modes, and in
>> the passive target mode with the shared lock mode.  What do you need
>> that isn't supported?
>
> In the passive target mode, does this require some synchronization between
> the source and destination processes if they are performing
> non-overlapping operations in the same window? If so, then this might not
> be a truly one-sided operation. Please correct me if I am wrong.

The only synchronization is that an exclusive_lock must block  
subsequent shared locks and must wait for executing shared locks to  
complete.  Is the issue that you never want to check?  If so, then  
this could be added as a Window characteristic.

>
>>
>>>
>>> 4. RMW Operation - Useful for implementing dynamic load balancing
>>> algorithms (e.g. task queues/work stealing, group/global counters,
>>> etc).
>>>
>>
>> Yes, this is a major gap.  Is fetch-and-increment enough, or do you
>> need compare-and-swap, or a general RMI?
>
> For ARMCI, we need both: fetch-and-increment and compare-and-swap. A
> general RMI is certainly useful, as we have plans to extend the capability
> of ARMCI to perform general RMI.
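
For reference, the two RMW primitives named here look as follows in
shared memory with C11 atomics. This is a sketch of their semantics only,
assuming a single address space; the names task_counter,
task_counter_claim and slot_try_acquire are hypothetical, and a remote
version would need an RMA equivalent, which is the gap under discussion.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared counter handing out task ids 0 .. total-1. */
typedef struct {
    atomic_long next;   /* next unclaimed task id */
    long total;         /* number of tasks available */
} task_counter;

void task_counter_init(task_counter *c, long total)
{
    atomic_init(&c->next, 0);
    c->total = total;
}

/* Fetch-and-increment: each caller gets a distinct task id,
 * or -1 once all tasks have been handed out. */
long task_counter_claim(task_counter *c)
{
    long id = atomic_fetch_add(&c->next, 1);
    return (id < c->total) ? id : -1;
}

/* Compare-and-swap: take ownership of a slot only if it is still free.
 * A value of -1 in *owner means "free". Returns true on success. */
bool slot_try_acquire(atomic_long *owner, long me)
{
    long expected = -1;
    return atomic_compare_exchange_strong(owner, &expected, me);
}
```

Fetch-and-increment is enough for a global task counter; compare-and-swap
is what work stealing and similar queue updates typically build on.
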
>
> -- 
> Thanks,
> -Manoj.
> ---------------------------------------------------------------
> Manojkumar Krishnan
> High Performance Computing Group
> Pacific Northwest National Laboratory
> Ph: (509) 372-4206   Fax: (509) 372-4720
> http://hpc.pnl.gov/people/manoj
> ---------------------------------------------------------------
>
>
>>
>> Bill
>>
>>> The above are feature requirements rather than performance issues, which
>>> are implementation specific.
>>>
>>> If you have any questions, I would be happy to explain the above in
>>> detail.
>>>
>>> Thanks,
>>> -Manoj.
>>>
>>> On Wed, 3 Mar 2010, William Gropp wrote:
>>>
>>>> I went through the mail that was sent out in response to our request
>>>> for use cases, and I must say it was underwhelming.  I've included a
>>>> short summary below; based on this, we aren't looking at the correct
>>>> needs.  I don't think that these are representative of *all* of the
>>>> needs of RMA, but without good use cases, I don't see how we can
>>>> justify any but the most limited extensions/changes to the current
>>>> design.  Please (a) let me know if I overlooked something and (b) send
>>>> me (and the list) additional use cases.  For example, we haven't
>>>> included any of the issues needed to implement PGAS languages, nor
>>>> have we addressed the existing SHMEM codes.  Do we simply say that a
>>>> high-quality implementation will permit interoperation with whatever
>>>> OpenSHMEM is?  And what do we do about the RMI issue that one of the
>>>> two use cases that we have raises?
>>>>
>>>> Basically, I received two detailed notes for the area of Quantum
>>>> Chemistry.  In brief:
>>>>
>>>> MPI-2 RMA is already adequate for most parts, as long as the
>>>> implementation makes progress (as it is required to) on passive
>>>> updates.  Requires get or accumulate; rarely requires put.
>>>> Dynamic load balancing (a related but separate issue) needs some sort
>>>> of RMW.  (Thanks to Jeff Hammond)
>>>>
>>>> More complex algorithms and data structures appear to require remote
>>>> method invocation (RMI).  Sparse tree updates provide one example.
>>>> (Thanks to Robert Harrison)
>>>>
>>>> Bill
>>>>
>>>>
>>>> William Gropp
>>>> Deputy Director for Research
>>>> Institute for Advanced Computing Applications and Technologies
>>>> Paul and Cynthia Saylor Professor of Computer Science
>>>> University of Illinois Urbana-Champaign
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> mpi3-rma mailing list
>>>> mpi3-rma at lists.mpi-forum.org
>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>>>>

William Gropp
Deputy Director for Research
Institute for Advanced Computing Applications and Technologies
Paul and Cynthia Saylor Professor of Computer Science
University of Illinois Urbana-Champaign