[Mpi3-rma] nonblocking MPI_Win_create etc.?

Sayantan Sur sayantan.sur at gmail.com
Fri Sep 23 09:40:06 CDT 2011


On Thu, Sep 22, 2011 at 7:50 PM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:
> I guess that works in theory but it precludes a number of
> optimizations that would be possible with nonblocking window creation
> of the traditional variety.  In particular, I do not see how
> MPI_GET_ADDRESS addresses the issue of memory registration.  So I
> communicate a virtual address to the origin process.  How then does
> either NIC get the physical address registration required for RDMA?
> Would one instead be limited to whatever protocol supported RMA with
> virtual addresses?  I guess PERCS and Gemini don't care but Blue Gene
> and Infiniband seem to have a performance problem in that case.

The first RMA may not be able to use the physical address (it may need to
go over send/recv). However, the remote address (and keys) can be cached
for future use after the first communication. As Sreeram also pointed
out, this does not require O(N) storage.

The performance impact will be negligible if the application keeps the
window around for any reasonable length of time.
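
For concreteness, here is a rough sketch of the dynamic-window pattern
Rajeev described, using the proposed MPI-3 calls (illustrative only; no
error handling, needs at least two ranks, and the draft interface may
still change):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One dynamic window on comm_world; no per-process base or disp_unit. */
    MPI_Win win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double buf[100] = {0};
    MPI_Aint remote_addr = 0;

    if (rank == 1) {
        /* Target attaches memory locally and publishes its address. */
        MPI_Win_attach(win, buf, sizeof(buf));
        MPI_Aint addr;
        MPI_Get_address(buf, &addr);
        MPI_Send(&addr, 1, MPI_AINT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&remote_addr, 1, MPI_AINT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    if (rank == 0) {
        /* The received address serves as the target displacement. */
        double one = 1.0;
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        MPI_Put(&one, 1, MPI_DOUBLE, 1, remote_addr, 1, MPI_DOUBLE, win);
        MPI_Win_unlock(1, win);      /* completes the put at the target */
    }

    MPI_Barrier(MPI_COMM_WORLD);     /* ensure the put lands before detach */
    if (rank == 1)
        MPI_Win_detach(win, buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Only the first communication to a newly attached region would pay the
rendezvous cost; after that, the cached registration is reused.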

Sayantan.

>
> Jeff
>
> On Thu, Sep 22, 2011 at 5:20 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>> The implementation of MPI_WIN_CREATE_DYNAMIC(info, comm, win) need not have O(N) metadata, I think, since there is no base address or disp_unit argument passed separately by each process.
>>
>> Rajeev
>>
>>
>> On Sep 22, 2011, at 5:08 PM, Jeff Hammond wrote:
>>
>>> The reason to put windows on subgroups is to avoid the O(N) metadata
>>> in the window associated with registered memory.  For example, on BGP
>>> a window has an O(N) allocation for DCMF memregions.  In the code my
>>> friend develops, N=300000 on comm_world but N<200 on a subgroup.  He
>>> is at the limit of available memory, which is what motivated the use
>>> case for subgroup windows in the first place.
>>>
>>> I do not see how one can avoid O(N) metadata with
>>> MPI_Win_create_dynamic on comm_world in the general case, unless one
>>> completely abandons RDMA.  How exactly does registered memory become
>>> visible when the user calls MPI_Win_attach?
>>>
>>> Jeff
>>>
>>> On Thu, Sep 22, 2011 at 4:58 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>>>> In the new RMA, he could just call MPI_Win_create_dynamic once on comm_world and then locally attach memory to it using MPI_Win_attach. (And avoid using fence synchronization.)
>>>>
>>>> Rajeev
>>>>
>>>> On Sep 22, 2011, at 4:25 PM, Jeff Hammond wrote:
>>>>
>>>>> I work with someone who has a use case for nonblocking window creation,
>>>>> because he can get into a deadlock situation unless he does a lot of
>>>>> bookkeeping.  He's creating windows on subgroups of comm_world that can
>>>>> (and will) overlap.  To prevent deadlock, he would have to do a global
>>>>> collective and figure out how to order all of the window-creation calls
>>>>> so that they do not deadlock; if that ordering requires solving an
>>>>> NP-hard problem (it smells like the scheduling problem to me) or
>>>>> requires too much storage to be practical (he works at Juelich and
>>>>> regularly runs on 72 racks in VN mode), he will have to serialize
>>>>> window creation globally.
>>>>>
>>>>> Nonblocking window creation plus a waitall would solve this problem.
>>>>>
>>>>> Thoughts?  I wonder whether the semantics of nonblocking collectives -
>>>>> which do not have tags - are even sufficient in the general case.
>>>>>
>>>>> Jeff
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/index.php/User:Jhammond