[Mpi3-rma] nonblocking MPI_Win_create etc.?

Jeff Hammond jhammond at alcf.anl.gov
Thu Sep 22 21:50:14 CDT 2011


I guess that works in theory, but it precludes a number of
optimizations that would be possible with nonblocking window creation
of the traditional variety.  In particular, I do not see how
MPI_GET_ADDRESS addresses the issue of memory registration.  Say I
communicate a virtual address to the origin process: how then does
either NIC get the physical-address registration required for RDMA?
Would one instead be limited to whatever protocols support RMA with
virtual addresses?  I guess PERCS and Gemini don't care, but Blue
Gene and InfiniBand seem to have a performance problem in that case.
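
For concreteness, here is a minimal sketch of the pattern Rajeev
describes below, written against the proposed MPI-3 dynamic-window
calls (the ranks, buffer size, and the send/recv used to publish the
address are just illustrative, not part of the proposal):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Win  win;
    MPI_Aint remote;
    int      rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One collective call on comm_world; no per-process base address
     * or disp_unit is gathered into the window. */
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        /* Target: expose memory locally, then publish only a virtual
         * address.  Nothing about NIC registration travels with it. */
        int *buf = malloc(sizeof *buf);
        *buf = 42;
        MPI_Win_attach(win, buf, sizeof *buf);

        MPI_Aint addr;
        MPI_Get_address(buf, &addr);
        MPI_Send(&addr, 1, MPI_AINT, 1, 0, MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);  /* keep buf attached until done */
        MPI_Win_detach(win, buf);
        free(buf);
    } else if (rank == 1) {
        /* Origin: the received virtual address is the target_disp. */
        MPI_Recv(&remote, 1, MPI_AINT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Get(&value, 1, MPI_INT, 0, remote, 1, MPI_INT, win);
        MPI_Win_unlock(0, win);
        printf("got %d\n", value);
        MPI_Barrier(MPI_COMM_WORLD);
    } else {
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

That the window carries no per-process base or disp_unit is exactly
Rajeev's point about the metadata; my question is what has to happen
underneath MPI_Win_attach and MPI_Get on networks that want
registered memory.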

Jeff

On Thu, Sep 22, 2011 at 5:20 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> The implementation of MPI_WIN_CREATE_DYNAMIC(info, comm, win) need not have O(N) metadata, I think, since there is no base address or disp_unit argument passed separately by each process.
>
> Rajeev
>
>
> On Sep 22, 2011, at 5:08 PM, Jeff Hammond wrote:
>
>> The reason to put windows on subgroups is to avoid the O(N) metadata
>> in the window associated with registered memory.  For example, on BGP
>> a window has an O(N) allocation for DCMF memregions.  In the code my
>> friend develops, N=300000 on comm_world but N<200 on a subgroup.  He
>> is at the limit of available memory, which is what motivated the use
>> case for subgroup windows in the first place.
>>
>> I do not see how one can avoid O(N) metadata with
>> MPI_Win_create_dynamic on comm_world in the general case, unless one
>> completely abandons RDMA.  How exactly does registered memory become
>> visible when the user calls MPI_Win_attach?
>>
>> Jeff
>>
>> On Thu, Sep 22, 2011 at 4:58 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>>> In the new RMA, he could just call MPI_Win_create_dynamic once on comm_world and then locally attach memory to it using MPI_Win_attach. (And avoid using fence synchronization.)
>>>
>>> Rajeev
>>>
>>> On Sep 22, 2011, at 4:25 PM, Jeff Hammond wrote:
>>>
>>>> I work with someone who has a use case for nonblocking window
>>>> creation, because he can get into a deadlock situation unless he
>>>> does a lot of bookkeeping.  He's creating windows on subgroups of
>>>> comm_world that can (and will) overlap.  To prevent deadlock, he
>>>> has to do a global collective and figure out how to order all of
>>>> the window-creation calls so that they do not deadlock; where that
>>>> ordering requires solving an NP-hard problem (it smells like a
>>>> scheduling problem to me) or requires too much storage to be
>>>> practical (he works at Juelich and regularly runs on 72 racks in
>>>> VN mode), he has to serialize window creation globally.
>>>>
>>>> Nonblocking window creation and a waitall solve this problem.
>>>>
>>>> Thoughts?  I wonder if the semantics of nonblocking collectives -
>>>> which do not have tags - are even sufficient in the general case.
>>>>
>>>> Jeff



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/index.php/User:Jhammond



