[Mpi3-rma] RMA proposal 1 update
Jeff Hammond
jeff.science at gmail.com
Fri May 21 15:11:23 CDT 2010
Please explain how active target gives me ARMCI.
Jeff
Sent from my iPhone
On May 21, 2010, at 3:05 PM, "Barrett, Brian W" <bwbarre at sandia.gov>
wrote:
> I don't know about "illegal" as much as "doesn't make sense". Keith
> brought this up, but I think it got lost in other discussions...
> What are the semantics of a collective operation during a point-to-
> point epoch? We're constrained by the existing API (as we've
> decided not to write a new API).
>
> I can see a couple of scenarios for allflushall and passive target,
> none of which I like. The first is that allflushall only flushes
> with peers it currently holds a window open to. This means tracking
> such state, which I think is a bad idea because of the state
> required. The second option is that it's erroneous to call
> allflushall unless there's a passive access epoch to every peer in
> the window. This seems to encourage a behavior I don't like (namely,
> having to keep such an epoch open at all times, roughly as sketched
> below).
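>
> That second option forces a pattern like the one below (a minimal
> sketch using lock_all/flush_all-style calls for concreteness; the
> collective allflushall is still hypothetical, so flush_all + barrier
> stands in where it would go):
>
>   #include <mpi.h>
>
>   /* Keep a passive access epoch open to every peer for the lifetime
>    * of the window so that a collective completion call is always
>    * legal.  Names follow the RMA proposal; nothing here is final. */
>   void hold_epochs_forever(void *base, MPI_Aint size, MPI_Comm comm,
>                            int niters)
>   {
>       MPI_Win win;
>       MPI_Win_create(base, size, 1, MPI_INFO_NULL, comm, &win);
>       MPI_Win_lock_all(0, win);   /* epoch to all peers, never closed */
>
>       for (int iter = 0; iter < niters; iter++) {
>           /* ... MPI_Put / MPI_Get / MPI_Accumulate to any targets ... */
>           MPI_Win_flush_all(win); /* where allflushall would sit */
>           MPI_Barrier(comm);
>       }
>
>       MPI_Win_unlock_all(win);
>       MPI_Win_free(&win);
>   }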
>
> I could possibly see the benefit of an allflushall in active
> target, where group semantics are a bit better defined, but
> that's an entirely different discussion.
>
> Brian
>
> --
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories
> ________________________________________
> From: mpi3-rma-bounces at lists.mpi-forum.org [mpi3-rma-
> bounces at lists.mpi-forum.org] On Behalf Of Underwood, Keith D
> [keith.d.underwood at intel.com]
> Sent: Friday, May 21, 2010 1:44 PM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
> That's an interesting point. I would think that would be illegal?
>
> Keith
>
>> -----Original Message-----
>> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
>> bounces at lists.mpi-forum.org] On Behalf Of Rajeev Thakur
>> Sent: Friday, May 21, 2010 1:38 PM
>> To: 'MPI 3.0 Remote Memory Access working group'
>> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>>
>> Lock-unlock is not collective. If we add an allflushall, how would
>> the processes that don't call lock-unlock call it? Just directly?
>>
>> Rajeev
>>
>>
>>> -----Original Message-----
>>> From: mpi3-rma-bounces at lists.mpi-forum.org
>>> [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of
>>> Underwood, Keith D
>>> Sent: Friday, May 21, 2010 11:22 AM
>>> To: MPI 3.0 Remote Memory Access working group
>>> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>>>
>>> There is a proposal on the table to add flush(rank) and
>>> flushall() as local calls for passive target to allow
>>> incremental remote completion without calling unlock().
>>> There is an additional proposal to also include
>>> allflushall(), which would be a collective remote completion
>>> semantic added to passive target (nominally, we might not
>>> want to call it passive target anymore if we did that, but
>>> that is a different discussion). The question I had posed
>>> was: can you really get a measurable performance advantage
>>> in realistic usage scenarios by having allflushall() instead
>>> of just doing flushall() + barrier()? I was asking if
>>> someone could provide an implementation sketch - preferably
>>> along with a real usage scenario where that implementation
>>> sketch would yield a performance advantage. A
>>> back-of-the-envelope assessment of that would be really
>>> nice to have.
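>>>
>>> To make the comparison concrete, here is a minimal sketch of the two
>>> candidate call sequences (written with flush_all-style bindings;
>>> exact names are not settled, and the collective variant appears
>>> only as a comment since it does not exist):
>>>
>>>   #include <mpi.h>
>>>
>>>   /* Assumes the caller already holds a passive access epoch
>>>    * (e.g. via lock_all) on win. */
>>>   void complete_iteration(MPI_Win win, MPI_Comm comm, const char *buf,
>>>                           int nbytes, const int *targets, int ntargets)
>>>   {
>>>       for (int i = 0; i < ntargets; i++)
>>>           MPI_Put(buf, nbytes, MPI_BYTE, targets[i], 0,
>>>                   nbytes, MPI_BYTE, win);
>>>
>>>       /* Option A: local remote-completion flush, then a barrier to
>>>        * learn that everyone else has flushed too. */
>>>       MPI_Win_flush_all(win);
>>>       MPI_Barrier(comm);
>>>
>>>       /* Option B (hypothetical): a single collective, e.g.
>>>        * MPIX_Win_allflush_all(win), replacing the two calls above.
>>>        * The question is when B measurably beats A. */
>>>   }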
>>>
>>> My contention was that the number of targets with outstanding
>>> requests from a given process between one flush/flushall and the
>>> next one would frequently not be large enough to justify the cost
>>> of a collective. Alternatively, if messages in that window
>>> are "large" (for some definition of "large" that is likely
>>> less than 4KB and is certainly no larger than the rendezvous
>>> threshold), I would contend that generating a software ack
>>> for each one would be essentially zero overhead and would
>>> allow source side tracking of remote completion such that a
>>> flushall could be a local operation.
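>>>
>>> A sketch of what I mean by source-side tracking, assuming the
>>> transport delivers a per-request software ack to the origin (the
>>> names here, including poll_network(), are made up for illustration):
>>>
>>>   /* Origin-side bookkeeping inside a hypothetical RMA runtime. */
>>>   extern void poll_network(void); /* assumed progress hook */
>>>
>>>   static long issued = 0;  /* one-sided requests injected locally */
>>>   static long acked  = 0;  /* remote-completion acks received     */
>>>
>>>   void on_request_injected(void) { issued++; }
>>>   void on_ack_received(void)     { acked++;  }
>>>
>>>   void local_flushall(void)
>>>   {
>>>       /* Purely local: spin on progress until every injected request
>>>        * has been acked; no new messages are generated. */
>>>       while (acked < issued)
>>>           poll_network();
>>>   }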
>>>
>>> There is a completely separate discussion that needs to occur
>>> on whether it is better to add collective completion to
>>> passive target or one-sided completion to active target (both
>>> of which are likely to meet some resistance because of the
>>> impurity they introduce into the model/naming) or whether the
>>> two need to be mixed at all.
>>>
>>> Keith
>>>
>>>> -----Original Message-----
>>>> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
>>>> bounces at lists.mpi-forum.org] On Behalf Of Douglas Miller
>>>> Sent: Friday, May 21, 2010 7:16 AM
>>>> To: MPI 3.0 Remote Memory Access working group
>>>> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>>>>
>>>> It is now obvious that I did not do a sufficient job of explaining
>>>> things
>>>> before. It's not even clear to me anymore exactly what we're
>> talking
>>>> about
>>>> - is this ARMCI? MPI2? MPI3? Active-target one-sided?
>>> Passive target?
>>>> My
>>>> previous explanation was for BG/P MPI2 active-target one-sided
>> using
>>>> MPI_Win_fence synchronization. BG/P MPI2 implementations for
>>>> MPI_Win_post/start/complete/wait and passive-target
>>> MPI_Win_lock/unlock
>>>> use
>>>> different methods for determining target completion, although
>>>> internally
>>>> they may use some/all of the same structures.
>>>>
>>>> What was the original question?
>>>>
>>>> _______________________________________________
>>>> Douglas Miller, BlueGene Messaging Development
>>>> IBM Corp., Rochester, MN USA, Bldg 030-2 A410
>>>> dougmill at us.ibm.com
>>>>
>>>>
>>>>
>>>> "Underwood, Keith
>>>> D"
>>>> <keith.d.underwoo
>>>> To
>>>> d at intel.com> "MPI 3.0 Remote Memory
>> Access
>>>> Sent by: working group"
>>>> mpi3-rma-bounces@
>>> <mpi3-rma at lists.mpi-forum.org>
>>>> lists.mpi-forum.o
>>>> cc
>>>> rg
>>>>
>>>> Subject
>>>> Re: [Mpi3-rma] RMA proposal
>> 1
>>>> 05/20/2010 04:56 update
>>>> PM
>>>>
>>>>
>>>> Please respond to
>>>> "MPI 3.0 Remote
>>>> Memory Access
>>>> working group"
>>>> <mpi3-rma at lists.m
>>>> pi-forum.org>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> My point was, the way Jeff is doing synchronization in
>>> NWChem is via a
>>>> fenceall(); barrier(); on the equivalent of MPI_COMM_WORLD.
>>> If I knew
>>>> he
>>>> was going to be primarily doing this (i.e., that he wanted to
>>> know that
>>>> all
>>>> nodes were synched), I would do something like maintain
>>> counts of sent
>>>> and
>>>> received messages on each node. I could then do something like an
>>>> allreduce
>>>> of those 2 ints over the tree to determine if everyone is synched.
>>>> There
>>>> are probably some technical details that would have to be
>>> worked out to
>>>> ensure this works but it seems good from 10000 feet.
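>>>>
>>>> Concretely, something like this (a sketch, not actual BG/P code; the
>>>> two counters would be bumped by the messaging layer on every
>>>> one-sided message injected and every one-sided message landed):
>>>>
>>>>   #include <mpi.h>
>>>>
>>>>   /* local[0] = messages this node has sent,
>>>>    * local[1] = messages this node has received. */
>>>>   int everyone_synched(const long local[2], MPI_Comm comm)
>>>>   {
>>>>       long global[2];
>>>>       MPI_Allreduce(local, global, 2, MPI_LONG, MPI_SUM, comm);
>>>>       /* Quiescent only if every message sent has been received. */
>>>>       return global[0] == global[1];
>>>>   }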
>>>>
>>>> Right now we do numprocs 0-byte get operations to make sure
>>> the torus
>>>> is
>>>> flushed on each node. A torus operation is ~3us on a
>>> 512-way. It grows
>>>> slowly with number of midplanes. I'm sure a 72 rack longest
>>> Manhattan
>>>> distance noncongested pingpong is <10us, but I don't have
>>> the data in
>>>> front
>>>> of me.
>>>>
>>>> Based on Doug's email, I had assumed you would know who you
>>> have sent
>>>> messages to. If you knew that in a given fence interval
>>> the node had
>>>> only
>>>> sent distinct messages to 1K other cores, you would only
>>> have 1K gets
>>>> to
>>>> issue. Suck? Yes. Worse than the tree messages? Maybe,
>>> maybe not.
>>>> There is definitely a cross-over between 1 and np
>>> outstanding messages
>>>> between fences where on the 1 side of things the tree messages are
>>>> worse
>>>> and on the np side of things the tree messages are better. There
>> is
>>>> another spectrum based on request size where getting a response for
>>>> every
>>>> request becomes an inconsequential overhead. I would have
>>> to know the
>>>> cost
>>>> of processing a message, the size of a response, and the cost of
>>>> generating
>>>> that response to create a proper graph of that.
>>>>
>>>> A tree int/sum is roughly 5us on a 512-way and grows
>>> similarly. I would
>>>> postulate that a 72 rack MPI allreduce int/sum is on the
>>> order of 10us.
>>>>
>>>> So you generate np*np messages vs 1 tree message. Contention and
>> all
>>>> the
>>>> overhead of that many messages will be significantly worse than
>> even
>>>> several tree messages.
>>>> Oh, wait, so, you would sum all sent and sum all received and then
>>>> check if
>>>> they were equal? And then (presumably) iterate until the answer
>> was
>>>> yes?
>>>> Hrm. That is more interesting. Can you easily separate
>>> one-sided and
>>>> two-sided messages in your counting while maintaining the performance
>> of
>>>> one-sided messages?
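>>>>
>>>> i.e., something like wrapping that two-counter allreduce in a loop
>>>> (again only a sketch; as noted above, there are details to work out
>>>> to close the races, e.g. requiring two consecutive matching rounds):
>>>>
>>>>   #include <mpi.h>
>>>>
>>>>   /* One tree operation per pass; m passes total. */
>>>>   void allfenceall_by_counting(volatile const long counters[2],
>>>>                                MPI_Comm comm)
>>>>   {
>>>>       long global[2];
>>>>       do {
>>>>           long local[2] = { counters[0], counters[1] };
>>>>           MPI_Allreduce(local, global, 2, MPI_LONG, MPI_SUM, comm);
>>>>       } while (global[0] != global[1]);
>>>>   }
>>>>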
>>>> Doug's earlier answer implied you were going to allreduce a
>>> vector of
>>>> counts (one per rank) and that would have been ugly. I am
>> assuming
>>>> you
>>>> would do at least 2 tree messages in what I believe you are
>>> describing,
>>>> so
>>>> there is still a crossover between n*np messages and m tree
>> messages
>>>> (where
>>>> n is the number of outstanding requests between fencealls
>>> and 2 <= m <=
>>>> 10), and the locality of communications impacts that crossover.
>>>> BTW, can you actually generate messages fast enough to
>>> cause contention
>>>> with tiny messages?
>>>> Anytime I know that an operation is collective, I can
>>> almost guarantee
>>>> I
>>>> can do it better than even a good pt2pt algorithm if I am
>>> utilizing our
>>>> collective network. I think on machines that have remote completion
>>>> notification an allfenceall() is just a barrier(), and since
>>>> fenceall();
>>>> barrier(); is going to be replaced by allfenceall(), it
>>> doesn't seem to
>>>> me
>>>> like it is any extra overhead if allfenceall() is just a
>>> barrier() for
>>>> you.
>>>>
>>>>
>>>> My concerns are twofold: 1) we are talking about adding collective
>>>> completion to passive target when active target was the one
>>> designed to
>>>> have collective completion. That is semantically and API-wise a
>> bit
>>>> ugly.
>>>> 2) I think allfenceall() as a collective will be optimized for the
>>>> case where you have outstanding requests to everybody, and I believe
>>>> that will be slower in the typical case of having outstanding
>>>> requests to only some people. I think that users would typically call
>>> allfenceall() rather
>>>> than
>>>> fenceall() + barrier() and then they would see a
>>> performance paradox:
>>>> the
>>>> fenceall() + barrier() could be substantially faster when you have
>> a
>>>> "small" number of peers you are communicating with in this
>>> iteration.
>>>> I am
>>>> not at all worried about the overhead of allfenceall() for networks
>>>> with
>>>> remote completion.
>>>> Keith
>>>>
>>>>
>>>>
>>>> From: "Underwood, Keith D" <keith.d.underwood at intel.com>
>>>>
>>>> To: "MPI 3.0 Remote Memory Access working group"
>>>> <mpi3-rma at lists.mpi-forum.org>
>>>>
>>>> Date: 05/20/2010 09:19 AM
>>>>
>>>> Subject Re: [Mpi3-rma] RMA proposal 1 update
>>>> :
>>>>
>>>> Sent mpi3-rma-bounces at lists.mpi-forum.org
>>>> by:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> mpi3-rma mailing list
>>> mpi3-rma at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>>>
>>
>>
>> _______________________________________________
>> mpi3-rma mailing list
>> mpi3-rma at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>