[Mpi3-rma] RMA synchronization optimization [was: Updated MPI-3 RMA proposal 1]
Douglas Miller
dougmill at us.ibm.com
Fri Jun 25 07:20:33 CDT 2010
I will have to dig through the code and change logs and come up with a more
quantitative explanation.
Certainly, the issue is one of latency, not bandwidth. Consider platforms
like BlueGene, where the ratio of processor speed to network speed makes it
important to reduce the software overhead required to set up communications,
and even one set of load, compare, branch instructions can significantly
affect latency.
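
To make the "load, compare, branch" point concrete, here is a rough sketch of
where such instructions could show up on the PUT path once more than one
synchronization mode may be active. All names (win_state, put_fence_only,
put_any_mode, the checks counter) are hypothetical and for illustration only;
this is not the BlueGene or MPICH code, and memcpy stands in for the network.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; not taken from any real MPI implementation. */
typedef enum { EPOCH_NONE, EPOCH_FENCE, EPOCH_PSCW, EPOCH_LOCK } epoch_kind;

typedef struct {
    epoch_kind kind;      /* which synchronization mode opened the epoch */
    int target_locked;    /* consulted only in LOCK/UNLOCK mode */
    int group_started;    /* consulted only in START/COMPLETE mode */
    unsigned long checks; /* counts per-operation state tests (illustration) */
} win_state;

/* Stand-in for handing the transfer to the network hardware. */
static void issue_put(void *dst, const void *src, size_t len) {
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, len);
}

/* Fast path when only one mode (here: fence) can ever be active:
 * no epoch state needs to be examined before issuing. */
static int put_fence_only(win_state *w, void *dst, const void *src, size_t len) {
    (void)w;
    issue_put(dst, src, len);
    return 0;
}

/* General path: every put first loads the epoch state, compares, branches. */
static int put_any_mode(win_state *w, void *dst, const void *src, size_t len) {
    w->checks++;                       /* test of the epoch kind */
    switch (w->kind) {
    case EPOCH_FENCE:
        break;
    case EPOCH_PSCW:
        w->checks++;                   /* second test: has START been called? */
        if (!w->group_started) return -1;
        break;
    case EPOCH_LOCK:
        w->checks++;                   /* second test: is the target locked? */
        if (!w->target_locked) return -1;
        break;
    default:
        return -1;                     /* no epoch open */
    }
    issue_put(dst, src, len);
    return 0;
}
```

The checks counter exists only to make the extra tests visible; whether they
matter in practice depends on the platform, which is what the quantitative
explanation would have to show.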
_______________________________________________
Douglas Miller BlueGene Messaging Development
IBM Corp., Rochester, MN USA Bldg 030-2 A410
dougmill at us.ibm.com Douglas Miller/Rochester/IBM
Pavan Balaji <balaji at mcs.anl.gov>
Sent by: mpi3-rma-bounces at lists.mpi-forum.org
06/23/2010 05:51 PM
To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
Subject: Re: [Mpi3-rma] RMA synchronization optimization [was: Updated MPI-3 RMA proposal 1]
Please respond to: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
Doug,
Yes, MPI calls the lock/unlock and fences as epochs. But what I was
trying to point out is that your RMA operations are restricted within
win_create/free. Anyway, don't worry about that -- my attempt to clarify
ended up creating more confusion.
With respect to your description below, there's still no detail on
why multiple epoch type possibilities are making the code slower. Your
description below continues to argue that it makes it more complex, but
there's no clear description of how this complexity translates to
performance impact.
-- Pavan
On 06/23/2010 11:42 AM, Douglas Miller wrote:
> Does the MPI standard state that RMA operations can commence without
> having called FENCE, START, or LOCK? Is it legal to do WIN_CREATE,
> RMA..., WIN_FREE? Doesn't the MPI2 spec talk about epochs being between
> synch calls? like between two FENCE calls, or between START and COMPLETE
> or POST and WAIT, or between LOCK and UNLOCK? Certainly a rank may be
> the target of a LOCK without having performed any explicit operation
> aside from WIN_CREATE.
>
> I know that the reference MPICH implementation (ch3/nemesis) does not
> actually perform any RMA until the end of the epoch, and so it has much
> more information and can process the entire epoch at once, atomically.
> But for other platforms where it makes sense to get communications (RMA)
> started as early as possible, that means the synchronization epoch needs
> to be handled differently, in a more complex way. Because we need to
> actually start communications when the PUT, GET, or ACCUMULATE is
> called, that means the synch epoch has to be set up before that point.
> It also means that there is less information available than if
> everything were queued and examined as a whole at epoch end. There seems
> to be an expectation that one-sided operations will be faster than
> 2-sided, but this has not been the case due to the overhead of
> synchronization. Perhaps the 2-sided communication is just too fast, but
> it sure looks as though all this synchronization is just getting in the
> way.
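
The contrast drawn above, between queuing operations until the end of the
epoch and starting them as soon as PUT is called, might be sketched roughly as
follows. This is a toy model under stated assumptions: the names (win_sketch,
put_deferred, end_epoch_deferred, put_eager) are hypothetical, memcpy stands
in for the network, and it is not how MPICH or the BlueGene messaging layer is
actually written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 16

/* One recorded RMA operation (hypothetical layout). */
typedef struct { void *dst; const void *src; size_t len; } rma_op;

typedef struct {
    int epoch_open;              /* set by the synchronization call */
    rma_op pending[MAX_PENDING]; /* queue used only by the deferred variant */
    int npending;
    int issued;                  /* number of transfers actually performed */
} win_sketch;

/* Stand-in for the actual data movement. */
static void transfer(win_sketch *w, const rma_op *op) {
    assert(op->dst != NULL && op->src != NULL);
    memcpy(op->dst, op->src, op->len);
    w->issued++;
}

/* Deferred: put only records the operation; nothing moves yet. */
static int put_deferred(win_sketch *w, void *dst, const void *src, size_t len) {
    if (w->npending == MAX_PENDING) return -1;
    w->pending[w->npending++] = (rma_op){ dst, src, len };
    return 0;
}

/* At epoch end the whole batch is visible and is processed at once. */
static void end_epoch_deferred(win_sketch *w) {
    for (int i = 0; i < w->npending; i++)
        transfer(w, &w->pending[i]);
    w->npending = 0;
    w->epoch_open = 0;
}

/* Eager: data moves at the call, so the epoch must already be set up,
 * and each call sees only its own operation, not the whole batch. */
static int put_eager(win_sketch *w, void *dst, const void *src, size_t len) {
    if (!w->epoch_open) return -1;
    rma_op op = { dst, src, len };
    transfer(w, &op);
    return 0;
}
```

In the deferred variant the per-call work is just an append, while the eager
variant has to consult the epoch state on every call, which is the trade-off
described in the paragraph above.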
>
>
>
> _______________________________________________
> Douglas Miller BlueGene Messaging Development
> IBM Corp., Rochester, MN USA Bldg 030-2 A410
> dougmill at us.ibm.com Douglas Miller/Rochester/IBM
>
> Pavan Balaji <balaji at mcs.anl.gov>
> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> 06/23/2010 08:57 AM
> To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
> Subject: Re: [Mpi3-rma] RMA synchronization optimization [was: Updated MPI-3 RMA proposal 1]
> Please respond to: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
>
> Hi Doug,
>
> On 06/23/2010 08:01 AM, Douglas Miller wrote:
>> Right, the amount of code to maintain does increase, especially in the
>> case that nothing is deprecated. My concern is for the performance of
>> "common use" cases, which I think are where only one synchronization
>> mode is used (is this not true? are there any "real" codes using this?).
>
> From your description it is (somewhat) clear that the code complexity
> does increase, but to me it's not clear that it becomes more
> inefficient. Why does the possibility that an RMA operation might happen
> sometime later make it more inefficient?
>
> The way the MPI standard is structured is that RMA operations can
> happen anytime between Win_create/alloc and Win_free, which seems like
> an "epoch" in terms of your expectation.
>
> -- Pavan
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji