[Mpi3-rma] mpi3-rma post from bradc at cray.com requires approval

Underwood, Keith D keith.d.underwood at intel.com
Sat Jun 5 21:08:05 CDT 2010


Well, the use case isn't theoretical: my understanding is that Quadrics explicitly supported it, driven by customer needs. However, as I said before, (2) is the one I think is really critical. I believe it is the only one you must have to get something resembling sequential consistency from a single thread.

Keith
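As a toy illustration of option (2) from the list quoted below (ordering between replaces to a given location), here is a Python model of two replaces from a single origin that the network delivers in reverse order. `Replace`, `apply_unordered` and `apply_per_location_ordered` are hypothetical names. Dropping an op whose sequence number is older than the last one applied at that location is one assumed way a target could enforce per-location ordering. It is not something defined by MPI or proposed in this thread. The point is the single-thread consistency Keith mentions: with (2), a program that writes x = 10 and then x = 20 always ends up with 20.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replace:
    """One hypothetical replace (accumulate with MPI_REPLACE) from a single origin."""
    seq: int    # issue order at the origin: 1, 2, 3, ...
    loc: int    # index of the target location (one basic-datatype element)
    value: int  # value to store


def apply_unordered(mem, arrivals):
    """No ordering guarantee: ops take effect as they arrive, so the last *arrival* wins."""
    for op in arrivals:
        mem[op.loc] = op.value
    return mem


def apply_per_location_ordered(mem, arrivals):
    """Option (2), modeled (an assumption) by dropping any replace older than the last
    one applied at the same location, so each location keeps its last-*issued* value."""
    last_seq = {}
    for op in arrivals:
        if op.seq > last_seq.get(op.loc, 0):
            mem[op.loc] = op.value
            last_seq[op.loc] = op.seq
    return mem


# The origin issues x = 10 (seq 1), then x = 20 (seq 2); the network delivers them
# reversed, e.g. because they went over different adapters or routes.
arrivals = [Replace(seq=2, loc=0, value=20), Replace(seq=1, loc=0, value=10)]
unordered = apply_unordered([0], arrivals)
ordered = apply_per_location_ordered([0], arrivals)
print(unordered[0], ordered[0])  # 10 20
```

Dropping stale ops is valid here only because a later replace fully overwrites an earlier one. A sequence whose outcome depends on earlier ops (for example, a mix of replace and sum on the same location) could not simply discard late arrivals. The target would have to hold them back and apply them in issue order instead.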

> -----Original Message-----
> From: Pavan Balaji [mailto:balaji at mcs.anl.gov]
> Sent: Saturday, June 05, 2010 9:33 PM
> To: Underwood, Keith D
> Cc: MPI 3.0 Remote Memory Access working group; bradc at cray.com
> Subject: Re: [Mpi3-rma] mpi3-rma post from bradc at cray.com requires approval
> 
> 
> Guaranteeing (1) in hardware is not easy when the message is split
> across two or more adapters, or two or more routes.
> 
> Apart from a theoretical use case, is there a real need for this?
> 
>   -- Pavan
> 
> On 06/05/2010 08:17 PM, Underwood, Keith D wrote:
> > I tend to agree that (2) is what is critical, but both (1) and (2) may
> > be important. The problem with not having (1) is that it becomes
> > significantly more expensive to figure out when a message has been
> > delivered. <shrug> That may be OK, but it may be a pain. Arguably, if
> > dropping (1) were important for a network, you could teach users to do
> > finer-grained accesses, so that each access was unordered relative to
> > the others.
> >
> > Anyway, the important point here is that it is MUCH harder to get any
> > of these back at the application level than it is to provide them at
> > the hardware level. If the API doesn't expose a given type of
> > ordering, you can't make an application "do the right thing" and count
> > on good hardware giving you that ordering, even if it is easy for the
> > hardware.
> >
> > I'll give a specific example: research on low-diameter networks has
> > indicated that you get relatively little actual reordering at the
> > endpoints when you route adaptively through a low-diameter network.
> > Given that, an endpoint could route adaptively and still provide
> > ordering at the API level. However, because not all hardware will do
> > that, an application would have to be written as if it had to restore
> > ordering whenever it needed it. This would suck beyond words...
> >
> > Keith
> >
> >> Thanks for listing these. If we are voting on this, my vote would be
> >> to have (2) and toss out (1) and (3).
> >>
> >>   -- Pavan
> >>
> >> On 06/05/2010 02:40 PM, Underwood, Keith D wrote:
> >>> I was only giving an example of how tightly ordering COULD be
> >>> defined. Ordering options include:
> >>> 1) Ordering within a given replace: is the first byte guaranteed to
> >>>    get there before the last?
> >>> 2) Ordering between replaces to a given location: but what if two
> >>>    replaces overlap?
> >>> 3) Ordering among all replaces to a given node
> >>>
> >>> Two-sided gives you something weird, in that it orders the matching
> >>> of the message headers, not the end of messages or the data within
> >>> the messages.
> >>> Keith
> >>>
> >>>> -----Original Message-----
> >>>> From: Pavan Balaji [mailto:balaji at mcs.anl.gov]
> >>>> Sent: Saturday, June 05, 2010 3:30 PM
> >>>> To: Underwood, Keith D
> >>>> Cc: MPI 3.0 Remote Memory Access working group; bradc at cray.com
> >>>> Subject: Re: [Mpi3-rma] mpi3-rma post from bradc at cray.com requires approval
> >>>>
> >>>>
> >>>> I see. My definition of ordering was a little different from yours.
> >>>> My definition was: if I do two accumulates with replace on the same
> >>>> location, I'm guaranteed to end up with the second value in that
> >>>> location. It didn't define any ordering across two different
> >>>> locations.
> >>>>
> >>>> So, I think we first need to reach consensus on what the actual
> >>>> definition of ordering is.
> >>>>
> >>>>   -- Pavan
> >>>>
> >>>> On 06/05/2010 02:22 PM, Underwood, Keith D wrote:
> >>>>>>> We would need to think about whether we have to have the whole
> >>>>>>> message ordered or ordered on a per-target-address basis.
> >>>>>> Atomicity and ordering go hand-in-hand; if there's no atomicity,
> >>>>>> ordering doesn't make sense. Since we have basic-datatype
> >>>>>> atomicity for accumulate/get_accumulate, ordering would make sense
> >>>>>> at that granularity as well.
> >>>>>>
> >>>>>> If someone wants to propose full-message atomicity, then we can
> >>>>>> consider ordering at that granularity too. But until then,
> >>>>>> whole-message ordering is overkill.
> >>>>> Well, they aren't orthogonal, but they aren't quite that tightly
> >>>>> linked. A user who knew that two messages were not going to overlap
> >>>>> might want to use full-message ordering from a single node for
> >>>>> completion detection. E.g., issue an MPI_Accumulate() with
> >>>>> "replace" to one buffer, then an MPI_Accumulate() to another buffer
> >>>>> that increments a variable, and rely on full-message ordering to use
> >>>>> the latter for completion without the expense of a flush() between
> >>>>> the messages. So, it has value and a usage scenario. I just don't
> >>>>> know whether we want to go that far.
> >>>>> Keith
> >>>> --
> >>>> Pavan Balaji
> >>>> http://www.mcs.anl.gov/~balaji
> >> --
> >> Pavan Balaji
> >> http://www.mcs.anl.gov/~balaji
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
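The completion-detection scenario in the thread above can be modeled in a few lines of Python. In that scenario, a replace into a buffer is followed by an increment of a flag, with no flush() in between. This is a toy sketch, not MPI code. `poll_after_each_op` and its arguments are hypothetical. Each accumulate is treated as landing in one piece, so option (1) is simply assumed. `ordered=True` stands for ordering of all accumulates from one origin to the target, modeled as the target applying them in issue order regardless of arrival order. The sketch shows that only with such ordering does seeing the flag imply the data is already in place.

```python
def poll_after_each_op(ops, arrival_order, ordered):
    """Hypothetical target-side model.

    ops:           list of (op, name, payload) tuples in the order the origin issued them.
    arrival_order: indices into ops, in the order the network delivers them.
    ordered:       if True, ops take effect in issue order (as if the target held back
                   early arrivals); if False, they take effect in arrival order.

    Returns the buffer a poller reads the moment it sees flag == 1, or None.
    """
    apply_order = sorted(arrival_order) if ordered else list(arrival_order)
    mem = {"buf": [0, 0, 0], "flag": 0}
    for i in apply_order:
        op, name, payload = ops[i]
        if op == "replace":      # stands in for MPI_Accumulate() with MPI_REPLACE
            mem[name] = list(payload)
        elif op == "sum":        # stands in for MPI_Accumulate() with MPI_SUM
            mem[name] += payload
        if mem["flag"] == 1:     # the poller sees the flag and reads the buffer now
            return mem["buf"]
    return None


# Issued: write the payload, then bump the completion flag, with no flush() between them.
ops = [("replace", "buf", (1, 2, 3)), ("sum", "flag", 1)]
flag_overtakes_data = [1, 0]  # e.g. the two messages took different routes

print(poll_after_each_op(ops, flag_overtakes_data, ordered=True))   # [1, 2, 3]
print(poll_after_each_op(ops, flag_overtakes_data, ordered=False))  # [0, 0, 0]
```

Without the ordering guarantee, the origin would need a flush() between the two accumulates (or some other acknowledgment) before the flag update could be trusted, which is the extra cost the thread is trying to avoid.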



