[Mpi3-rma] mpi3-rma post from bradc at cray.com requires approval

Pavan Balaji balaji at mcs.anl.gov
Sat Jun 5 20:33:11 CDT 2010

Guaranteeing (1) in hardware is not easy when the message is split 
across two or more adapters, or two or more routes.

Apart from a theoretical use case, is there a real need for this?

  -- Pavan

On 06/05/2010 08:17 PM, Underwood, Keith D wrote:
> I tend to agree that (2) is what is critical, but both (1) & (2) may be important.  The problem with not having (1) is that it gets significantly more expensive to figure out when a message has been delivered.  <shrug> that may be ok, but may be a pain.  Arguably, if not having (1) were important to a network, you could teach the users to do finer grained accesses such that each access was unordered relative to the others.
> Anyway, the one important point here is that it is MUCH harder to get any of these back at the application level than it is to provide them at the hardware level.  If the API doesn't expose a given type of ordering, you can't make an application "do the right thing" and count on good hardware giving you that ordering - even if it is easy for the hardware.  
> I'll give a specific example:  research on low-diameter networks has indicated that you get relatively little actual reordering at the end-points when you adaptively route through a low diameter network.  Given that, the end-point could adaptively route and still give you ordering at the API level; however, because not all hardware will do that, an application would have to be written as if it had to restore ordering when it needed it.  This would suck beyond words...
> Keith
>> Thanks for listing these. If we are voting for this, my vote would be
>> to have (2) and toss out (1) and (3).
>>   -- Pavan
>> On 06/05/2010 02:40 PM, Underwood, Keith D wrote:
>>> I was only giving an example of how tightly ordering COULD be
>>> defined.  Ordering options include:
>>> 1) Ordering within a given replace:  is the first byte guaranteed to
>>>    get there before the last?
>>> 2) Ordering between replaces to a given location:  but, what if two
>>>    replaces are overlapping?
>>> 3) Ordering among all replaces to a given node
>>> Two sided gives you something weird, in that it orders the matching
>>> of the message headers and not the end of messages or data within the
>>> messages.
>>> Keith
>>>> -----Original Message-----
>>>> From: Pavan Balaji [mailto:balaji at mcs.anl.gov]
>>>> Sent: Saturday, June 05, 2010 3:30 PM
>>>> To: Underwood, Keith D
>>>> Cc: MPI 3.0 Remote Memory Access working group; bradc at cray.com
>>>> Subject: Re: [Mpi3-rma] mpi3-rma post from bradc at cray.com requires
>>>> approval
>>>> I see. My definition of ordering was a little bit different from
>>>> yours.
>>>> My definition was -- if I do two accumulates with replace on the
>>>> same location, I'm guaranteed to have the second value in the
>>>> location. It didn't have any definition of ordering to two different
>>>> locations.
>>>> So, I think we need to come to a consensus first on what the actual
>>>> definition of ordering is.
>>>>   -- Pavan
>>>> On 06/05/2010 02:22 PM, Underwood, Keith D wrote:
>>>>>>> We would need to think about whether we have to have the whole
>>>>>>> message ordered or ordered on a per target address basis.
>>>>>> Atomicity and ordering go hand-in-hand; if there's no atomicity,
>>>>>> ordering doesn't make sense. Since we have basic datatype
>>>>>> atomicity for accumulate/get_accumulate, ordering would make sense
>>>>>> at that granularity as well.
>>>>>> If someone wants to propose full-message atomicity, then we can
>>>>>> consider ordering at that granularity too. But till then, whole
>>>>>> message ordering is an overkill.
>>>>> Well, they aren't orthogonal, but they aren't quite that tightly
>>>>> linked.  A user that knew that two messages were not going to
>>>>> overlap might want to use a full message ordering from a single
>>>>> node for completion detection.  E.g., issue an MPI_Accumulate()
>>>>> with "replace" to one buffer, then an MPI_Accumulate() to another
>>>>> buffer to increment a variable, and rely on the full message
>>>>> ordering so that the latter can be used for completion without the
>>>>> expense of a flush() between the messages.  So, it has value and a
>>>>> usage scenario.  I just don't know if we want to go that far or not.
>>>>> Keith
>>>> --
>>>> Pavan Balaji
>>>> http://www.mcs.anl.gov/~balaji
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
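
To make the ordering options in this thread concrete, here is a minimal toy model in Python. It is not MPI code and calls no MPI routines: the two-slot "window", the operation tuples, and the helper names (apply_op, data_seen_when_flag_set, final_value) are all hypothetical, chosen only for illustration. It shows two things, under the assumption that the target applies each operation atomically in the order it is delivered: option (2) is what makes "the second replace wins" true, and the completion-flag scenario is only safe if operations to different locations are also delivered in issue order.

```python
# Toy model only: a list stands in for the target window, and a tuple
# (kind, target, value) stands in for one accumulate-style operation.
DATA, FLAG = 0, 1  # hypothetical offsets into the window


def apply_op(window, op):
    """Apply one operation to the window, as the target side would."""
    kind, target, value = op
    if kind == "replace":
        window[target] = value
    elif kind == "sum":
        window[target] += value
    else:
        raise ValueError(f"unknown operation kind: {kind}")


def final_value(ops, order, target):
    """Deliver ops in the given order and return the final value at target."""
    window = [0, 0]
    for i in order:
        apply_op(window, ops[i])
    return window[target]


def data_seen_when_flag_set(ops, order):
    """Return DATA as a reader would see it when FLAG first becomes nonzero."""
    window = [0, 0]
    for i in order:
        apply_op(window, ops[i])
        if window[FLAG] != 0:
            return window[DATA]
    return None  # the flag was never set


# Option (2): two replaces to the same location.
two_replaces = [("replace", DATA, 1), ("replace", DATA, 2)]
print(final_value(two_replaces, (0, 1), DATA))  # delivered in issue order
print(final_value(two_replaces, (1, 0), DATA))  # delivered reordered

# Completion scenario: replace into a buffer, then increment a flag.
ops = [("replace", DATA, 42), ("sum", FLAG, 1)]
in_order = data_seen_when_flag_set(ops, (0, 1))
reordered = data_seen_when_flag_set(ops, (1, 0))
print(in_order, reordered)
```

In this model, the reordered delivery lets the reader observe the flag before the data has landed, which is the case a flush() between the two operations would otherwise have to rule out.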

Pavan Balaji
