[mpi3-coll] Nonblocking collectives standard draft

Fri Nov 14 10:08:14 CST 2008

Hi Torsten,

> Hello Christian,
> thanks for the review of the first draft.
My pleasure!

> All comments that I deleted are fixed,
great

> but I'd like to open the discussion of others up.
of course

>> btw. all line numbers within the MPI standard are not really line 
>> numbers (they have equal distances and as soon as there is another font 
>> size - like for the interface definitions - then you end up anywhere...)
>>   It's like "look at page X at _centimetre_ Y".
> yes, I sent a note about this to the MPI-2.1 mailing list a while ago,
> but saw no reaction.
too bad :-(

>> p 49, l 46: it is not enough to state "receive buffer must not be read"
>>   it should be forbidden to access the receive buffer in any way
>>   (imagine an implementation that uses the buffer to forward something)
> Right, I changed this to be similar to MPI-2.1 p52 line 4-7. Now it
> reads "must not be accessed".
much better

>> p 50, l 7: we should discuss MPI_REQUEST_FREE (again) - I talked to some
>>   colleagues and we see application scenarios where this might be
>>   useful (e.g., if only a subgroup participates in an irregular
>>   collective); because this is a local operation, we also don't
>>   see any problems with generally allowing MPI_REQUEST_FREE for
>>   nonblocking collective operations
> I personally don't like request free
I also don't like MPI_REQUEST_FREE - but nevertheless this function is 
already in the MPI standard. Purely subjective matters should not be 
used for such decisions as long as there are good objective ones. So as 
long as this function is not deprecated for the whole MPI standard, 
we are essentially obliged to extend its definition to nonblocking 
collective operations as well.
> because it makes thread-safe implementations really hard.
Can you please elaborate on this issue? I currently don't see any 
problems related to threads...
> Also, I don't see a good portable way to check/ensure that the
> operation completed on the receiver side (we had this
> discussion before on the MPI-2.1 ML I think). Do you know one?
Just because we can't see a way at this time does not mean that no way 
exists. Message passing typically consists of two parts: data 
transmission and synchronization. If the synchronization is already done 
through a side channel, then we don't need to duplicate this work. An 
example is a scenario where the master broadcasts some tasks to the 
slaves and then gathers the final results. We don't necessarily need the 
synchronization part (i.e. MPI_Wait) for the broadcast because once the 
master receives all results it is clear that all tasks have been 
transmitted to the slaves. Another example is a barrier where only 
half of the processes want to participate - so instead of creating a new 
communicator (expensive), we could simply call MPI_REQUEST_FREE (likely 
much cheaper) at the uninterested ranks. The other examples with the 
irregular collectives where some processes supply zero-size entries 
still hold. They can also free the request without waiting for any 
completion. I don't recommend doing it as in these examples, but they 
show that there _might_ be certain use-cases. I don't like to forbid 
something that is already defined without any good reason - especially 
since this would increase the number of exceptions in MPI which the user 
must be aware of.

>> p 50, l 19: maybe we should be even more precise and add after 
>> "environments." something like "It is therefore erroneous to start a set 
>> of nbc in different logical orders on the participating ranks."
> we didn't really define "different logical order" anywhere. This is why
> we decided to go by examples.
ok
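
Since "different logical order" is not defined anywhere in the draft, here is a minimal sketch of one possible reading, not a definition from the standard: for each communicator, every participating rank must start the same collectives in the same sequence. The function names, the communicator labels (comm_a, comm_b) and the operation labels are all hypothetical, and this reading only compares the order within one communicator; it says nothing about interleaving across communicators.

```python
from collections import defaultdict


def per_comm_order(calls):
    """Group one rank's call sequence by communicator, preserving order."""
    order = defaultdict(list)
    for comm, op in calls:
        order[comm].append(op)
    return dict(order)


def consistent_order(calls_by_rank, members):
    """Return True if, for every communicator, all member ranks start
    the same collectives in the same order (assumed reading of the rule)."""
    for comm, ranks in members.items():
        sequences = [per_comm_order(calls_by_rank[r]).get(comm, [])
                     for r in ranks]
        if any(seq != sequences[0] for seq in sequences[1:]):
            return False
    return True


# Hypothetical setup: rank 1 belongs to both communicators.
members = {"comm_a": [0, 1], "comm_b": [1, 2]}

# Same order on comm_a at ranks 0 and 1.
ok = {
    0: [("comm_a", "ibcast"), ("comm_a", "ibarrier")],
    1: [("comm_b", "iallreduce"), ("comm_a", "ibcast"), ("comm_a", "ibarrier")],
    2: [("comm_b", "iallreduce")],
}

# Rank 1 swaps the two operations on comm_a.
bad = {
    0: [("comm_a", "ibcast"), ("comm_a", "ibarrier")],
    1: [("comm_a", "ibarrier"), ("comm_a", "ibcast"), ("comm_b", "iallreduce")],
    2: [("comm_b", "iallreduce")],
}
```

Under this reading, `consistent_order(ok, members)` holds and `consistent_order(bad, members)` does not, because rank 1 starts the barrier before the broadcast on comm_a while rank 0 does the opposite.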

>>   The sentence in l 20 ("The behavior...") can and should be deleted.
>>   "Nbc operations and bc operations [+are totally unrelated and ]
>>    do not match each other."
> addition not made ("totally unrelated") - I think we shouldn't say this
> since they perform very similar operations and can be implemented on top
> of each other.
maybe - however, it might be a good idea to provide a hint to the reader 
_why_ we decided that bc and nbc don't match each other. That they can 
(and for performance reasons should) be implemented differently is such 
a (hopefully) understandable hint.

>> several locations: "memory movement" should be replaced by "data movement"
>>   (but "movement" sounds bad anyway because you say
>>    "after completion" - maybe "data placement" would be
>>    more appropriate)
> the MPI standard says for example "The ``in place'' operations are
> provided to reduce unnecessary memory motion". I agree that it sounds
> strange, but it seems to be used like this in MPI terms. But I am
> open to change it.
``in place'' seems like something else to me because it specifies purely 
local semantics whereas collectives define remote correlations. But this 
seems more like a question for language experts or at least native 
speakers - so I'll leave this open from my side and hope that someone 
else jumps in ...

>> p 61, l 31.8 & p 62, l 27.5: "after it completion" -> "after completion"?
> I don't see a mistake there. It says "after it completed" in the text.
yes, sorry for the typo - it just sounds "bad"

>> btw. there seems to be a potential naming problem for beginners:
>>   "Exclusive Scan -> Exscan" && "Inclusive Scan -> Iscan"???
> Hmm, I wouldn't think so, but we can discuss this.
This is just a minor issue and can possibly be neglected.

>> p 69, l 41: "in SPMD style programs" seems superfluous to me
>>   but "double buffering" could also be added as an example ...
> it's not superfluous. One can also implement task-parallel or other
> programming paradigms with MPI.
I agree that there are other programming paradigms. But why can 
pipelining techniques only be used in SPMD style programs? I know of at 
least one application (similar to POP) that uses multiple programs but 
nevertheless uses MPI collective operations. And it's not unlikely that 
it could also benefit from NBC by applying pipelining/double buffering 
techniques.

> I added double-buffering (even though
> this is just a weaker form of pipelining).
good point
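
To make the double-buffering pattern concrete, here is a minimal sketch that does not use MPI at all: a thread pool stands in for a nonblocking collective, and the helper names (start_collective, compute, double_buffered) are hypothetical. It only illustrates the control flow, in which the next buffer is filled while the previous operation is still in flight, and nothing in it depends on the program being SPMD.

```python
from concurrent.futures import ThreadPoolExecutor


def start_collective(pool, buf):
    """Hypothetical stand-in for starting a nonblocking collective.
    Returns a handle immediately; the 'collective' here just sums buf."""
    return pool.submit(sum, buf)


def compute(step):
    """Hypothetical local work that produces the data for one step."""
    return [step, step + 1, step + 2]


def double_buffered(steps):
    """Overlap local work on one buffer with an operation on the other."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        buffers = [None, None]   # the two buffers alternate roles
        pending = None           # handle of the operation in flight
        for step in range(steps):
            cur = step % 2
            # Fill the buffer that is NOT in flight; its last operation
            # (two steps ago) was already waited for in the previous step.
            buffers[cur] = compute(step)
            if pending is not None:
                results.append(pending.result())   # analogue of a wait call
            pending = start_collective(pool, buffers[cur])
        if pending is not None:
            results.append(pending.result())       # drain the last operation
    return results
```

With more than two buffers and more than one operation in flight, the same loop becomes a deeper pipeline, which is the sense in which double buffering is the weaker form.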

>> Example 5.33 seems to be very dangerous for at least two reasons:
>>   2) this doesn't make it a "global" allreduce because
>>      (depending how sbuf and rbuf are initialized) all
>>      seem to calculate different results
>>      (0+0+1+2 vs. 0+1+1+2 vs. 1+2+0+2)
> exactly - that's what they want to do.
Then I didn't understand the example. To me the text suggested that this 
approach might be used to simulate a global collective operation 
without building a combined communicator...

> Everybody wants to reduce in two communicators.
... ok, then it's fine

> This tiny example is probably not very useful,
that was the problem ;-)

>  but easy to understand.
If an example doesn't make any sense, then it is anything but easy 
to understand. The example would be more helpful if it were genuinely 
useful instead of confusing.

>> What do we want to have (either or):
>>   a) the same ordering over both blocking and nonblocking collectives
>>      (suggested by Example 5.29)
>>   b) blocking and nonblocking collectives are completely independent
>>      (suggested by "reserved tag-space" p 50, l 15)
>> Upon a decision, the related text needs some clarification.
> we decided to go with a). The advice to implementors talks only about
> non-blocking point-to-point when it mentions the reserved tag space.
ok (this important point should be stated more clearly somewhere)

I also noted that there is currently no statement that describes how an 
implementation has to deal with the MPI_STATUS object.

> Thank you very much for your review!
Thanks for your draft!

Have a nice weekend.

    Christian

-- 
Christian Siebert, Dipl.-Inf.               Research Associate

            NEC Laboratories Europe, NEC Europe Ltd.
        Rathausallee 10, D-53757 Sankt Augustin, Germany

Phone: +49 (0) 2241 - 92 52 44    Fax: +49 (0) 2241 - 92 52 99

  (Registered Office: 1 Victoria Road, London W3 6BL, 2832014)