[mpi3-coll] Nonblocking collectives standard draft

Fri Nov 14 10:55:08 CST 2008

Hi Christian,
> >>p 50, l 7: we should discuss MPI_REQUEST_FREE (again) - I talked to some
> >>  colleagues and we see application scenarios where this might be
> >>  useful (e.g., if only a subgroup participates in an irregular
> >>  collective); because this is a local operation, we also don't
> >>  see any problems by generally allowing MPI_REQUEST_FREE for
> >>  nonblocking collective operations
> >I personally don't like request free
> I also don't like MPI_REQUEST_FREE - but nevertheless this function is 
> already in the MPI standard. Purely subjective matters should not be 
> used for such decisions as long as there are good objective ones. So as 
> long as this function will not be deprecated for the whole MPI standard, 
> we are primarily forced to extend its definition also for nonblocking 
> collective operations.
I never decided anything subjective. We had this discussion three times
on our telecons and nobody spoke up for it. I also asked the Forum twice
and nobody wanted to "rescue" request-free.

> >because it makes thread-safe implementations really hard.
> Can you please elaborate on this issue. I currently don't see any 
> problems related to threads...
I will explain offline (something higher bandwidth - e.g., phone). If
you're really interested, look at this year's EuroPVM paper about thread
safety.

> >Also, I don't see a good portable way to check/ensure that the
> >operation completed on the receiver side (we had this
> >discussion before on the MPI-2.1 ML I think). Do you know one?
> Just because we can't see a way at this time does not mean that no way 
> exists. 
that sounds rather random :). Would this not also be an argument for
MPI_Flush_toilet :).

> Message passing typically consists of two parts: data 
> transmission and synchronization. If the synchronization is already done 
> through a side channel, then we don't need to duplicate this work. 
Yes, but MPI is supposed to be portable, also to machines that are not
cache coherent and/or use heavy message buffering. See the previous
discussion for examples that break request-free.

> An example is a scenario where the master broadcasts some tasks to the
> slaves and then gathers the final results. We don't necessarily need
> the synchronization part (i.e. MPI_Wait) for the broadcast because
> once the master receives all results it is clear that all tasks have
> been transmitted to the slaves. 
ok, that could/would work I think. Savings to avoid the local completion
check would be minimal. Note that the synchronization is only local
because the gather will synchronize anyway.

> Another example includes a barrier where only 
> half of the processes want to participate - so instead of creating a new 
> communicator (expensive), we could simply call MPI_REQUEST_FREE (likely 
> much cheaper) at the uninterested ranks. 
no, they would still need to participate before the barrier completes,
i.e., you would introduce unnecessary synchronization. And the
programming style seems awful. We should discourage such usage.

> The other examples with the irregular collectives where some processes
> supply zero-size entries still hold. They can also free the request
> without waiting for any completion. 
same as the barrier - but yes, it could be valid. I don't want to think
of the effects of missing progress (i.e., if one of those "freed" nodes 
does not call MPI for another 30 minutes, all others might wait).

> I don't recommend to do it like in these examples, but they show that
> there _might_ be certain use-cases. 
Programmers will use everything that you give them. The design of a good
interface also means that pitfalls and bad side-effects are avoided.

> I don't like to forbid something that is already defined without any
> good reason - especially since this would increase the number of
> exceptions in MPI which the user must be aware of.
that's a good argument. I think we should reconsider deprecating
request-free unless we have a really good use-case and not just "there
might be people who use it" - that does not mean anything. I'm sure
there are still people who use goto to implement for loops :).

I added a revisit of the discussions about request-free to the next
meeting agenda.

> >>  The sentence in l 20 ("The behavior...") can and should be deleted.
> >>  "Nbc operations and bc operations [+are totally unrelated and ]
> >>   do not match each other."
> >addition not made ("totally unrelated") - I think we shouldn't say this
> >since they perform very similar operations and can be implemented on top
> >of each other.
> maybe - however, it might be a good idea to provide a hint to the reader 
> _why_ we decided that bc and nbc don't match each other. That they can 
> (and for performance reasons should) be implemented differently is such 
> a (hopefully) understandable hint.
ok, that sounds reasonable. I added a rationale and advice to users
to the draft.

> >>several locations: "memory movement" should be replaced by "data movement"
> >>  (but "movement" sounds bad anyway because you say
> >>   "after completion" - maybe "data placement" would be
> >>   more appropriate)
> >the MPI standard says for example "The ``in place'' operations are
> >provided to reduce unnecessary memory motion". I agree that it sounds
> >strange, but it seems to be used like this in MPI terms. But I am
> >open to change it.
> ``in place'' seems like something else to me because it specifies purely 
> local semantics whereas collectives define remote correlations. But this 
> seems more like a question for language experts or at least native 
> speakers - so I'll leave this open from my side and hope that someone 
> else jumps in ...
I added it to the agenda for the next meeting.

> >>p 61, l 31.8 & p 62, l 27.5: "after it completion" -> "after completion"?
> >I don't see a mistake there. It says "after it completed" in the text.
> yes, sorry for the typo - it just sounds "bad"
could you elaborate or propose a rephrasing?

> >>p 69, l 41: "in SPMD style programs" seems superfluous to me
> >>  but "double buffering" could also be added as an example ...
> >it's not superfluous. One can also implement task-parallel or other
> >programming paradigms with MPI.
> I agree that there are other programming paradigms. But why can 
> pipelining techniques only be used in SPMD style programs? I know of at 
> least one application (similar to POP) that uses multiple programs but 
> nevertheless uses MPI collective operations. And it's not unlikely that 
> it could also benefit from NBC by applying pipelining/double buffering 
> techniques.
the text does not exclude other uses. It just serves as an
easy-to-understand example, I think.

> >>Example 5.33 seems to be very dangerous for at least two reasons:
> >>  2) this doesn't make it a "global" allreduce because
> >>     (depending how sbuf and rbuf are initialized) all
> >>     seem to calculate different results
> >>     (0+0+1+2 vs. 0+1+1+2 vs. 1+2+0+2)
> >exactly - that's what they want to do.
> Then I didn't understand the example. To me the text suggested that this 
>  approach might be used to simulate a global collective operation 
> without building a combined communicator...
no, that would be useless. I just want to demonstrate collectives on
overlapping communicators. This was requested by Rolf during the last
meeting, and I like the idea. But we can work on the example if you have
issues with it.

> >This tiny example is probably not very useful,
> that was the problem ;-)
feel free to propose something more useful :).

> > but easy to understand.
> If an example doesn't make any sense, then it can also be anything but 
> easy to understand. The example could be more helpful if it would be 
> really useful instead of confusing.
see above

> >>What do we want to have (either or):
> >>  a) the same ordering over both blocking and nonblocking collectives
> >>     (suggested by Example 5.29)
> >>  b) blocking and nonblocking collectives are completely independent
> >>     (suggested by "reserved tag-space" p 50, l 15)
> >>Upon a decision, the related text needs some clarification.
> >we decided to go with a). The advice to implementors talks only about
> >non-blocking point-to-point when it mentions the reserved tag space.
> ok (this important point should be stated more clearly somewhere)
I extended lines 17-22 on page 50.

> I also noted that there is currently no statement that describes how an 
> implementation has to deal with the MPI_STATUS object.
yes, this is on the agenda for the next meeting/telecon.

I will upload a new version later today.

Thanks & Best,
  Torsten

-- 
 bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
Torsten Hoefler       | Postdoctoral Researcher
Open Systems Lab      | Indiana University    
150 S. Woodlawn Ave.  | Bloomington, IN, 474045, USA
Lindley Hall Room 135 | +01 (812) 855-3608