[Mpi3-subsetting] MPI subsetting: charting the way forward at a telecon next week?

Richard Treumann treumann at [hidden]
Fri Jun 20 08:56:43 CDT 2008


Hi Alexander

Comments embedded below.

I have no objections to someone providing a rationale for assertions
related to MPI-IO and MPI_1sided.  If the rationale is sound I have no
objection to putting them in the proposal.

I feel the proposal should be evaluated by the following algorithm.

if (this concept is one that seems plausible) {
    for each proposed assertion {
        if (rationale not solid)
            discard
        if (deal breaker downside)
            discard
    }
}
if ((concept makes sense) && (set of worthwhile assertions is not empty))
    make this part of MPI 2.2

I do not see much reason to get every assertion that eventually gains
traction into MPI 2.2.  MPI 3.0 is soon enough for any that do not make the
MPI 2.2 cut. I do not want to see the concept fall because some particular
assertion is controversial.

I consider MPI_NO_EAGER_THROTTLE to be the single most valuable assertion
for MPI 2.2 because it is needed to allow MPI to scale to the levels we are
already seeing.

Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363

mpi3-subsetting-bounces_at_[hidden] wrote on 06/20/2008 02:58:41 AM:

> Dear Dick,
>
> A couple of suggestions re your proposal:
>
> - If ASSERTIONS is put at the end of the MPI_INIT_ASSERTED argument
> list, in C++ one can declare the last argument as having a zero
> default value, and skip it if necessary. This might help with
> deprecation of the earlier MPI_INIT_* calls.

I have no objection. It seems reasonable to let C++ default the
assertions parameter to "none".

> - In non-Cray parts of the world, an MPI_INT followed by MPI_FLOAT
> is likely to be a 4-byte int followed by a 4-byte float. This
> sometimes depends on the compiler settings in effect, too.

My rationale is not specific to any particular architecture.
Some MPI datatypes are made entirely
from the same base type. Some are mixtures of types. If libmpi knows
at the moment a datatype is committed that the send side and receive
side will always use the same internal representations then it does not
need to keep track of the fact that one instance of {MPI_INT,MPI_FLOAT}
has two distinct parts. The send side can gather and ship 8 bytes
and the receive side can scatter the 8 bytes. If one side might use 4
byte integers while the other side uses 8 byte integers then at
least one side will need to know there is a conversion to be done for
the MPI_INT part. If an MPI job does a spawn or join that links to a
different architecture after the datatype has been committed, and
the MPI_Type_commit has discarded the details, it is too late to get
them back.  On the other hand, if it is known there will never be a
different architecture added to the job, the extra information can be
safely discarded.
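
To make the commit-time tradeoff concrete, here is a minimal sketch in Python. It is a toy model, not how any real libmpi stores datatypes: the names LOCAL_SIZES, Committed, commit and the flag homogeneous_forever are hypothetical, and the 4-byte sizes are only an assumption for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical base-type sizes on the local architecture (illustrative only).
LOCAL_SIZES = {"MPI_INT": 4, "MPI_FLOAT": 4}


@dataclass
class Committed:
    # Each entry is (kind, nbytes). kind is "bytes" once type details are discarded.
    layout: List[Tuple[str, int]]


def commit(fields, homogeneous_forever):
    """Toy stand-in for a datatype commit over contiguous fields.

    If the job asserts that no different architecture will ever join,
    the per-field type information is dropped and only a byte count is kept.
    Otherwise every field keeps its type so a conversion remains possible.
    """
    typed = [(name, LOCAL_SIZES[name]) for name in fields]
    if homogeneous_forever:
        return Committed([("bytes", sum(n for _, n in typed))])
    return Committed(typed)
```

With the assertion, {MPI_INT,MPI_FLOAT} collapses to one 8 byte run. Without it, both parts are kept, and nothing in the collapsed form could be used to recover them later.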

> - I don't think MPI_NO_THREAD_CONTENTION is really necessary. The
> original thread level settings, in particular, the use of anything
> but MPI_THREAD_MULTIPLE, seem to capture the semantics that you proposed.

This one is kind of tricky and I also am not sure what it would mean. If
we find a clear value we can keep it and if not we can remove it.

> - I can't fully follow the motivation for MPI_NO_ANY_SOURCE
> deprioritization. AFAIK, a rendezvous exchange usually starts with a
> ready-to-send packet that contains the size of the message. In this
> case the receiving side will normally reply with a ready-to-receive
> regardless of the buffer space available, and flag MPI_ERR_TRUNCATE
> on message arrival if necessary. In this case, neither
> MPI_ANY_SOURCE nor MPI_NO_ANY_SOURCE seems to get in the way.

My point is that MPI_NO_ANY_SOURCE might allow this round trip
protocol to be replaced by a 1/2 rendezvous protocol. If it is known
that MPI_ANY_SOURCE will not be used then the receive side can send
an "envelope and ready for data" packet to the send side. As long as
the send side knows it will receive the "envelope and ready for data"
packet when the receive is posted, it does not need to do the first 1/2
of the rendezvous. The message matching can be done at the send side.

A send for which the receive was preposted has a
good chance of finding the "envelope and ready for data" packet sitting
in an early queue and the large send can avoid any rendezvous delay.
Data begins to flow immediately vs waiting for the round trip of a
full rendezvous. In many cases we cut the delay in half and in the best
case we eliminate rendezvous delay completely. If the receive side
is late in posting the receive we still save a packet traversal but
do not save any time.

If there may be an MPI_ANY_SOURCE then this does not work because the
receive side that has an MPI_ANY_SOURCE cannot guess which sender to
notify so the sender cannot count on getting a 1/2 rendezvous
notification for a message that should match the MPI_ANY_SOURCE
receive.
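
The send-side matching described above can be sketched as a toy model in Python. Everything here is hypothetical and for illustration only: Rank, post_receive, start_large_send and the ANY_SOURCE stand-in are not part of any MPI library, and real packets, queues and tags are reduced to tuples in a list.

```python
ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE in this toy model


class Rank:
    def __init__(self, rank):
        self.rank = rank
        # "envelope and ready for data" notices that have arrived at this rank
        self.notices = []


def post_receive(receiver, source, tag, ranks):
    """Post a receive. Returns True if a notice could be sent to the sender."""
    if source == ANY_SOURCE:
        # The receive side cannot guess which sender to notify.
        return False
    ranks[source].notices.append((receiver.rank, tag))
    return True


def start_large_send(sender, dest, tag):
    """Match at the send side against notices that have already arrived."""
    envelope = (dest, tag)
    if envelope in sender.notices:
        sender.notices.remove(envelope)
        return "send now"         # receive was preposted, no rendezvous delay
    return "wait for notice"      # one packet traversal instead of a round trip
```

For example, after rank 1 posts a receive naming rank 0, a large send from rank 0 finds the notice and returns "send now". A receive posted with ANY_SOURCE produces no notice at all, which is why the sender cannot rely on the scheme in that case.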

The problem that made me lower the priority is that many MPIs use an
eager protocol for small messages and a rendezvous protocol for large
messages.  If the send side and receive side have the same size buffer
then both sides can reach the same conclusion: eager vs 1/2 rendezvous.
If both decide on eager, the receive side will not send an
"envelope and ready for data" packet and the send side will not look
for one. If both sides decide on 1/2 rendezvous then the receive side
will send an "envelope and ready for data" packet and the send side will
look for and consume the notice.  If the send is for an 8 byte
message and the receive posts a "big enough" buffer of 64KB
then the two sides will probably not be able to reach the same
conclusion about the protocol. The receive side will ship off an
"envelope and ready for data" packet that the send side will not
know what to do with.
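
The mismatch is easy to see in a small Python sketch. The 4096 byte EAGER_LIMIT and the function names are assumptions made up for this example. Real implementations pick their own thresholds.

```python
EAGER_LIMIT = 4096  # hypothetical eager/rendezvous threshold in bytes


def choose_protocol(nbytes, eager_limit=EAGER_LIMIT):
    """Each side decides from the only size it can see locally."""
    return "eager" if nbytes <= eager_limit else "half-rendezvous"


def sides_agree(send_bytes, recv_buffer_bytes):
    """True if sender (message size) and receiver (buffer size) pick the same protocol."""
    return choose_protocol(send_bytes) == choose_protocol(recv_buffer_bytes)
```

With equal sizes on both sides, sides_agree(8, 8) and sides_agree(65536, 65536) both hold. With an 8 byte send into a 64KB buffer, sides_agree(8, 64 * 1024) is False: the sender picks eager while the receiver picks 1/2 rendezvous and sends a notice nobody is waiting for.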

>
> Best regards.
>
> Alexander
>
> From: Supalov, Alexander
> Sent: Friday, June 20, 2008 8:29 AM
> To: 'MPI 3.0 Sub-setting working group'
> Subject: RE: [Mpi3-subsetting] MPI subsetting: charting the way
> forward at a telecon next week?

> Dear Dick,
>
> Thank you. I remember we exchanged a couple of emails about the
> possible extensions to the set of assertions, like one-sided and
> I/O, and in my recollection, almost reached an agreement that this
> can improve performance and possibly memory footprint, as well as be
> expressed thru assertions. Do you still feel favorable about this?
>
> Best regards.
>
> Alexander
>




