[Mpi-forum] Another pre-proposal for MPI 2.2 or 3.0

Tue Apr 22 12:38:03 CDT 2008

I have a proposal for providing information to the MPI implementation at
MPI_INIT time to allow certain optimizations within the run.  This is not a
"hints" mechanism because it does change the semantic rules for MPI in the
job run.  A correct "vanilla" MPI application could give different results
or fail if faulty information is provided.

I am interested in what the Forum members think about this idea before I
try to formalize it.

I will state up front that I am a skeptic about most of the MPI Subset
goals I hear described.  However, I think this is a form of subsetting I
would support.  I say "I think" because it is possible we will find serious
complexities that would make me back away. If this looks as
straightforward as I expect, perhaps we could look at it for MPI 2.2.  The
most basic valid implementation of this is a small amount of work for an
implementer (well within the scope of the MPI 2.2 effort and policy).

==========================================================================================

The MPI standard has a number of thorny semantic requirements that a
typical program does not depend on, and an MPI implementation may pay a
performance penalty to guarantee them. A standards-defined mechanism that
lets the application explicitly let libmpi off the hook at MPI_Init
time on the ones it does not depend on may allow better performance in some
cases.  This would be an "assert" rather than a "hints" mechanism because
it would be valid for an MPI implementation to fail a job that depends on
an MPI feature but lets libmpi off the hook on it at the MPI_Init call. In
most, but not all, of these cases the MPI implementation could easily give
an error message if the application did something it had promised not to
do.

Here is a partial list of sometimes troublesome semantic requirements.

1) MPI_CANCEL on MPI_ISEND probably cannot be correctly supported without
adding a message ID to every message sent. Using space in the message
header adds cost and may be a complete waste for an application that never
tries to cancel an ISEND. (If there is a cost for being prepared to cancel
an MPI_RECV we could cover that too.)

2) MPI_Datatypes that define a contiguous buffer can be optimized if it is
known that there will never be a need to translate the data between
heterogeneous nodes.   An array of structures, where each structure is an
MPI_INT followed by an MPI_FLOAT, is likely to be contiguous.  An MPI_SEND
of count==100 can bypass the datatype engine and be treated as a send of
800 bytes if the destination has the same data representations.  An MPI
implementation that "knows" it will not need to deal with data conversion
can simplify the datatype commit and internal representation by discarding
the MPI_INT/MPI_FLOAT data and just recording that the type is 8 bytes with
a stride of 8.

3) The MPI standard either requires or strongly urges that an
MPI_REDUCE/MPI_ALLREDUCE give exactly the same answer every time.  It is
not clear to me what that means. If it means a portable MPI like MPICH or
OpenMPI must give the same answer whether run on an Intel cluster, an IBM
Power cluster or a BlueGene then I would bet no MPI in the world complies.
If it means Version 5 of an MPI must give the same answer Version 1 did, it
would prevent new algorithms. However, if it means that two "equivalent"
reductions in a single application run must agree then perhaps most MPIs
comply. Whatever it means, there are applications that do not need any
"same answer" promise as long as they can assume they will get a "correct"
answer. Maybe they can be provided a faster reduction algorithm.

4) MPI supports persistent send/recv, which could allow some optimizations
in which half rendezvous, pinned memory for RDMA, knowledge that both sides
are contiguous buffers, etc. can be leveraged.  The ability to do this is
damaged by the fact that the standard requires a persistent send to match a
normal receive and a normal send to match a persistent receive.  The MPI
implementation cannot make any assumptions that a matching send_init and
recv_init can be bound together.

5) Perhaps MPI pt2pt communication could use a half rendezvous protocol if
it were certain no receive would use MPI_ANY_SOURCE.  If all receives will
use an explicit source then libmpi can have the receive side send a notice
to the send side that a receive is waiting.  There is no need for the send
side to ship the envelope and wait for a reply that the match is found.  If
MPI_ANY_SOURCE is possible then the send side must always start the
transaction. (I am not aware of an issue with MPI_ANY_TAG but maybe
somebody can think of one.)

6) It may be that an MPI implementation that is ready to do a spawn or join
must use a more complex matching/progress engine than it would need if it
knew the set of connections & networks it had at MPI_Init could never be
expanded.

7) The MPI standard allows a standard send to use an eager protocol but
requires that libmpi promise every eager message can be buffered safely.
The MPI implementation must fall back to rendezvous protocol when the
promise can no longer be kept. This semantic can be expensive to maintain
and produces serious scaling problems. Some applications depend on this
semantic but many, especially those designed for massive scale, work in
ways that ensure libmpi does not need to throttle eager sends. The
applications pace themselves.

8) The requirement that the multi WAIT/TEST functions accept mixed arrays
of MPI_Requests. The multi WAIT/TEST routines may need special handling in
case someone passes both Isend/Irecv requests and MPI_File_ixxx requests to
the same MPI_Waitany, for example. I bet applications seldom do this, but
it is allowed and must work.

9) Would an application promise not to use MPI-IO allow any MPI to do an
optimization?

10) Would an application promise not to use MPI-1sided allow any MPI to do
an optimization?

11) What others have I not thought of at all?
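
To make item 3 concrete, here is a minimal sketch in plain C (no MPI calls) of why a "same answer" promise constrains the reduction algorithm: floating-point addition is not associative, so combining the same contributions in a different order can change the result. The function names are made up for illustration and do not describe how any MPI implementation orders its reductions.

```c
#include <stddef.h>

/* Combine contributions left to right: ((v[0] + v[1]) + v[2]) + ... */
double sum_left_to_right(const double *v, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Combine the same contributions right to left: v[0] + (v[1] + (v[2] + ...)) */
double sum_right_to_left(const double *v, size_t n)
{
    double acc = 0.0;
    for (size_t i = n; i > 0; i--)
        acc += v[i - 1];
    return acc;
}
```

With the three contributions 1e20, -1e20 and 1.0, the left-to-right order yields 1.0 while the right-to-left order yields 0.0, because the 1.0 is absorbed when it is added to -1e20 first. An application that only needs a "correct" answer can tolerate either order; one that needs bitwise agreement cannot.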

I want to make it clear that none of these MPI_Init time assertions should
require an MPI implementation that provides the assert-ready MPI_Init to
work differently. For example, the user assertion that her application does
not depend on a persistent send matching a normal receive or a normal send
matching a persistent receive does not require the MPI implementation to
suppress such matches.  It remains the user's responsibility to create a
program that will still work as expected on an MPI implementation that does
not change its behavior for any specific assertion.

For some of these it would not be possible for libmpi to detect that the
user really is depending on something he told us we could shut off.

The interface might look like this:
    int MPI_Init_thread_xxx(int *argc, char *((*argv)[]), int required,
                            int *provided, int assertions);

mpi.h would define constants like this:

#define MPI_NO_SEND_CANCELS         0x00000001
#define MPI_NO_ANY_SOURCE           0x00000002
#define MPI_NO_REDUCE_CONSTRAINT    0x00000004
#define MPI_NO_DATATYPE_XLATE       0x00000010
#define MPI_NO_EAGER_THROTLE        0x00000020
etc.

The set of valid assertion flags would be specified by the standard as
would be their precise meanings.  It would always be valid for an
application to pass 0 (zero) as the assertions argument.  It would always
be valid for an MPI implementation to ignore any or all assertions. With a
32-bit integer for assertions, we could define the interface in MPI 2.2 and
add more assertions in MPI 3.0 if we wanted to.  We could consider a 64-bit
assert to keep the door open but I am pretty sure we can get by with 32
distinct assertions.
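
As a rough sketch of the "ignore any or all assertions" rule, an implementation could simply mask the incoming argument against the set of assertions it chooses to act on. The flag values below are copied from the constants proposed in this message (no real mpi.h defines them), and LIBMPI_HONORED_ASSERTIONS, libmpi_accept_assertions and libmpi_needs_message_id are names invented for this illustration only.

```c
/* Flag values as proposed in this message; hypothetical, not in any real mpi.h. */
#define MPI_NO_SEND_CANCELS    0x00000001
#define MPI_NO_ANY_SOURCE      0x00000002
#define MPI_NO_DATATYPE_XLATE  0x00000010

/* Assumption: this imaginary libmpi only optimizes for these two promises. */
#define LIBMPI_HONORED_ASSERTIONS (MPI_NO_SEND_CANCELS | MPI_NO_ANY_SOURCE)

/* Keep the assertions this library acts on and silently drop the rest.
 * Passing 0 is always valid and simply selects full MPI semantics. */
int libmpi_accept_assertions(int assertions)
{
    return assertions & LIBMPI_HONORED_ASSERTIONS;
}

/* Example consumer (item 1): a message ID in the header is only needed
 * when the application may still cancel an ISEND. */
int libmpi_needs_message_id(int honored)
{
    return (honored & MPI_NO_SEND_CANCELS) == 0;
}
```

Because unknown or unsupported bits are dropped rather than rejected, an application built against a later standard with more assertions would still start correctly on this library; it would just see no benefit from the extra promises.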

An application call would look like:

    MPI_Init_thread_xxx(0, 0, MPI_THREAD_MULTIPLE, &provided,
                        MPI_NO_SEND_CANCELS | MPI_NO_ANY_SOURCE |
                        MPI_NO_DATATYPE_XLATE);
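
For the cases where a broken promise is easy to detect, the check inside the library could be as small as the following sketch. It assumes the MPI_NO_ANY_SOURCE value proposed above; SKETCH_ANY_SOURCE is a stand-in for MPI_ANY_SOURCE (whose real value is implementation-defined), and sketch_check_recv_source is a hypothetical helper, not part of any MPI library.

```c
#include <stdio.h>

#define MPI_NO_ANY_SOURCE  0x00000002  /* value proposed in this message */
#define SKETCH_ANY_SOURCE  (-1)        /* stand-in for MPI_ANY_SOURCE */

/* Called on every receive before matching begins.
 * Returns 0 if the receive is consistent with the promises made at init
 * time, or -1 if the application used a wildcard source after asserting
 * that it never would. */
int sketch_check_recv_source(int honored_assertions, int source)
{
    if ((honored_assertions & MPI_NO_ANY_SOURCE) != 0 &&
        source == SKETCH_ANY_SOURCE) {
        fprintf(stderr,
                "receive used a wildcard source after asserting "
                "MPI_NO_ANY_SOURCE\n");
        return -1;
    }
    return 0;
}
```

A library that ignores the assertion would pass 0 for honored_assertions, so the wildcard receive would simply proceed with normal MPI semantics.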

I am sorry I will not be at the next meeting to discuss in person but you
can talk to Robert Blackmore.

                      Dick Treumann
Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363