[Mpi-forum] [Mpi-22] Another pre-preposal for MPI 2.2 or 3.0
jsquyres at cisco.com
Wed Apr 23 20:18:29 CDT 2008
I think that this is a generally good idea.
As I understand it, you are stating that this is basically a bit
stronger than "hints" -- the word "assertions" carries a bit more of a
connotation that these are strict promises by the user.
On Apr 22, 2008, at 1:38 PM, Richard Treumann wrote:
> I have a proposal for providing information to the MPI
> implementation at MPI_INIT time to allow certain optimizations
> within the run. This is not a "hints" mechanism because it does
> change the semantic rules for MPI in the job run. A correct
> "vanilla" MPI application could give different results or fail if
> faulty information is provided.
> I am interested in what the Forum members think about this idea
> before I try to formalize it.
> I will state up front that I am a skeptic about most of the MPI
> Subset goals I hear described. However, I think this is a form of
> subsetting I would support. I say "I think" because it is possible
> we will find serious complexities that would make me back away. If
> this looks as straightforward as I expect, perhaps we could look at
> it for MPI 2.2. The most basic valid implementation of this is a
> small amount of work for an implementer. (Well within the scope of
> MPI 2.2 effort / policy)
> The MPI standard has a number of thorny semantic requirements that a
> typical program does not depend on and that an MPI implementation
> may pay a performance penalty by guaranteeing. A standards-defined
> mechanism which allows the application to explicitly let libmpi off
> the hook at MPI_Init time on the ones it does not depend on may
> allow better performance in some cases. This would be an "assert"
> rather than a "hints" mechanism because it would be valid for an MPI
> implementation to fail a job that depends on an MPI feature but lets
> libmpi off the hook on it at the MPI_Init call. In most, but not all,
> of these cases the MPI implementation could easily give an error
> message if the application did something it had promised not to do.
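As a rough illustration of that last point, here is a minimal sketch in plain C (no MPI library involved) of how an implementation might report a send cancel after the application asserted it would never issue one. The flag value is taken from the constants proposed later in this message; `mock_state`, `mock_cancel_send`, and the `MOCK_*` return codes are hypothetical names invented for this sketch.

```c
#include <stdio.h>

/* Flag value copied from the constants proposed later in this message. */
#define MPI_NO_SEND_CANCELS 0x00000001

/* Hypothetical return codes for this sketch; not real MPI error classes. */
enum { MOCK_SUCCESS = 0, MOCK_ERR_ASSERTION_VIOLATED = 1 };

/* Hypothetical library state: the assertions the user passed at init time. */
struct mock_state {
    int assertions;
};

/* Hypothetical stand-in for cancelling a send request. If the user promised
 * never to cancel a send, report the broken promise instead of attempting it. */
static int mock_cancel_send(const struct mock_state *st)
{
    if (st->assertions & MPI_NO_SEND_CANCELS) {
        fprintf(stderr,
                "cancel of a send attempted after asserting MPI_NO_SEND_CANCELS\n");
        return MOCK_ERR_ASSERTION_VIOLATED;
    }
    /* A real library would run its normal cancel protocol here. */
    return MOCK_SUCCESS;
}
```

A library that chose to ignore the assertion would simply skip the check and cancel as usual, which the proposal also permits.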
> Here is a partial list of sometimes troublesome semantic requirements.
> 1) MPI_CANCEL on MPI_ISEND probably cannot be correctly supported
> without adding a message ID to every message sent. Using space in
> the message header adds cost and may be a complete waste for an
> application that never tries to cancel an ISEND. (If there is a cost
> for being prepared to cancel an MPI_RECV we could cover that too)
> 2) MPI_Datatypes that define a contiguous buffer can be optimized if
> it is known that there will never be a need to translate the data
> between heterogeneous nodes. An array of structures, where each
> structure is an MPI_INT followed by an MPI_FLOAT, is likely to be
> contiguous. An MPI_SEND of count==100 can bypass the datatype engine
> and be treated as a send of 800 bytes if the destination has the
> same data representations. An MPI implementation that "knows" it
> will not need to deal with data conversion can simplify the datatype
> commit and internal representation by discarding the MPI_INT/
> MPI_FLOAT data and just recording that the type is 8 bytes with a
> stride of 8.
> 3) The MPI standard either requires or strongly urges that an
> MPI_REDUCE/MPI_ALLREDUCE give exactly the same answer every time. It
> is not clear to me what that means. If it means a portable MPI like
> MPICH or OpenMPI must give the same answer whether run on an Intel
> cluster, an IBM Power cluster or a BlueGene then I would bet no MPI
> in the world complies. If it means Version 5 of an MPI must give the
> same answer Version 1 did, it would prevent new algorithms. However,
> if it means that two "equivalent" reductions in a single application
> run must agree then perhaps most MPIs comply. Whatever it means,
> there are applications that do not need any "same answer" promise as
> long as they can assume they will get a "correct" answer. Maybe they
> can be provided a faster reduction algorithm.
> 4) MPI supports persistent send/recv which could allow some
> optimizations in which half rendezvous, pinned memory for RDMA,
> knowledge that both sides are contiguous buffers etc can be
> leveraged. The ability to do this is damaged by the fact that the
> standard requires a persistent send to match a normal receive and a
> normal send to match a persistent receive. The MPI implementation
> cannot make any assumptions that a matching send_init and recv_init
> can be bound together.
> 5) Perhaps MPI pt2pt communication could use a half rendezvous
> protocol if it were certain no receive would use MPI_ANY_SOURCE. If
> all receives will use an explicit source then libmpi can have the
> receive side send a notice to the send side that a receive is
> waiting. There is no need for the send side to ship the envelope and
> wait for a reply that the match is found. If MPI_ANY_SOURCE is
> possible then the send side must always start the transaction. (I am
> not aware of an issue with MPI_ANY_TAG but maybe somebody can think
> of one)
> 6) It may be that an MPI implementation that is ready to do a spawn
> or join must use a more complex matching/progress engine than it
> would need if it knew the set of connections & networks it had at
> MPI_Init could never be expanded.
> 7) The MPI standard allows a standard send to use an eager protocol
> but requires that libmpi promise every eager message can be buffered
> safely. The MPI implementation must fall back to rendezvous protocol
> when the promise can no longer be kept. This semantic can be
> expensive to maintain and produces serious scaling problems. Some
> applications depend on this semantic but many, especially those
> designed for massive scale, work in ways that ensure libmpi does not
> need to throttle eager sends. The applications pace themselves.
> 8) The requirement that multi WAIT/TEST functions accept mixed arrays
> of MPI_Requests (the multi WAIT/TEST routines may need special
> handling in case someone passes both Isend/Irecv requests and
> MPI_File_ixxx requests to the same MPI_Waitany, for example). I bet
> applications seldom do this but it is allowed and must work.
> 9) Would an application promise not to use MPI-IO allow any MPI to
> do an optimization?
> 10) Would an application promise not to use MPI-1sided allow any MPI
> to do an optimization?
> 11) What others have I not thought of at all?
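To make item 2 above concrete, here is a minimal sketch in plain C with no MPI calls. It assumes the structure is laid out without padding and that no data conversion is ever needed; on a platform where `int` and `float` are both 4 bytes, 100 elements then occupy the 800 bytes mentioned in the item. `elem_t` and `pack_contiguous` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* One element of the array from item 2: an int followed by a float. */
typedef struct {
    int   i;
    float f;
} elem_t;

/* Hypothetical fast path: when no data conversion can ever be needed, the
 * whole array is copied as raw bytes instead of walking the type description.
 * Returns the number of bytes copied, or 0 if dst is too small. */
static size_t pack_contiguous(void *dst, size_t dst_size,
                              const elem_t *src, size_t count)
{
    size_t nbytes = count * sizeof(elem_t);
    if (nbytes > dst_size)
        return 0;
    memcpy(dst, src, nbytes);
    return nbytes;
}
```

The point of the sketch is only that the fast path needs the element size and count, not the MPI_INT/MPI_FLOAT breakdown, once translation between representations is ruled out.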
> I want to make it clear that none of these MPI_Init time assertions
> should require an MPI implementation that provides the assert-ready
> MPI_Init to work differently. For example, the user assertion that
> her application does not depend on a persistent send matching a
> normal receive or normal send matching a persistent receive does not
> require the MPI implementation to suppress such matches. It remains
> the user's responsibility to create a program that will still work as
> expected on an MPI implementation that does not change its behavior
> for any specific assertion.
> For some of these it would not be possible for libmpi to detect that
> the user really is depending on something he told us we could shut
> off.
> The interface might look like this:
> int MPI_Init_thread_xxx(int *argc, char *((*argv)[]), int required,
> int *provided, int assertions)
> mpi.h would define constants like this:
> #define MPI_NO_SEND_CANCELS 0x00000001
> #define MPI_NO_ANY_SOURCE 0x00000002
> #define MPI_NO_REDUCE_CONSTRAINT 0x00000004
> #define MPI_NO_DATATYPE_XLATE 0x00000010
> #define MPI_NO_EAGER_THROTLE 0x00000020
> The set of valid assertion flags would be specified by the standard
> as would be their precise meanings. It would always be valid for an
> application to pass 0 (zero) as the assertions argument. It would
> always be valid for an MPI implementation to ignore any or all
> assertions. With a 32-bit integer for assertions, we could define
> the interface in MPI 2.2 and add more assertions in MPI 3.0 if we
> wanted to. We could consider a 64-bit assert to keep the door open
> but I am pretty sure we can get by with 32 distinct assertions.
> An application call would look like:
> MPI_Init_thread_xxx(0, 0, MPI_THREAD_MULTIPLE, &provided,
> MPI_NO_SEND_CANCELS | MPI_NO_ANY_SOURCE | MPI_NO_DATATYPE_XLATE);
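The rule that an implementation may ignore any or all assertions can be sketched as a simple bit mask. This is an illustration only, written in plain C without an MPI library: the flag values are the ones proposed above, while `MOCK_SUPPORTED` and `mock_accept_assertions` are hypothetical names, and the choice of which assertions the mock library exploits is arbitrary.

```c
/* Assertion flags as proposed in the message above. */
#define MPI_NO_SEND_CANCELS      0x00000001
#define MPI_NO_ANY_SOURCE        0x00000002
#define MPI_NO_REDUCE_CONSTRAINT 0x00000004
#define MPI_NO_DATATYPE_XLATE    0x00000010

/* Hypothetical: the subset of assertions this mock library can exploit. */
#define MOCK_SUPPORTED (MPI_NO_SEND_CANCELS | MPI_NO_DATATYPE_XLATE)

/* Hypothetical helper an init routine might call. It returns the assertions
 * the library will actually act on. Unsupported bits are silently dropped,
 * which the proposal allows; passing 0 is always valid and enables nothing. */
static int mock_accept_assertions(int assertions)
{
    return assertions & MOCK_SUPPORTED;
}
```

With the three flags from the example call, this mock library would act on MPI_NO_SEND_CANCELS and MPI_NO_DATATYPE_XLATE and behave normally with respect to MPI_ANY_SOURCE.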
> I am sorry I will not be at the next meeting to discuss in person
> but you can talk to Robert Blackmore.
> Dick Treumann
> Dick Treumann - MPI Team/TCEM
> IBM Systems & Technology Group
> Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846 Fax (845) 433-8363