[Mpi-22] Another pre-proposal for MPI 2.2 or 3.0

Wed Apr 23 20:18:29 CDT 2008

I think that this is a generally good idea.

As I understand it, you are stating that this is basically a bit  
stronger than "hints" -- the word "assertions" carries a bit more of a  
connotation that these are strict promises by the user.

On Apr 22, 2008, at 1:38 PM, Richard Treumann wrote:

> I have a proposal for providing information to the MPI  
> implementation at MPI_INIT time to allow certain optimizations  
> within the run. This is not a "hints" mechanism because it does  
> change the semantic rules for MPI in the job run. A correct  
> "vanilla" MPI application could give different results or fail if  
> faulty information is provided.
>
> I am interested in what the Forum members think about this idea  
> before I try to formalize it.
>
> I will state up front that I am a skeptic about most of the MPI  
> Subset goals I hear described. However, I think this is a form of  
> subsetting I would support. I say "I think" because it is possible  
> we will find serious complexities that would make me back away. If  
> this looks as straightforward as I expect, perhaps we could look at  
> it for MPI 2.2. The most basic valid implementation of this is a  
> small amount of work for an implementer. (Well within the scope of  
> MPI 2.2 effort / policy)
>
> ======================================================================
>
> The MPI standard has a number of thorny semantic requirements that a  
> typical program does not depend on and that an MPI implementation  
> may pay a performance penalty by guaranteeing. A standards-defined  
> mechanism which allows the application to explicitly let libmpi off  
> the hook at MPI_Init time on the ones it does not depend on may  
> allow better performance in some cases. This would be an "assert"  
> rather than a "hints" mechanism because it would be valid for an MPI  
> implementation to fail a job that depends on an MPI feature but lets  
> libmpi off the hook on it at the MPI_Init call. In most, but not all,  
> of these cases the MPI implementation could easily give an error  
> message if the application did something it had promised not to do.
>
> Here is a partial list of sometimes troublesome semantic requirements.
>
> 1) MPI_CANCEL on MPI_ISEND probably cannot be correctly supported  
> without adding a message ID to every message sent. Using space in  
> the message header adds cost and may be a complete waste for an  
> application that never tries to cancel an ISEND. (If there is a cost  
> for being prepared to cancel an MPI_RECV we could cover that too)
>
> 2) MPI_Datatypes that define a contiguous buffer can be optimized if  
> it is known that there will never be a need to translate the data  
> between heterogeneous nodes. An array of structures, where each  
> structure is an MPI_INT followed by an MPI_FLOAT is likely to be  
> contiguous. An MPI_SEND of count==100 can bypass the datatype engine  
> and be treated as a send of 800 bytes if the destination has the  
> same data representations. An MPI implementation that "knows" it  
> will not need to deal with data conversion can simplify the datatype  
> commit and internal representation by discarding the MPI_INT/ 
> MPI_FLOAT data and just recording that the type is 8 bytes with a  
> stride of 8.
>
> 3) The MPI standard either requires or strongly urges that an  
> MPI_REDUCE/MPI_ALLREDUCE give exactly the same answer every time. It  
> is not clear to me what that means. If it means a portable MPI like  
> MPICH or OpenMPI must give the same answer whether run on an Intel  
> cluster, an IBM Power cluster or a BlueGene then I would bet no MPI  
> in the world complies. If it means Version 5 of an MPI must give the  
> same answer Version 1 did, it would prevent new algorithms. However,  
> if it means that two "equivalent" reductions in a single application  
> run must agree then perhaps most MPIs comply. Whatever it means,  
> there are applications that do not need any "same answer" promise as  
> long as they can assume they will get a "correct" answer. Maybe they  
> can be provided a faster reduction algorithm.
>
> 4) MPI supports persistent send/recv which could allow some  
> optimizations in which half rendezvous, pinned memory for RDMA,  
> knowledge that both sides are contiguous buffers etc can be  
> leveraged. The ability to do this is damaged by the fact that the  
> standard requires a persistent send to match a normal receive and a  
> normal send to match a persistent receive. The MPI implementation  
> cannot make any assumptions that a matching send_init and recv_init  
> can be bound together.
>
> 5) Perhaps MPI pt2pt communication could use a half rendezvous  
> protocol if it were certain no receive would use MPI_ANY_SOURCE. If  
> all receives will use an explicit source then libmpi can have the  
> receive side send a notice to the send side that a receive is  
> waiting. There is no need for the send side to ship the envelope and  
> wait for a reply that the match is found. If MPI_ANY_SOURCE is  
> possible then the send side must always start the transaction. (I am  
> not aware of an issue with MPI_ANY_TAG but maybe somebody can think  
> of one)
>
> 6) It may be that an MPI implementation that is ready to do a spawn  
> or join must use a more complex matching/progress engine than it  
> would need if it knew the set of connections & networks it had at  
> MPI_Init could never be expanded.
>
> 7) The MPI standard allows a standard send to use an eager protocol  
> but requires that libmpi promise every eager message can be buffered  
> safely. The MPI implementation must fall back to rendezvous protocol  
> when the promise can no longer be kept. This semantic can be  
> expensive to maintain and produces serious scaling problems. Some  
> applications depend on this semantic but many, especially those  
> designed for massive scale, work in ways that ensure libmpi does not  
> need to throttle eager sends. The applications pace themselves.
>
> 8) The requirement that the multi WAIT/TEST functions accept mixed  
> arrays of MPI_Requests. (The multi WAIT/TEST routines may need  
> special handling in case someone passes both Isend/Irecv requests  
> and MPI_File_ixxx requests to the same MPI_Waitany, for example.) I  
> bet applications seldom do this, but it is allowed and must work.
>
> 9) Would an application promise not to use MPI-IO allow any MPI to  
> do an optimization?
>
> 10) Would an application promise not to use MPI-1sided allow any MPI  
> to do an optimization?
>
> 11) What others have I not thought of at all?
>
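The arithmetic in item 2 (100 elements sent as 800 bytes) can be checked with a minimal C sketch. It assumes a platform where int and float are both 4 bytes and the compiler inserts no padding between them. The names struct pair, pair_is_packed and raw_send_bytes are made up for illustration; none of them is part of MPI or of the proposal.

```c
#include <stddef.h>

/* Hypothetical element type from item 2: an int followed by a float. */
struct pair {
    int   i;
    float f;
};

/* Returns 1 when struct pair has no padding, so an array of them is one
 * contiguous run of bytes with a stride equal to the element size. */
int pair_is_packed(void)
{
    return sizeof(struct pair) == sizeof(int) + sizeof(float);
}

/* Number of bytes a send of `count` elements would move if the library
 * skipped the datatype engine and treated the buffer as raw memory. */
size_t raw_send_bytes(size_t count)
{
    return count * sizeof(struct pair);
}
```

With 4-byte int and float, raw_send_bytes(100) gives the 800 bytes mentioned in item 2. On a platform with other sizes or padding the shortcut would need a different byte count, or would not apply at all.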
> I want to make it clear that none of these MPI_Init time assertions  
> should require an MPI implementation that provides the assert ready  
> MPI_Init to work differently. For example, the user assertion that  
> her application does not depend on a persistent send matching a  
> normal receive or normal send matching a persistent receive does not  
> require the MPI implementation to suppress such matches. It remains  
> the user's responsibility to create a program that will still work as  
> expected on an MPI implementation that does not change its behavior  
> for any specific assertion.
>
> For some of these it would not be possible for libmpi to detect that  
> the user really is depending on something he told us we could shut  
> off.
>
> The interface might look like this:
> int MPI_Init_thread_xxx(int *argc, char *((*argv)[]), int required,  
> int *provided, int assertions)
>
> mpi.h would define constants like this:
>
> #define MPI_NO_SEND_CANCELS 0x00000001
> #define MPI_NO_ANY_SOURCE 0x00000002
> #define MPI_NO_REDUCE_CONSTRAINT 0x00000004
> #define MPI_NO_DATATYPE_XLATE 0x00000010
> #define MPI_NO_EAGER_THROTLE 0x00000020
> etc
>
> The set of valid assertion flags would be specified by the standard  
> as would be their precise meanings. It would always be valid for an  
> application to pass 0 (zero) as the assertions argument. It would  
> always be valid for an MPI implementation to ignore any or all  
> assertions. With a 32 bit integer for assertions, we could define  
> the interface in MPI 2.2 and add more assertions in MPI 3.0 if we  
> wanted to. We could consider a 64-bit assert to keep the door open  
> but I am pretty sure we can get by with 32 distinct assertions.
>
>
> An application call would look like:
> MPI_Init_thread_xxx(0, 0, MPI_THREAD_MULTIPLE, &provided,
>     MPI_NO_SEND_CANCELS | MPI_NO_ANY_SOURCE | MPI_NO_DATATYPE_XLATE);
>
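As a rough sketch of how such a bitmask could be combined and then inspected, the following C fragment copies four of the flag values from the proposal and tests them. It is only an illustration: the constants exist in no released mpi.h, and the helpers assertion_is_set and needs_message_id are hypothetical names, not anything an MPI library provides.

```c
/* Flag values copied from the proposal text; they are NOT defined by any
 * released mpi.h and are reproduced here only for illustration. */
#define MPI_NO_SEND_CANCELS      0x00000001
#define MPI_NO_ANY_SOURCE        0x00000002
#define MPI_NO_REDUCE_CONSTRAINT 0x00000004
#define MPI_NO_DATATYPE_XLATE    0x00000010

/* Hypothetical helper: nonzero when every bit of `flag` is set in
 * `assertions`. */
int assertion_is_set(int assertions, int flag)
{
    return (assertions & flag) == flag;
}

/* Hypothetical library-side decision for item 1: a message ID in each
 * header is only needed if the application might cancel a send. */
int needs_message_id(int assertions)
{
    return !assertion_is_set(assertions, MPI_NO_SEND_CANCELS);
}
```

An application that passes 0 gets the full semantics (needs_message_id(0) is nonzero), which matches the rule that 0 is always a valid assertions argument. A library that ignores the argument entirely also remains valid under the proposal.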
> I am sorry I will not be at the next meeting to discuss in person  
> but you can talk to Robert Blackmore.
>
>
>
>
> Dick Treumann
> Dick Treumann - MPI Team/TCEM
> IBM Systems & Technology Group
> Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846 Fax (845) 433-8363

-- 
Jeff Squyres
Cisco Systems