[Mpi-forum] [Mpi-22] Another pre-preposal for MPI 2.2 or 3.0

Supalov, Alexander alexander.supalov at intel.com
Thu Apr 24 03:13:19 CDT 2008


What happens if we run beyond 32 or 64 attributes? I think we may rather
need something more scalable than an int, and possibly more hierarchical
than a linear list of attributes. That would map into subsets nicely, by
the way.

Another thing is that in some cases, the attitude of the MPI for each
attribute may be "yes", "no", and "don't care/undefined". I can imagine,
for example, that there's no eager protocol at all, and so no throttle,
albeit in a way different from when there are eager and rendezvous
protocols, but they are well tuned to provide a smooth curve. What will
happen in either case: will MPI proceed or terminate? By having
attributes with values "yes", "no", "tell me" we may be able to
accommodate this more easily than with the bitwise "yes" and "no".

Finally, will we treat thread support level as yet another attribute?
Will we define any query function for these attributes? Will they be
job-wide or communicator-wide?

Best regards.


-----Original Message-----
From: mpi-forum-bounces at lists.mpi-forum.org
[mailto:mpi-forum-bounces at lists.mpi-forum.org] On Behalf Of Jeff Squyres
Sent: Thursday, April 24, 2008 3:18 AM
To: MPI 2.2
Cc: mpi-forum at lists.mpi-forum.org
Subject: Re: [Mpi-forum] [Mpi-22] Another pre-preposal for MPI 2.2 or

I think that this is a generally good idea.

As I understand it, you are stating that this is basically a bit  
stronger than "hints" -- the word "assertions" carries a bit more of a  
connotation that these are strict promises by the user.

On Apr 22, 2008, at 1:38 PM, Richard Treumann wrote:

> I have a proposal for providing information to the MPI  
> implementation at MPI_INIT time to allow certain optimizations  
> within the run. This is not a "hints" mechanism because it does  
> change the semantic rules for MPI in the job run. A correct  
> "vanilla" MPI application could give different results or fail if  
> faulty information is provided.
> I am interested in what the Forum members think about this idea  
> before I try to formalize it.
> I will state up front that I am a skeptic about most of the MPI  
> Subset goals I hear described. However, I think this is a form of  
> subsetting I would support. I say "I think" because it is possible  
> we will find serious complexities that would make me back away. If  
> this looks as straightforward as I expect, perhaps we could look at  
> it for MPI 2.2. The most basic valid implementation of this is a  
> small amount of work for an implementer. (Well within the scope of  
> MPI 2.2 effort / policy)
> ======================================================================
> The MPI standard has a number of thorny semantic requirements that a  
> typical program does not depend on and that an MPI implementation  
> may pay a performance penalty by guaranteeing. A standards-defined  
> mechanism which allows the application to explicitly let libmpi off  
> the hook at MPI_Init time on the ones it does not depend on may  
> allow better performance in some cases. This would be an "assert"  
> rather than a "hints" mechanism because it would be valid for an MPI  
> implementation to fail a job that depends on an MPI feature but lets  
> libmpi off the hook on it at the MPI_Init call. In most, but not all,  
> of these cases the MPI implementation could easily give an error  
> message if the application did something it had promised not to do.
> Here is a partial list of sometimes troublesome semantic requirements.
> 1) MPI_CANCEL on MPI_ISEND probably cannot be correctly supported  
> without adding a message ID to every message sent. Using space in  
> the message header adds cost and may be a complete waste for an  
> application that never tries to cancel an ISEND. (If there is a cost  
> for being prepared to cancel an MPI_RECV we could cover that too)
> 2) MPI_Datatypes that define a contiguous buffer can be optimized if  
> it is known that there will never be a need to translate the data  
> between heterogeneous nodes.   An array of structures, where each  
> structure is a MPI_INT followed by an MPI_FLOAT is likely to be  
> contiguous. An MPI_SEND of count==100 can bypass the datatype engine  
> and be treated as a send of 800 bytes if the destination has the  
> same data representations. An MPI implementation that "knows" it  
> will not need to deal with data conversion can simplify the datatype  
> commit and internal representation by discarding the MPI_INT/ 
> MPI_FLOAT data and just recording that the type is 8 bytes with a  
> stride of 8.
> 3) The MPI standard either requires or strongly urges that an  
> MPI_REDUCE/MPI_ALLREDUCE give exactly the same answer every time. It  
> is not clear to me what that means. If it means a portable MPI like  
> MPICH or OpenMPI must give the same answer whether run on an Intel  
> cluster, an IBM Power cluster or a BlueGene then I would bet no MPI  
> in the world complies. If it means Version 5 of an MPI must give the  
> same answer Version 1 did, it would prevent new algorithms. However,  
> if it means that two "equivalent" reductions in a single application  
> run must agree then perhaps most MPIs comply. Whatever it means,  
> there are applications that do not need any "same answer" promise as  
> long as they can assume they will get a "correct" answer. Maybe they  
> can be provided a faster reduction algorithm.
> 4) MPI supports persistent send/recv which could allow some  
> optimizations in which half rendezvous, pinned memory for RDMA,  
> knowledge that both sides are contiguous buffers etc can be  
> leveraged. The ability to do this is damaged by the fact that the  
> standard requires a persistent send to match a normal receive and a  
> normal send to match a persistent receive. The MPI implementation  
> cannot make any assumptions that a matching send_init and recv_init  
> can be bound together.
> 5) Perhaps MPI pt2pt communication could use a half rendezvous  
> protocol if it were certain no receive would use MPI_ANY_SOURCE. If  
> all receives will use an explicit source then libmpi can have the  
> receive side send a notice to the send side that a receive is  
> waiting. There is no need for the send side to ship the envelope and  
> wait for a reply that the match is found. If MPI_ANY_SOURCE is  
> possible then the send side must always start the transaction. (I am  
> not aware of an issue with MPI_ANY_TAG but maybe somebody can think  
> of one)
> 6) It may be that an MPI implementation that is ready to do a spawn  
> or join must use a more complex matching/progress engine than it  
> would need if it knew the set of connections & networks it had at  
> MPI_Init could never be expanded.
> 7) The MPI standard allows a standard send to use an eager protocol  
> but requires that libmpi promise every eager message can be buffered  
> safely. The MPI implementation must fall back to rendezvous protocol  
> when the promise can no longer be kept. This semantic can be  
> expensive to maintain and produces serious scaling problems. Some  
> applications depend on this semantic but many, especially those  
> designed for massive scale, work in ways that ensure libmpi does not  
> need to throttle eager sends. The applications pace themselves.
> 8) The requirement that the multi WAIT/TEST functions accept mixed  
> arrays of MPI_Requests (the multi WAIT/TEST routines may need special  
> handling in case someone passes both Isend/Irecv requests and  
> MPI_File_ixxx requests to the same MPI_Waitany, for example). I bet  
> applications seldom do this, but it is allowed and must work.
> 9) Would an application promise not to use MPI-IO allow any MPI to  
> do an optimization?
> 10) Would an application promise not to use MPI-1sided allow any MPI  
> to do an optimization?
> 11) What others have I not thought of at all?
> I want to make it clear that none of these MPI_Init time assertions  
> should require an MPI implementation that provides the assert-ready  
> MPI_Init to work differently. For example, the user assertion that  
> her application does not depend on a persistent send matching a  
> normal receive or normal send matching a persistent receive does not  
> require the MPI implementation to suppress such matches. It remains  
> the user's responsibility to create a program that will still work as  
> expected on an MPI implementation that does not change its behavior  
> for any specific assertion.
> For some of these it would not be possible for libmpi to detect that  
> the user really is depending on something he told us we could shut  
> off.
> The interface might look like this:
> int MPI_Init_thread_xxx(int *argc, char *((*argv)[]), int required,  
> int *provided, int assertions)
> mpi.h would define constants like this:
> #define MPI_NO_SEND_CANCELS 0x00000001
> #define MPI_NO_ANY_SOURCE 0x00000002
> #define MPI_NO_REDUCE_CONSTRAINT 0x00000004
> #define MPI_NO_DATATYPE_XLATE 0x00000010
> #define MPI_NO_EAGER_THROTLE 0x00000020
> etc
> The set of valid assertion flags would be specified by the standard  
> as would be their precise meanings. It would always be valid for an  
> application to pass 0 (zero) as the assertions argument. It would  
> always be valid for an MPI implementation to ignore any or all  
> assertions. With a 32-bit integer for assertions, we could define  
> the interface in MPI 2.2 and add more assertions in MPI 3.0 if we  
> wanted to. We could consider a 64-bit assert to keep the door open  
> but I am pretty sure we can get by with 32 distinct assertions.
> An application call would look like: MPI_Init_thread_xxx( 0, 0,  
> I am sorry I will not be at the next meeting to discuss in person  
> but you can talk to Robert Blackmore.
> Dick Treumann
> Dick Treumann - MPI Team/TCEM
> IBM Systems & Technology Group
> Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846 Fax (845) 433-8363

Jeff Squyres
Cisco Systems
