[Mpi-forum] [Mpi-22] Another pre-preposal for MPI 2.2 or 3.0

Thu Apr 24 06:50:08 CDT 2008

Good points.  I'm also a little uncomfortable with just 32 attributes  
-- 32 seems like a big number right now, but we wouldn't want to be  
accused of only thinking of a world where you only need 640k of  
RAM.  ;-)  I would also like to keep the door open to implementation- 
specific attributes.

The obvious arbitrary-storage candidate is MPI_Info, but to be able to  
set this stuff during MPI_INIT means that the Info functions have to  
be available before MPI_INIT (I think this came up before).

Also, it might be worthwhile to have the MPI return the set of  
assertions that it was / was not able to support in some kind of  
definitive way, so that you can know that MPI X *supports* assertion  
Y, whereas MPI A *doesn't care* about assertion B, etc. -- similar to  
how the thread level is returned now.
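A rough sketch of what such a return could look like, by analogy with the `provided` argument of MPI_Init_thread. Everything in it is hypothetical: `negotiate_assertions`, the `supported` mask, and the `ASSERT_*` names are made up for illustration and are not part of any MPI standard or implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical assertion flags; names and values are illustrative only. */
enum {
    ASSERT_NO_SEND_CANCELS = 0x1,
    ASSERT_NO_ANY_SOURCE   = 0x2,
    ASSERT_NO_SPAWN        = 0x4
};

/*
 * Split the requested assertions into those the library will act on
 * ("honored") and those it accepts but does not care about ("ignored").
 * `supported` stands in for whatever a given implementation can exploit.
 */
void negotiate_assertions(unsigned requested, unsigned supported,
                          unsigned *honored, unsigned *ignored)
{
    assert(honored != NULL && ignored != NULL);
    *honored = requested & supported;   /* MPI X *supports* assertion Y   */
    *ignored = requested & ~supported;  /* MPI A *doesn't care* about B   */
}
```

An application could then inspect `honored` after initialization, much as it checks the provided thread level today.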

On Apr 24, 2008, at 4:13 AM, Supalov, Alexander wrote:

> Hi,
>
> What happens if we run beyond 32 or 64 attributes? I think we may rather
> need something more scalable than an int, and possibly more hierarchical
> than a linear list of attributes. That would map into subsets nicely, by
> the way.
>
> Another thing is that in some cases, the attitude of the MPI for each
> attribute may be "yes", "no", and "don't care/undefined". I can imagine,
> for example, that there's no eager protocol at all, and so no throttle,
> albeit in a way different from when there are eager and rendezvous
> protocols, but they are well tuned to provide a smooth curve. What will
> happen in either case: will MPI proceed or terminate? By having
> attributes with values "yes", "no", "tell me" we may be able to
> accommodate this easier than with the bitwise "yes" and "no".
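One possible encoding of such three-valued attributes, sketched under stated assumptions: two bits per attribute packed into a 64-bit word. The type `attr_state` and the helpers `attr_set` / `attr_get` are hypothetical names, and this packing still has a fixed upper limit (32 attributes here), so it does not address the scalability concern raised above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tri-state values, two bits per attribute. */
typedef enum { ATTR_DONT_CARE = 0, ATTR_NO = 1, ATTR_YES = 2 } attr_state;

/* Store `state` for attribute number `index` (0..31) in a 64-bit word. */
uint64_t attr_set(uint64_t word, unsigned index, attr_state state)
{
    assert(index < 32);
    unsigned shift = index * 2;
    word &= ~((uint64_t)3 << shift);           /* clear the old two bits */
    return word | ((uint64_t)state << shift);  /* write the new state    */
}

/* Read back the state of attribute number `index` (0..31). */
attr_state attr_get(uint64_t word, unsigned index)
{
    assert(index < 32);
    return (attr_state)((word >> (index * 2)) & 3);
}
```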
>
> Finally, will we treat thread support level as yet another attribute?
> Will we define any query function for these attributes? Will they be
> job-wide or communicator-wide?
>
> Best regards.
>
> Alexander
>
> -----Original Message-----
> From: mpi-forum-bounces at lists.mpi-forum.org
> [mailto:mpi-forum-bounces at lists.mpi-forum.org] On Behalf Of Jeff Squyres
> Sent: Thursday, April 24, 2008 3:18 AM
> To: MPI 2.2
> Cc: mpi-forum at lists.mpi-forum.org
> Subject: Re: [Mpi-forum] [Mpi-22] Another pre-preposal for MPI 2.2 or 3.0
>
> I think that this is a generally good idea.
>
> As I understand it, you are stating that this is basically a bit
> stronger than "hints" -- the word "assertions" carries a bit more of a
> connotation that these are strict promises by the user.
>
>
> On Apr 22, 2008, at 1:38 PM, Richard Treumann wrote:
>
>> I have a proposal for providing information to the MPI
>> implementation at MPI_INIT time to allow certain optimizations
>> within the run. This is not a "hints" mechanism because it does
>> change the semantic rules for MPI in the job run. A correct
>> "vanilla" MPI application could give different results or fail if
>> faulty information is provided.
>>
>> I am interested in what the Forum members think about this idea
>> before I try to formalize it.
>>
>> I will state up front that I am a skeptic about most of the MPI
>> Subset goals I hear described. However, I think this is a form of
>> subsetting I would support. I say "I think" because it is possible
>> we will find serious complexities that would make me back away. If
>> this looks as straightforward as I expect, perhaps we could look at
>> it for MPI 2.2. The most basic valid implementation of this is a
>> small amount of work for an implementer. (Well within the scope of
>> MPI 2.2 effort / policy)
>>
>> =====================================================================
>>
>> The MPI standard has a number of thorny semantic requirements that a
>> typical program does not depend on and that an MPI implementation
>> may pay a performance penalty by guaranteeing. A standards-defined
>> mechanism which allows the application to explicitly let libmpi off
>> the hook at MPI_Init time on the ones it does not depend on may
>> allow better performance in some cases. This would be an "assert"
>> rather than a "hints" mechanism because it would be valid for an MPI
>> implementation to fail a job that depends on an MPI feature but lets
>> libmpi off the hook on it at the MPI_Init call. In most, but not all,
>> of these cases the MPI implementation could easily give an error
>> message if the application did something it had promised not to do.
>>
>> Here is a partial list of sometimes troublesome semantic requirements.
>>
>> 1) MPI_CANCEL on MPI_ISEND probably cannot be correctly supported
>> without adding a message ID to every message sent. Using space in
>> the message header adds cost and may be a complete waste for an
>> application that never tries to cancel an ISEND. (If there is a cost
>> for being prepared to cancel an MPI_RECV we could cover that too)
>>
>> 2) MPI_Datatypes that define a contiguous buffer can be optimized if
>> it is known that there will never be a need to translate the data
>> between heterogeneous nodes. An array of structures, where each
>> structure is an MPI_INT followed by an MPI_FLOAT, is likely to be
>> contiguous. An MPI_SEND of count==100 can bypass the datatype engine
>> and be treated as a send of 800 bytes if the destination has the
>> same data representations. An MPI implementation that "knows" it
>> will not need to deal with data conversion can simplify the datatype
>> commit and internal representation by discarding the MPI_INT/
>> MPI_FLOAT data and just recording that the type is 8 bytes with a
>> stride of 8.
>>
>> 3) The MPI standard either requires or strongly urges that an
>> MPI_REDUCE/MPI_ALLREDUCE give exactly the same answer every time. It
>> is not clear to me what that means. If it means a portable MPI like
>> MPICH or OpenMPI must give the same answer whether run on an Intel
>> cluster, an IBM Power cluster or a BlueGene then I would bet no MPI
>> in the world complies. If it means Version 5 of an MPI must give the
>> same answer Version 1 did, it would prevent new algorithms. However,
>> if it means that two "equivalent" reductions in a single application
>> run must agree then perhaps most MPIs comply. Whatever it means,
>> there are applications that do not need any "same answer" promise as
>> long as they can assume they will get a "correct" answer. Maybe they
>> can be provided a faster reduction algorithm.
>>
>> 4) MPI supports persistent send/recv which could allow some
>> optimizations in which half rendezvous, pinned memory for RDMA,
>> knowledge that both sides are contiguous buffers etc can be
>> leveraged. The ability to do this is damaged by the fact that the
>> standard requires a persistent send to match a normal receive and a
>> normal send to match a persistent receive. The MPI implementation
>> cannot make any assumptions that a matching send_init and recv_init
>> can be bound together.
>>
>> 5) Perhaps MPI pt2pt communication could use a half rendezvous
>> protocol if it were certain no receive would use MPI_ANY_SOURCE. If
>> all receives will use an explicit source then libmpi can have the
>> receive side send a notice to the send side that a receive is
>> waiting. There is no need for the send side to ship the envelope and
>> wait for a reply that the match is found. If MPI_ANY_SOURCE is
>> possible then the send side must always start the transaction. (I am
>> not aware of an issue with MPI_ANY_TAG but maybe somebody can think
>> of one)
>>
>> 6) It may be that an MPI implementation that is ready to do a spawn
>> or join must use a more complex matching/progress engine than it
>> would need if it knew the set of connections & networks it had at
>> MPI_Init could never be expanded.
>>
>> 7) The MPI standard allows a standard send to use an eager protocol
>> but requires that libmpi promise every eager message can be buffered
>> safely. The MPI implementation must fall back to rendezvous protocol
>> when the promise can no longer be kept. This semantic can be
>> expensive to maintain and produces serious scaling problems. Some
>> applications depend on this semantic but many, especially those
>> designed for massive scale, work in ways that ensure libmpi does not
>> need to throttle eager sends. The applications pace themselves.
>>
>> 8) The requirement that multi WAIT/TEST functions accept mixed arrays
>> of MPI_Requests (the multi WAIT/TEST routines may need special
>> handling in case someone passes both Isend/Irecv requests and
>> MPI_File_ixxx requests to the same MPI_Waitany, for example). I bet
>> applications seldom do this, but it is allowed and must work.
>>
>> 9) Would an application promise not to use MPI-IO allow any MPI to
>> do an optimization?
>>
>> 10) Would an application promise not to use MPI-1sided allow any MPI
>> to do an optimization?
>>
>> 11) What others have I not thought of at all?
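To make item 2 concrete, here is a small sketch of the contiguity check it describes. It uses a plain C struct as a stand-in for the "MPI_INT followed by MPI_FLOAT" element; `struct pair` and both helper functions are hypothetical names, and the 800-byte figure holds only where `int` and `float` are each 4 bytes, as on most current platforms.

```c
#include <stddef.h>

/* Stand-in for the "MPI_INT followed by MPI_FLOAT" element in item 2. */
struct pair {
    int   i;
    float f;
};

/*
 * Returns 1 when struct pair has no internal or trailing padding, so an
 * array of them is one flat byte range that could bypass a datatype
 * engine when sender and receiver share the same data representation.
 */
int pair_is_contiguous(void)
{
    return offsetof(struct pair, f) == sizeof(int) &&
           sizeof(struct pair) == sizeof(int) + sizeof(float);
}

/* Byte length of a flat send of `count` elements. */
size_t pair_send_bytes(size_t count)
{
    return count * sizeof(struct pair);
}
```

With 4-byte `int` and `float`, `pair_send_bytes(100)` is the 800 bytes mentioned in item 2, and the element has a size and stride of 8.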
>>
>> I want to make it clear that none of these MPI_Init time assertions
>> should require an MPI implementation that provides the assert ready
>> MPI_Init to work differently. For example, the user assertion that
>> her application does not depend on a persistent send matching a
>> normal receive or normal send matching a persistent receive does not
>> require the MPI implementation to suppress such matches. It remains
>> the user's responsibility to create a program that will still work as
>> expected on an MPI implementation that does not change its behavior
>> for any specific assertion.
>>
>> For some of these it would not be possible for libmpi to detect that
>> the user really is depending on something he told us we could shut
>> off.
>>
>> The interface might look like this:
>> int MPI_Init_thread_xxx(int *argc, char *((*argv)[]), int required,
>> int *provided, int assertions)
>>
>> mpi.h would define constants like this:
>>
>> #define MPI_NO_SEND_CANCELS 0x00000001
>> #define MPI_NO_ANY_SOURCE 0x00000002
>> #define MPI_NO_REDUCE_CONSTRAINT 0x00000004
>> #define MPI_NO_DATATYPE_XLATE 0x00000010
>> #define MPI_NO_EAGER_THROTTLE 0x00000020
>> etc
>>
>> The set of valid assertion flags would be specified by the standard
>> as would be their precise meanings. It would always be valid for an
>> application to pass 0 (zero) as the assertions argument. It would
>> always be valid for an MPI implementation to ignore any or all
>> assertions. With a 32 bit integer for assertions, we could define
>> the interface in MPI 2.2 and add more assertions in MPI 3.0 if we
>> wanted to. We could consider a 64-bit assert to keep the door open
>> but I am pretty sure we can get by with 32 distinct assertions.
>>
>>
>> An application call would look like: MPI_Init_thread_xxx( 0, 0,
>> MPI_THREAD_MULTIPLE, &provided,
>> MPI_NO_SEND_CANCELS | MPI_NO_ANY_SOURCE | MPI_NO_DATATYPE_XLATE);
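For illustration only, a library receiving that bitmask might test individual assertions roughly as below. The flag names and values are copied from the proposal and do not exist in any real mpi.h; `KNOWN_ASSERTIONS` and the three helper functions are hypothetical names invented for this sketch.

```c
/* Values copied from the proposal; none of these exist in a real mpi.h. */
#define MPI_NO_SEND_CANCELS      0x00000001
#define MPI_NO_ANY_SOURCE        0x00000002
#define MPI_NO_REDUCE_CONSTRAINT 0x00000004
#define MPI_NO_DATATYPE_XLATE    0x00000010

/* Mask of every flag this sketch knows about (hypothetical). */
#define KNOWN_ASSERTIONS (MPI_NO_SEND_CANCELS | MPI_NO_ANY_SOURCE | \
                          MPI_NO_REDUCE_CONSTRAINT | MPI_NO_DATATYPE_XLATE)

/* Nonzero if the application promised never to cancel a send. */
int may_skip_message_ids(int assertions)
{
    return (assertions & MPI_NO_SEND_CANCELS) != 0;
}

/* Nonzero if the application promised never to use MPI_ANY_SOURCE. */
int may_use_half_rendezvous(int assertions)
{
    return (assertions & MPI_NO_ANY_SOURCE) != 0;
}

/* Bits this sketch does not recognize; a library is free to ignore them. */
int unknown_assertions(int assertions)
{
    return assertions & ~KNOWN_ASSERTIONS;
}
```

Passing 0 leaves every helper returning 0, which matches the rule that an application may always assert nothing.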
>>
>> I am sorry I will not be at the next meeting to discuss in person
>> but you can talk to Robert Blackmore.
>>
>>
>>
>>
>> Dick Treumann
>> Dick Treumann - MPI Team/TCEM
>> IBM Systems & Technology Group
>> Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
>> Tele (845) 433-7846 Fax (845) 433-8363
>> _______________________________________________
>> mpi-22 mailing list
>> mpi-22 at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-22
>
>
> -- 
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> mpi-forum mailing list
> mpi-forum at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum

-- 
Jeff Squyres
Cisco Systems