[Mpi-forum] [Mpi-22] Another pre-preposal for MPI 2.2 or 3.0

Thu Apr 24 09:25:44 CDT 2008

I am aiming for a balance between simplicity (which leads to affordable
implementation in libmpi and practical use by applications & libraries) and
versatility.  If we standardize something well defined and affordable that
gives 95% of the value, and both MPI implementations and MPI
applications/libraries begin to support/apply it, we come out way ahead.
Assertions even have a good probability of being portable if there are only
a dozen defined.

If we provide unbounded permutations and extensibility, most MPI
implementations will ignore all but a handful and the application
developer will need to invest a lot of effort in setting switches without
being able to assume they are ever read by the MPI implementation.

Dick Treumann  -  MPI Team/TCEM
IBM Systems & Technology Group
Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363

mpi-22-bounces at lists.mpi-forum.org wrote on 04/24/2008 04:13:19 AM:

> Hi,
>
> What happens if we run beyond 32 or 64 attributes? I think we may rather
> need something more scalable than an int, and possibly more hierarchical
> than a linear list of attributes. That would map into subsets nicely, by
> the way.
I avoided the word "attribute" and chose the word "assertion" for a reason.
I would consider the word "promise" except that it feels a bit
anthropomorphic for my taste.
An assertion is a statement by the application that it acts in a way which
does not depend on a specific guarantee in the vanilla standard.
An assertion is not a directive to libmpi to do something different. It
is a promise that the application will be OK if libmpi passes up support
for the specific semantic requirement.  Libmpi is within its rights to
terminate a job if libmpi can recognize the application "lied". Libmpi is
even within its rights to give unexpected results if the application
"lied". For example, if the application really does depend on bitwise
reproducible reduction results and asserts it does not, the application
may give some surprises.
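
To make the reduction example concrete: floating point addition is not
associative, so a reduction that combines the same contributions in a
different order can return a different answer. The following is only a
minimal sketch in plain C with no MPI calls; the function names are made up
for illustration and do not come from any MPI implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Sum left to right: ((v[0] + v[1]) + v[2]) + ... */
double sum_left_to_right(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    volatile double acc = 0.0;   /* volatile: round to double at every step */
    for (size_t i = 0; i < n; i++)
        acc = acc + v[i];
    return acc;
}

/* Sum right to left: v[0] + (v[1] + (v[2] + ...)) */
double sum_right_to_left(const double *v, size_t n)
{
    assert(v != NULL || n == 0);
    volatile double acc = 0.0;
    for (size_t i = n; i > 0; i--)
        acc = v[i - 1] + acc;
    return acc;
}
```

For the input {1.0e30, -1.0e30, 1.0} the first order gives 1.0 and the
second gives 0.0, because the 1.0 is lost when it is added to -1.0e30
first. An application that compares such results bit for bit depends on the
ordering guarantee; one that only needs a "correct" answer within rounding
error does not.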

My feeling is that no matter what we do there will never be more than a
handful of assertions that gain wide support. My fundamental concern with
the subsetting concept is my suspicion that
1) it will end up with 100 or 1000 or 1000000 permutations,
2) supporting all of them would give 100 units of value and be very complex
3) an MPI implementation that tries to support a large number becomes
   untestable
4) a well chosen subset would give 95 units of value
5) consensus on the worthwhile aspects of subsetting is needed before you
   get portability and that will take years to evolve. (maybe forever)
6) writing pluggable libraries will become much harder because each library
   will need to deal with the wide range of "subsets" somebody may plug it
   into.
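
The permutation counts in point 1 are easy to reach. Assuming, purely for
illustration, that each subsetting option is an independent on/off switch
(the subsetting discussions do not define it this way), n switches give 2^n
combinations. A small sketch of the arithmetic, with a made-up function
name:

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct on/off combinations of n independent switches: 2^n. */
uint64_t switch_combinations(unsigned n_switches)
{
    assert(n_switches < 64);   /* 2^64 does not fit in a uint64_t */
    return (uint64_t)1 << n_switches;
}
```

Under that assumption 7 switches already give 128 combinations, 10 give
1024 and 20 give 1048576, which is roughly the 100 / 1000 / 1000000
progression above.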
>
> Another thing is that in some cases, the attitude of the MPI for each
> attribute may be "yes", "no", and "don't care/undefined". I can imagine,
> for example, that there's no eager protocol at all, and so no throttle,
> albeit in a way different from when there are eager and rendezvous
> protocols, but they are well tuned to provide a smooth curve. What will
> happen in either case: will MPI proceed or terminate? By having
> attributes with values "yes", "no", "tell me" we may be able to
> accommodate this easier than with the bitwise "yes" and "no".
Most applications will either depend on a semantic guarantee or will not.
That may not always be easy for the application writer to recognize but
there is no "don't care" needed in this proposal. I suppose someone might
ask "What if the application wants to provide dual code and let the MPI
implementation decide?" That would call for a "don't care" option but it
is not at all clear to me that MPI implementations would often have a
basis for a run time decision to support a semantic guarantee that an
application has said "don't care" for. If support for MPI_CANCEL hurts
performance and the implementation has added logic to support CANCEL when
the MPI_NO_SEND_CANCELS assertion is absent and give better performance
when the MPI_NO_SEND_CANCELS assertion is provided, why would it ever
consider supporting CANCEL in an application where the init time call said
"don't care"?

>
> Finally, will we treat thread support level as yet another attribute?
I am open to considering this.
> Will we define any query function for these attributes? Will they be
> job-wide or communicator-wide?
Assertions are job wide. A query mechanism seems like a reasonable
addition and if the set of valid assertions is defined by the standard, a
query mechanism would be pretty simple. I think the most useful query
response would involve the implementation saying whether it is acting on
the assertion but I could argue for a query that reports what the app has
set. If I write an application and do not code a call to MPI_CANCEL I can
assert MPI_NO_SEND_CANCELS but if my app calls an opaque library that uses
MPI_CANCEL I may not know it does that.
A well written library that depends on a semantic that can be suspended by
assertion may want to have a way to check that the assertion was not made
or at least not affecting libmpi behavior.
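
The two possible query responses could be sketched roughly as follows. This
is not a proposed interface: the struct and both function names are
hypothetical, and the two flag values are simply the example constants from
my original note. The point is only that "what the app set" and "what
libmpi is acting on" are different masks.

```c
#include <assert.h>
#include <stddef.h>

/* Example values from the original proposal; not part of any MPI standard. */
#define MPI_NO_SEND_CANCELS 0x00000001
#define MPI_NO_ANY_SOURCE   0x00000002

/* Hypothetical bookkeeping inside libmpi. */
struct assertion_state {
    int asserted;   /* what the application passed at init time        */
    int supported;  /* assertions this implementation chooses to exploit */
};

/* Query 1: did the application make this assertion? */
int assertion_was_made(const struct assertion_state *s, int flag)
{
    assert(s != NULL);
    return (s->asserted & flag) != 0;
}

/* Query 2: is the implementation actually acting on it? */
int assertion_is_active(const struct assertion_state *s, int flag)
{
    assert(s != NULL);
    return (s->asserted & s->supported & flag) != 0;
}
```

An implementation that ignores every assertion would have supported == 0,
so the second query would always answer "no" while the first still reports
what the application asserted.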

The needs of opaque libraries are another argument for keeping the
assertion list well defined. The library author must be able to predict
which MPI guarantees can be pulled out from under him and that list must
be short enough that, as he writes the library code, he can predict the
spots where the ice may be thin and guard against them. The author of
"Freds_lib" can use a query and has two options if he does not like the
answer. He can issue a fatal error and tell the user: "Assertion
MPI_NO_SEND_CANCELS is incompatible with using Freds_lib. Please remove
this assertion" or he can provide an alternate code path that does not
depend on being able to cancel an MPI_Isend.
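
A rough sketch of that decision inside Freds_lib, in plain C. The query
itself is not defined yet, so the active_assertions argument simply stands
in for whatever a future query call would return; the enum and function
name are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Example value from the original proposal; not part of any MPI standard. */
#define MPI_NO_SEND_CANCELS 0x00000001

enum freds_path { FREDS_USE_CANCEL, FREDS_NO_CANCEL_PATH, FREDS_FATAL };

/* Decide how Freds_lib proceeds given the assertions libmpi is acting on. */
enum freds_path freds_choose_path(int active_assertions, int have_alternate_path)
{
    /* have_alternate_path is treated as a boolean */
    assert(have_alternate_path == 0 || have_alternate_path == 1);

    if ((active_assertions & MPI_NO_SEND_CANCELS) == 0)
        return FREDS_USE_CANCEL;       /* cancel of an MPI_Isend still works */

    if (have_alternate_path)
        return FREDS_NO_CANCEL_PATH;   /* code path that never cancels */

    fprintf(stderr, "Assertion MPI_NO_SEND_CANCELS is incompatible with "
                    "using Freds_lib. Please remove this assertion\n");
    return FREDS_FATAL;                /* caller is expected to end the job */
}
```

With only a handful of standard-defined assertions, a library needs only a
handful of checks like this one.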
>
> Best regards.
>
> Alexander
>
> -----Original Message-----
> From: mpi-forum-bounces at lists.mpi-forum.org
> [mailto:mpi-forum-bounces at lists.mpi-forum.org] On Behalf Of Jeff Squyres
> Sent: Thursday, April 24, 2008 3:18 AM
> To: MPI 2.2
> Cc: mpi-forum at lists.mpi-forum.org
> Subject: Re: [Mpi-forum] [Mpi-22] Another pre-preposal for MPI 2.2 or
> 3.0
>
> I think that this is a generally good idea.
>
> As I understand it, you are stating that this is basically a bit
> stronger than "hints" -- the word "assertions" carries a bit more of a
> connotation that these are strict promises by the user.
>
>
> On Apr 22, 2008, at 1:38 PM, Richard Treumann wrote:
>
> > I have a proposal for providing information to the MPI
> > implementation at MPI_INIT time to allow certain optimizations
> > within the run. This is not a "hints" mechanism because it does
> > change the semantic rules for MPI in the job run. A correct
> > "vanilla" MPI application could give different results or fail if
> > faulty information is provided.
> >
> > I am interested in what the Forum members think about this idea
> > before I try to formalize it.
> >
> > I will state up front that I am a skeptic about most of the MPI
> > Subset goals I hear described. However, I think this is a form of
> > subsetting I would support. I say "I think" because it is possible
> > we will find serious complexities that would make me back away. If
> > this looks as straightforward as I expect, perhaps we could look at
> > it for MPI 2.2. The most basic valid implementation of this is a
> > small amount of work for an implementer. (Well within the scope of
> > MPI 2.2 effort / policy)
> >
> > ======================================================================
> >
> > The MPI standard has a number of thorny semantic requirements that a
> > typical program does not depend on and that an MPI implementation
> > may pay a performance penalty by guaranteeing. A standards-defined
> > mechanism which allows the application to explicitly let libmpi off
> > the hook at MPI_Init time on the ones it does not depend on may
> > allow better performance in some cases. This would be an "assert"
> > rather than a "hints" mechanism because it would be valid for an MPI
> > implementation to fail a job that depends on an MPI feature but lets
> > libmpi off the hook on it at the MPI_Init call. In most, but not all,
> > of these cases the MPI implementation could easily give an error
> > message if the application did something it had promised not to do.
> >
> > Here is a partial list of sometimes troublesome semantic requirements.
> >
> > 1) MPI_CANCEL on MPI_ISEND probably cannot be correctly supported
> > without adding a message ID to every message sent. Using space in
> > the message header adds cost and may be a complete waste for an
> > application that never tries to cancel an ISEND. (If there is a cost
> > for being prepared to cancel an MPI_RECV we could cover that too)
> >
> > 2) MPI_Datatypes that define a contiguous buffer can be optimized if
> > it is known that there will never be a need to translate the data
> > between heterogeneous nodes.   An array of structures, where each
> > structure is a MPI_INT followed by an MPI_FLOAT is likely to be
> > contiguous. An MPI_SEND of count==100 can bypass the datatype engine
> > and be treated as a send of 800 bytes if the destination has the
> > same data representations. An MPI implementation that "knows" it
> > will not need to deal with data conversion can simplify the datatype
> > commit and internal representation by discarding the MPI_INT/
> > MPI_FLOAT data and just recording that the type is 8 bytes with a
> > stride of 8.
> >
> > 3) The MPI standard either requires or strongly urges that an
> > MPI_REDUCE/MPI_ALLREDUCE give exactly the same answer every time. It
> > is not clear to me what that means. If it means a portable MPI like
> > MPICH or OpenMPI must give the same answer whether run on an Intel
> > cluster, an IBM Power cluster or a BlueGene then I would bet no MPI
> > in the world complies. If it means Version 5 of an MPI must give the
> > same answer Version 1 did, it would prevent new algorithms. However,
> > if it means that two "equivalent" reductions in a single application
> > run must agree then perhaps most MPIs comply. Whatever it means,
> > there are applications that do not need any "same answer" promise as
> > long as they can assume they will get a "correct" answer. Maybe they
> > can be provided a faster reduction algorithm.
> >
> > 4) MPI supports persistent send/recv which could allow some
> > optimizations in which half rendezvous, pinned memory for RDMA,
> > knowledge that both sides are contiguous buffers etc can be
> > leveraged. The ability to do this is damaged by the fact that the
> > standard requires a persistent send to match a normal receive and a
> > normal send to match a persistent receive. The MPI implementation
> > cannot make any assumptions that a matching send_init and recv_init
> > can be bound together.
> >
> > 5) Perhaps MPI pt2pt communication could use a half rendezvous
> > protocol if it were certain no receive would use MPI_ANY_SOURCE. If
> > all receives will use an explicit source then libmpi can have the
> > receive side send a notice to the send side that a receive is
> > waiting. There is no need for the send side to ship the envelope and
> > wait for a reply that the match is found. If MPI_ANY_SOURCE is
> > possible then the send side must always start the transaction. (I am
> > not aware of an issue with MPI_ANY_TAG but maybe somebody can think
> > of one)
> >
> > 6) It may be that an MPI implementation that is ready to do a spawn
> > or join must use a more complex matching/progress engine than it
> > would need if it knew the set of connections & networks it had at
> > MPI_Init could never be expanded.
> >
> > 7) The MPI standard allows a standard send to use an eager protocol
> > but requires that libmpi promise every eager message can be buffered
> > safely. The MPI implementation must fall back to rendezvous protocol
> > when the promise can no longer be kept. This semantic can be
> > expensive to maintain and produces serious scaling problems. Some
> > applications depend on this semantic but many, especially those
> > designed for massive scale, work in ways that ensure libmpi does not
> > need to throttle eager sends. The applications pace themselves.
> >
> > 8) The requirement that multi WAIT/TEST functions accept mixed arrays
> > of MPI_Requests. (The multi WAIT/TEST routines may need special
> > handling in case someone passes both Isend/Irecv requests and
> > MPI_File_ixxx requests to the same MPI_Waitany, for example.) I bet
> > applications seldom do this but it is allowed and must work.
> >
> > 9) Would an application promise not to use MPI-IO allow any MPI to
> > do an optimization?
> >
> > 10) Would an application promise not to use MPI-1sided allow any MPI
> > to do an optimization?
> >
> > 11) What others have I not thought of at all?
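
The 800 byte figure in item 2 above can be checked with a few lines of
plain C. This is only a sketch: the struct and function names are made up,
and it assumes the usual 4 byte int and 4 byte float with no padding
between them, which the C language does not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* One array element from item 2: an int followed by a float. */
struct int_float {
    int   i;   /* would be described to MPI as MPI_INT   */
    float f;   /* would be described to MPI as MPI_FLOAT */
};

/* Bytes occupied by `count` elements when the struct has no padding. */
size_t contiguous_bytes(size_t count)
{
    /* True on common ABIs; if it fails, the type is not contiguous. */
    assert(sizeof(struct int_float) == sizeof(int) + sizeof(float));
    return count * sizeof(struct int_float);
}
```

When that holds, a send of count==100 covers one unbroken run of 800 bytes,
so an implementation that knows no data conversion is needed could record
the type as "8 bytes with a stride of 8" and skip the datatype engine.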
> >
> > I want to make it clear that none of these MPI_Init time assertions
> > should require an MPI implementation that provides the assert ready
> > MPI_Init to work differently. For example, the user assertion that
> > her application does not depend on a persistent send matching a
> > normal receive or normal send matching a persistent receive does not
> > require the MPI implementation to suppress such matches. It remains
> > the user's responsibility to create a program that will still work as
> > expected on an MPI implementation that does not change its behavior
> > for any specific assertion.
> >
> > For some of these it would not be possible for libmpi to detect that
> > the user really is depending on something he told us we could shut
> > off.
> >
> > The interface might look like this:
> > int MPI_Init_thread_xxx(int *argc, char *((*argv)[]), int required,
> > int *provided, int assertions)
> >
> > mpi.h would define constants like this:
> >
> > #define MPI_NO_SEND_CANCELS 0x00000001
> > #define MPI_NO_ANY_SOURCE 0x00000002
> > #define MPI_NO_REDUCE_CONSTRAINT 0x00000004
> > #define MPI_NO_DATATYPE_XLATE 0x00000010
> > #define MPI_NO_EAGER_THROTLE 0x00000020
> > etc
> >
> > The set of valid assertion flags would be specified by the standard
> > as would be their precise meanings. It would always be valid for an
> > application to pass 0 (zero) as the assertions argument. It would
> > always be valid for an MPI implementation to ignore any or all
> > assertions. With a 32-bit integer for assertions, we could define
> > the interface in MPI 2.2 and add more assertions in MPI 3.0 if we
> > wanted to. We could consider a 64-bit assert to keep the door open
> > but I am pretty sure we can get by with 32 distinct assertions.
> >
> >
> > An application call would look like: MPI_Init_thread_xxx( 0, 0,
> > MPI_THREAD_MULTIPLE, &provided,
> > MPI_NO_SEND_CANCELS | MPI_NO_ANY_SOURCE | MPI_NO_DATATYPE_XLATE);
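
How an implementation might consume that bitmask can be sketched without
any real MPI library. The flag values are the example constants listed
above; IMPL_EXPLOITS, ALL_DEFINED and mock_init_assertions are hypothetical
names invented for this illustration and are not part of the proposal.

```c
#include <assert.h>

/* Example constants as listed in the proposal. */
#define MPI_NO_SEND_CANCELS      0x00000001
#define MPI_NO_ANY_SOURCE        0x00000002
#define MPI_NO_REDUCE_CONSTRAINT 0x00000004
#define MPI_NO_DATATYPE_XLATE    0x00000010
#define MPI_NO_EAGER_THROTLE     0x00000020

/* Hypothetical: the assertions this particular libmpi chooses to exploit. */
#define IMPL_EXPLOITS (MPI_NO_SEND_CANCELS | MPI_NO_DATATYPE_XLATE)

/* Every flag defined so far; any other bit is not a valid assertion. */
#define ALL_DEFINED (MPI_NO_SEND_CANCELS | MPI_NO_ANY_SOURCE | \
                     MPI_NO_REDUCE_CONSTRAINT | MPI_NO_DATATYPE_XLATE | \
                     MPI_NO_EAGER_THROTLE)

/* Return the subset of the application's assertions that libmpi acts on. */
int mock_init_assertions(int assertions)
{
    assert((assertions & ~ALL_DEFINED) == 0);  /* only defined bits allowed */
    return assertions & IMPL_EXPLOITS;         /* ignoring the rest is valid */
}
```

Passing 0 yields 0, and passing the three flags from the call above yields
only the two this mock implementation exploits; the MPI_NO_ANY_SOURCE
assertion is silently ignored, which is always permitted.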
> >
> > I am sorry I will not be at the next meeting to discuss in person
> > but you can talk to Robert Blackmore.
> >
> >
> >
> >
> > Dick Treumann
> > Dick Treumann - MPI Team/TCEM
> > IBM Systems & Technology Group
> > Dept 0lva / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> > Tele (845) 433-7846 Fax (845) 433-8363
> > _______________________________________________
> > mpi-22 mailing list
> > mpi-22 at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-22
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> mpi-forum mailing list
> mpi-forum at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum