[Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack (?)

Underwood, Keith D keith.d.underwood at intel.com
Wed Sep 16 17:26:51 CDT 2009


This is exactly why I think it is critical to think of the dichotomy as fast/slow instead of raw/complex.  What you are proposing is a raw/complex dichotomy that goes faster when you use the raw interface than when you use the complex interface.  But the point of the raw mode is to offer significantly more speed to those who don't need the complex features.  Thus, anything that doesn't exist in the raw-mode interface will inherently be slower.  If hardware adds a feature that isn't part of the raw-mode interface, you have to either change MPI (not something we want to do a lot) or use the complex (a.k.a. slow) interface.  As I pointed out earlier, the flip side of that coin is that if you put something "too hard" in the raw/fast interface, you risk making the fast interface slow.

Keith

> -----Original Message-----
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> bounces at lists.mpi-forum.org] On Behalf Of Jeff Hammond
> Sent: Wednesday, September 16, 2009 3:42 PM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] non-contiguous support in RMA & one-sided
> pack/unpack (?)
>
> Keith,
>
> I don't really think in terms of the fast/slow dichotomy, but rather
> the raw/complex one.  This is really meant to align with the
> philosophical dichotomy of MPI users - one group uses fewer than 10
> functions and is perfectly happy with the results, while another
> group's needs push the limits of the standard, or at least require a
> non-trivial subset of the MPI standard's calls.
>
> Raw_xfer should be simple and fast when that is all that is needed,
> but should not be the only available option.  I contend, as someone
> who has not implemented and most likely never will implement the MPI
> standard, but rather as a heavy user of GA/ARMCI, that excluding
> Raw_xfer from the standard will prevent optimal performance on some
> systems, particularly the ones at the bleeding edge.
>
> I would love it if the standard branch-predicted future hardware and
> defined RMA accordingly, but I think that the raw/complex dichotomy is
> appropriate for what we know to be true now and that it in no way
> handicaps the standard from responding to future hardware as new
> capability is realized.  The only way I would modify Raw_xfer based
> upon my optimistic vision of the future of hardware would be to add a
> stride argument, so that if someone develops hardware support for
> strided transfer over primitive types, it can be exposed seamlessly
> through Raw_xfer.
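>
> A minimal sketch of what such a strided variant might look like (the
> name, types, and argument order here are purely illustrative, not a
> concrete proposal):
>
>   /* Raw transfer of "count" blocks of "blocklen" contiguous bytes,
>    * with "stride" bytes between block starts on each side; setting
>    * stride == blocklen recovers the plain contiguous case. */
>   int MPI_RMA_Raw_xfer_strided(void *origin_addr, MPI_Aint origin_stride,
>                                void *target_mem, MPI_Aint target_disp,
>                                MPI_Aint target_stride, MPI_Aint blocklen,
>                                MPI_Aint count, int target_rank,
>                                MPI_Request *request);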
>
> Dick,
>
> If you believe that MPI implementations can do all the datatype
> decoding required in the general-purpose case with ~very~ little
> overhead, then Raw_xfer is not needed.  However, I'm pessimistic that
> this will be possible.  Local completion of contiguous transfers on
> BlueGene/P occurs in less than 10000 clock cycles.  If the datatype
> cache is large and is not prefetched to L1 cache, it's a challenge to
> decode the general case on that time-scale.  I echo Keith's most
> recent post along these lines.
>
> I appreciate Vinod's comments that BlueGene/P is a special case, but
> few vendors are talking about clock speeds more than 5 times what the
> PPC450 runs at (0.85 GHz), nor have I heard that we're going to get to
> exascale with higher-latency interconnects, so I hope 2-3 microsecond
> latency becomes available on more machines than just BlueGene/X
> varieties.  Given that intra-node memory pipelines aren't getting any
> faster, any MPI function call that has to go to a database, e.g. for
> cached datatypes, is going to take a performance hit now and into the
> future.
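>
> To make that cost concrete, a rough and purely hypothetical sketch of
> the dispatch an implementation faces (none of these internal names
> exist anywhere; this only illustrates the cost argument):
>
>   /* A contiguous, byte-typed request goes straight to the NIC;
>    * anything else has to walk the cached-datatype table, and a cold
>    * lookup alone can eat much of a ~10000-cycle completion budget. */
>   int xfer_dispatch(void *origin_addr, int count, MPI_Datatype dtype,
>                     int target_rank, MPI_Aint target_disp)
>   {
>       if (dtype == MPI_BYTE)                    /* raw, contiguous path */
>           return nic_put(origin_addr, count, target_rank, target_disp);
>       dtype_entry_t *e = dtype_cache_lookup(dtype);     /* table walk */
>       return pack_and_inject(e, origin_addr, count, target_rank,
>                              target_disp);
>   }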
>
> Best,
>
> Jeff
>
> On Wed, Sep 16, 2009 at 3:43 PM, Underwood, Keith D
> <keith.d.underwood at intel.com> wrote:
> > It is a very fine line to walk.  If you have "fast" and "slow"
> > functions that map to what is possible today, then you have a standard
> > which is nominally fast out of the box; however, you are then stuck
> > having to implement EVERYTHING in the "slow" function before you can
> > move that into the "fast" category for users.
> >
> > In contrast, if you branch predict (a little) what networks will be
> > able to do in a couple of years and push on the vendors a bit in the
> > definition of the "fast" interface, then the day the standard ships, it
> > may not be of optimal speed, but you have hope of getting a slightly
> > broader set of functionality in the fast category.
> >
> > Since I don't really expect that the new RMA ops will be available the
> > same day as the first printed standard, I tend to favor the latter
> > approach.  Let's define a subset that is a good subset, rather than
> > simply the subset that exists in DCMF and Verbs.
> >
> > Keith
> >
> >> -----Original Message-----
> >> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> >> bounces at lists.mpi-forum.org] On Behalf Of Jeff Hammond
> >> Sent: Wednesday, September 16, 2009 2:37 PM
> >> To: MPI 3.0 Remote Memory Access working group
> >> Subject: Re: [Mpi3-rma] non-contiguous support in RMA & one-sided
> >> pack/unpack (?)
> >>
> >> I was having the same thoughts during my hideous commute today.
> >> Clearly, future systems may have very smart NICs or OSs which
> >> seamlessly manage communication co-processors to support, among other
> >> things, on-the-fly compression of non-contiguous buffers.  However,
> >> until some vendor starts talking seriously about this, I don't think
> >> it is a good idea to ignore the present hardware support for RMA; we
> >> should create a standard which can perform well when the first MPI-3
> >> libraries become available.
> >>
> >> I'm not sure I understand all this talk about assertions upon
> >> initialization, since I was hoping that Raw_xfer and general-purpose
> >> xfer (Gen_xfer) would both always be available, with the former much
> >> faster in certain contexts; implementing the Raw version close to
> >> hardware doesn't require much work and would not interfere with how
> >> Gen_xfer operates.  Am I missing something?
> >>
> >> When the day finally comes that NICs can do Gen_xfer for a complex
> >> non-contiguous datatype as fast as Raw_xfer, would not the standard
> >> still be sufficient, since implementers would just produce a faster
> >> Gen_xfer and Raw_xfer would fade into the background?
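> >>
> >> As a usage sketch (the xfer call names are the ones floated in this
> >> thread; the argument lists are guesses, not settled proposals):
> >>
> >>   /* Contiguous bulk data: the raw call, bytes only, no remote agent. */
> >>   MPI_RMA_Raw_xfer(buf, nbytes, MPI_BYTE,
> >>                    target_mem, disp, nbytes, target_rank, &req);
> >>
> >>   /* Strided or otherwise complex data: the general call with a full
> >>    * MPI datatype, which may need a software agent at the target. */
> >>   MPI_RMA_Gen_xfer(buf, 1, strided_type,
> >>                    target_mem, disp, 1, strided_type, target_rank, &req);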
> >>
> >> Raw_xfer is a pragmatic response to the present and near-future state
> >> of HPC systems rather than some sort of marquee feature to tape to the
> >> metaphorical refrigerator door.
> >>
> >> Best,
> >>
> >> Jeff
> >>
> >> On Wed, Sep 16, 2009 at 3:14 PM, Underwood, Keith D
> >> <keith.d.underwood at intel.com> wrote:
> >> > But we have to be very careful here.  We don't want to overly
> >> > constrain what can be thought of as "fast".  For example, I think it
> >> > is perfectly reasonable to implement accumulate on a NIC.  Just
> >> > because it doesn't exist today doesn't mean that it shouldn't be part
> >> > of the "fast" MPI call.
> >> >
> >> >
> >> >
> >> > Now, datatype conversion: it is nominally possible that a NIC could
> >> > do datatype conversion - just like it is nominally possible for a NIC
> >> > to be hooked to a Rube Goldberg device to implement
> >> > MPI_Make_Breakfast ;-)
> >> >
> >> >
> >> >
> >> > Anyway, the point is that we need to be forward looking in defining
> >> > "fast" and "slow", not backward looking.
> >> >
> >> >
> >> >
> >> > Keith
> >> >
> >> >
> >> >
> >> > From: mpi3-rma-bounces at lists.mpi-forum.org
> >> > [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Richard
> >> > Treumann
> >> > Sent: Wednesday, September 16, 2009 2:06 PM
> >> > To: MPI 3.0 Remote Memory Access working group
> >> > Subject: Re: [Mpi3-rma] non-contiguous support in RMA & one-sided
> >> > pack/unpack (?)
> >> >
> >> >
> >> >
> >> > BINGO Jeff
> >> >
> >> > We might also remove the datatype argument and twin count arguments
> >> > from MPI_RMA_Raw_xfer just to eliminate the expectation that basic
> >> > put/get do datatype conversions when origin and target are on
> >> > heterogeneous nodes.  There would be a single "count" argument,
> >> > representing the number of contiguous bytes to be transferred.
> >> >
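> >> > As a C prototype, the stripped-down call might look something like
> >> > this (the handle type and argument names are only illustrative):
> >> >
> >> >   int MPI_RMA_Raw_xfer(void *origin_addr,
> >> >                        MPI_RMA_mem target_mem,  /* hypothetical handle */
> >> >                        MPI_Aint target_disp,
> >> >                        MPI_Aint count,          /* contiguous bytes */
> >> >                        int target_rank,
> >> >                        MPI_Request *request);
> >> >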
> >> > The assertion would be that there is no use of complex RMA.  It would
> >> > give the implementation the option to leave its software agent
> >> > dormant.  Note that having this assertion as an option for
> >> > MPI_Init_asserted does not allow an MPI implementation to avoid
> >> > having an agent available.  An application that does not use the
> >> > assertion can count on the agent being ready for any call to "fully
> >> > baked" RMA.
> >> >
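> >> > A sketch of how that promise might look at initialization
> >> > (MPI_Init_asserted and the assertion name are just the ideas floated
> >> > in this thread, not anything in the standard):
> >> >
> >> >   /* The application promises it will never use the complex RMA path,
> >> >    * so the implementation may leave its target-side agent dormant; an
> >> >    * implementation with no agent cost simply ignores the assertion. */
> >> >   MPI_Init_asserted(&argc, &argv, MPI_NO_SLOW_RMA);
> >> >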
> >> > Dick
> >> >
> >> > Dick Treumann - MPI Team
> >> > IBM Systems & Technology Group
> >> > Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> >> > Tele (845) 433-7846 Fax (845) 433-8363
> >> >
> >> >
> >> > mpi3-rma-bounces at lists.mpi-forum.org wrote on 09/16/2009 03:43:15 PM:
> >> >
> >> >> Re: [Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack (?)
> >> >>
> >> >> Jeff Hammond
> >> >>
> >> >> to:
> >> >>
> >> >> MPI 3.0 Remote Memory Access working group
> >> >>
> >> >> 09/16/2009 03:44 PM
> >> >>
> >> >> Sent by:
> >> >>
> >> >> mpi3-rma-bounces at lists.mpi-forum.org
> >> >>
> >> >> Please respond to "MPI 3.0 Remote Memory Access working group"
> >> >>
> >> >> I think that there is a need for two interfaces; one which is a
> >> >> portable interface to the low-level truly one-sided bulk transfer
> >> >> operation and another which is completely general and is permitted
> >> >> to do operations which require remote agency.
> >> >>
> >> >> For example, I am aware of no NIC which can do accumulate on its
> >> >> own, hence RMA_ACC_SUM and related operations require remote agency,
> >> >> and thus this category of RMA operations is not truly one-sided.
> >> >>
> >> >> Thus the standard might support two xfer calls:
> >> >>
> >> >> MPI_RMA_Raw_xfer(origin_addr, origin_count, origin_datatype,
> >> >> target_mem, target_disp, target_count, target_rank, request)
> >> >>
> >> >> which is exclusively for transferring contiguous bytes from one
> >> >> place to another, i.e. does raw put/get only, and the second, which
> >> >> has been described already, which handles the general case,
> >> >> including accumulation, non-contiguous transfers, and other complex
> >> >> operations.
> >> >>
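> >> >> Rendered as a C prototype (the types, and in particular the
> >> >> target_mem handle, are guesses; the argument list is as above):
> >> >>
> >> >>   int MPI_RMA_Raw_xfer(void *origin_addr, int origin_count,
> >> >>                        MPI_Datatype origin_datatype,
> >> >>                        MPI_RMA_mem target_mem,   /* hypothetical */
> >> >>                        MPI_Aint target_disp, int target_count,
> >> >>                        int target_rank, MPI_Request *request);
> >> >>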
> >> >> The distinction over remote agency is extremely important from an
> >> >> implementation perspective, since contiguous put/get operations can
> >> >> be performed in a fully asynchronous, non-interrupting way with a
> >> >> variety of interconnects, and thus exposing this procedure in the
> >> >> MPI standard will allow for very efficient implementations on some
> >> >> systems.  It should also encourage MPI users to think about their
> >> >> RMA needs and how they might restructure their code to take
> >> >> advantage of the faster flavor of xfer when doing so requires little
> >> >> modification.
> >> >>
> >> >> Jeff
> >> >>
> >> >> On Wed, Sep 16, 2009 at 1:49 PM, Vinod tipparaju
> >> >> <tipparajuv at hotmail.com> wrote:
> >> >> >> My argument is that any RMA depends on a call at the origin
> >> >> >> being able to trigger activity at the target. Modern RMA hardware
> >> >> >> has the hooks to do the remote side of MPI_Fast_RMA_xfer()
> >> >> >> efficiently based on a call at the origin. Because these hooks
> >> >> >> are in the hardware they are simply there. They do not use the
> >> >> >> CPU or hurt performance of things that do use the CPU.
> >> >> >
> >> >> > I read this as an argument that says two interfaces are not
> >> >> > necessary.
> >> >> > Having the application author promise (during init) that it will
> >> >> > not do anything that needs an agent is certainly useful,
> >> >> > particularly when, as you state, "having this agent standing by
> >> >> > hurts general performance".
> >> >> > The things that potentially cannot be done without an agent
> >> >> > (technically, everything but atomics could be done without need
> >> >> > for any agents) are the user's choice through explicit usage.
> >> >> > Users choose these attributes being aware of their cost; hence
> >> >> > they can indicate ahead of time when they will not use them.
> >> >> > I have repeatedly considered dropping the atomicity attribute; I
> >> >> > am unable to because it makes programming (and thinking) so much
> >> >> > easier for many applications.
> >> >> > Vinod.
> >> >> >
> >> >> >
> >> >> > ________________________________
> >> >> > To: mpi3-rma at lists.mpi-forum.org
> >> >> > From: treumann at us.ibm.com
> >> >> > Date: Wed, 16 Sep 2009 14:18:15 -0400
> >> >> > Subject: Re: [Mpi3-rma] non-contiguous support in RMA & one-
> sided
> >> >> > pack/unpack (?)
> >> >> >
> >> >> > The assertion could then be: MPI_NO_SLOW_RMA (also a bit tongue
> >> >> > in cheek)
> >> >> >
> >> >> > My argument is that any RMA depends on a call at the origin being
> >> >> > able to trigger activity at the target. Modern RMA hardware has
> >> >> > the hooks to do the remote side of MPI_Fast_RMA_xfer() efficiently
> >> >> > based on a call at the origin. Because these hooks are in the
> >> >> > hardware they are simply there. They do not use the CPU or hurt
> >> >> > performance of things that do use the CPU.
> >> >> >
> >> >> > RMA hardware may not have the hooks to do the target side of any
> >> >> > arbitrary MPI_Slow_RMA_xfer().  As a result, support for the more
> >> >> > complex RMA_xfer may require a wake-able software agent (maybe a
> >> >> > thread) to be standing by at all tasks just because they may
> >> >> > become the target of a Slow_RMA_xfer.
> >> >> >
> >> >> > If having this agent standing by hurts general performance of MPI
> >> >> > applications that will never make a call to Slow_RMA_xfer, why not
> >> >> > let the application's author promise up front "I have no need of
> >> >> > this agent."
> >> >> >
> >> >> > An MPI implementation that can support Slow_RMA_xfer with no
> >> >> > extra costs (send/recv latency, memory, packet interrupts, CPU
> >> >> > contention) will simply ignore the assertion.
> >> >> >
> >> >> > BTW - I just took a look at the broad proposal and it may contain
> >> >> > several things that cannot be done without a wake-able remote
> >> >> > software agent.  That argues for Keith's idea of an RMA operation
> >> >> > which closely matches what RMA hardware does and a second one that
> >> >> > brings along all the bells and whistles.  Maybe the assertion for
> >> >> > an application that only uses the basic RMA call or uses no RMA at
> >> >> > all could be MPI_NO_KITCHEN_SINK (even more tongue in cheek).
> >> >> >
> >> >> >            Dick
> >> >> >
> >> >> >
> >> >> > Dick Treumann - MPI Team
> >> >> > IBM Systems & Technology Group
> >> >> > Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY
> 12601
> >> >> > Tele (845) 433-7846 Fax (845) 433-8363
> >> >> >
> >> >> >
> >> >> > mpi3-rma-bounces at lists.mpi-forum.org wrote on 09/16/2009 01:08:51 PM:
> >> >> >
> >> >> >> Re: [Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack (?)
> >> >> >>
> >> >> >> Underwood, Keith D
> >> >> >>
> >> >> >> to:
> >> >> >>
> >> >> >> MPI 3.0 Remote Memory Access working group
> >> >> >>
> >> >> >> 09/16/2009 01:09 PM
> >> >> >>
> >> >> >> Sent by:
> >> >> >>
> >> >> >> mpi3-rma-bounces at lists.mpi-forum.org
> >> >> >>
> >> >> >> Please respond to "MPI 3.0 Remote Memory Access working group"
> >> >> >>
> >> >> >> But, going back to Bill's point: performance across a range of
> >> >> >> platforms is key.  While you can't have a function for every
> >> >> >> usage (well, you can, but it would get cumbersome at some point),
> >> >> >> it may be important to have a few levels of specialization in the
> >> >> >> API.  E.g. you could have two variants:
> >> >> >>
> >> >> >> MPI_Fast_RMA_xfer():  no data types, no communicators, etc.
> >> >> >> MPI_Slow_RMA_xfer(): include the kitchen sink.
> >> >> >>
> >> >> >> Yes, the naming is a little tongue in cheek ;-)
> >> >> >>
> >> >> >> Keith
> >> >> >>
> >> >> >> <snip>
> >> >> >
> >> >> > _______________________________________________
> >> >> > mpi3-rma mailing list
> >> >> > mpi3-rma at lists.mpi-forum.org
> >> >> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Jeff Hammond
> >> >> Argonne Leadership Computing Facility
> >> >> jhammond at mcs.anl.gov / (630) 252-5381
> >> >> http://www.linkedin.com/in/jeffhammond
> >> >> http://home.uchicago.edu/~jhammond/
> >> >>
> >> >> _______________________________________________
> >> >> mpi3-rma mailing list
> >> >> mpi3-rma at lists.mpi-forum.org
> >> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> >> >
> >> > _______________________________________________
> >> > mpi3-rma mailing list
> >> > mpi3-rma at lists.mpi-forum.org
> >> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Jeff Hammond
> >> Argonne Leadership Computing Facility
> >> jhammond at mcs.anl.gov / (630) 252-5381
> >> http://www.linkedin.com/in/jeffhammond
> >> http://home.uchicago.edu/~jhammond/
> >>
> >> _______________________________________________
> >> mpi3-rma mailing list
> >> mpi3-rma at lists.mpi-forum.org
> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> >
> > _______________________________________________
> > mpi3-rma mailing list
> > mpi3-rma at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> >
>
>
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> jhammond at mcs.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> http://home.uchicago.edu/~jhammond/
>
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma



