[Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack (?)

Jeff Hammond jeff.science at gmail.com
Wed Sep 16 15:36:50 CDT 2009


I was having the same thoughts during my hideous commute today.
Clearly, future systems may have very smart NICs or OSs which
seamlessly manage communication co-processors to support, among other
things, on-the-fly compression of non-contiguous buffers.  However,
until some vendor starts taking this seriously, I don't think it is a
good idea to ignore the present hardware support for RMA and create a
standard which cannot perform well when the first MPI-3 libraries
become available.

I'm not sure I understand all this talk about assertions upon
initialization, since I was hoping that Raw_xfer and general-purpose
xfer (Gen_xfer) would both always be available but that the former
would be much faster in certain contexts, since implementing the Raw
version close to hardware doesn't require much work and would not
interfere with how Gen_xfer operates.  Am I missing something?
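To sketch what I mean (hypothetical names throughout -- datatype_t,
raw_xfer, pack_xfer, gen_xfer are toy stand-ins, not proposed API, and
memcpy stands in for the wire): the general call can simply fall
through to the raw contiguous path whenever the datatype describes one
contiguous run of bytes, so keeping Raw_xfer available costs Gen_xfer
nothing.

```c
#include <string.h>
#include <stddef.h>

/* Toy sketch: a strided "datatype" with hypothetical fields standing in
 * for an MPI datatype description. */
typedef struct {
    size_t count;    /* number of blocks           */
    size_t blocklen; /* bytes per block            */
    size_t stride;   /* bytes between block starts */
} datatype_t;

int raw_calls = 0, gen_calls = 0;  /* instrumentation for illustration */

/* Models the contiguous hardware put: just moves bytes. */
static void raw_xfer(void *dst, const void *src, size_t nbytes)
{
    raw_calls++;
    memcpy(dst, src, nbytes);
}

/* Models the software (agent-assisted) pack/transfer/unpack path,
 * gathering strided source blocks into a contiguous destination. */
static void pack_xfer(void *dst, const void *src, const datatype_t *dt)
{
    gen_calls++;
    for (size_t i = 0; i < dt->count; i++)
        memcpy((char *)dst + i * dt->blocklen,
               (const char *)src + i * dt->stride,
               dt->blocklen);
}

/* The general-purpose entry point: dispatches to the fast raw path when
 * the description is a single contiguous run of bytes. */
void gen_xfer(void *dst, const void *src, const datatype_t *dt)
{
    if (dt->stride == dt->blocklen || dt->count <= 1)
        raw_xfer(dst, src, dt->count * dt->blocklen); /* fast path    */
    else
        pack_xfer(dst, src, dt);                      /* general path */
}
```

The dispatch is one branch, which is why exposing the raw path need
not interfere with how the general path operates.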

When the day finally comes that NICs can do Gen_xfer for a complex
non-contiguous datatype as fast as Raw_xfer, would not the standard
still be sufficient, since implementers would just produce a faster
Gen_xfer and Raw_xfer would fade into the background?

Raw_xfer is a pragmatic response to the present and near-future state
of HPC systems rather than some sort of marquee feature to tape to the
metaphorical refrigerator door.

Best,

Jeff

On Wed, Sep 16, 2009 at 3:14 PM, Underwood, Keith D
<keith.d.underwood at intel.com> wrote:
> But we have to be very careful here.  We don’t want to overly constrain what
> can be thought of as “fast”.  For example, I think it is perfectly
> reasonable to implement accumulate on a NIC.  Just because it doesn’t exist
> today doesn’t mean that it shouldn’t be part of the “fast” MPI call.
>
>
>
> Now, datatype conversion… it is nominally possible that a NIC could do
> datatype conversion – just like it is nominally possible for a NIC to be
> hooked to a Rube Goldberg device to implement MPI_Make_Breakfast ;-)
>
>
>
> Anyway, the point is that we need to be forward looking in defining “fast”
> and “slow”, not backward looking.
>
>
>
> Keith
>
>
>
> From: mpi3-rma-bounces at lists.mpi-forum.org
> [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Richard Treumann
> Sent: Wednesday, September 16, 2009 2:06 PM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] non-contiguous support in RMA & one-sided
> pack/unpack (?)
>
>
>
> BINGO Jeff
>
> We might also remove the datatype argument and twin count arguments from
> MPI_RMA_Raw_xfer just to eliminate the expectation that basic put/get do
> datatype conversions when origin and target are on heterogeneous nodes.
> There would be a single "count" argument representing the number of
> contiguous bytes to be transferred.
>
> The assertion would be that there is no use of complex RMA. It would give
> the implementation the option to leave its software agent dormant. Note that
> having this assertion as an option for MPI_Init_asserted does not allow an
> MPI implementation to avoid having an agent available. An application that
> does not use the assertion can count on the agent being ready for any call
> to "fully baked" RMA.
>
> Dick
>
> Dick Treumann - MPI Team
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846 Fax (845) 433-8363
>
>
> mpi3-rma-bounces at lists.mpi-forum.org wrote on 09/16/2009 03:43:15 PM:
>
>> Re: [Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack (?)
>>
>> Jeff Hammond
>>
>> to:
>>
>> MPI 3.0 Remote Memory Access working group
>>
>> 09/16/2009 03:44 PM
>>
>> Sent by:
>>
>> mpi3-rma-bounces at lists.mpi-forum.org
>>
>> Please respond to "MPI 3.0 Remote Memory Access working group"
>>
>> I think that there is a need for two interfaces: one which is a
>> portable interface to the low-level truly one-sided bulk transfer
>> operation and another which is completely general and is permitted to
>> do operations which require remote agency.
>>
>> For example, I am aware of no NIC which can do accumulate on its own,
>> hence RMA_ACC_SUM and related operations require remote agency, and
>> thus this category of RMA operations is not truly one-sided.
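To make the point concrete (a hypothetical sketch -- local arrays
stand in for remote memory, and neither function name is proposed
API): a raw put only overwrites target bytes, which a NIC can complete
alone, while an accumulate is a read-modify-write that some agent at
the target has to execute.

```c
#include <stddef.h>

/* What the NIC can do alone: overwrite target bytes with origin bytes.
 * No arithmetic is needed at the target, so no agent is needed. */
void raw_put(double *target, const double *origin, size_t n)
{
    for (size_t i = 0; i < n; i++)
        target[i] = origin[i];
}

/* What RMA_ACC_SUM-style operations require: read the target value,
 * add, and write back.  This loop is exactly the work that a CPU
 * thread, interrupt handler, or hypothetical future NIC must perform
 * at the target -- which is why such operations are not truly
 * one-sided on NICs that can only move bytes. */
void acc_sum_agent(double *target, const double *origin, size_t n)
{
    for (size_t i = 0; i < n; i++)
        target[i] += origin[i];
}
```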
>>
>> Thus the standard might support two xfer calls:
>>
>> MPI_RMA_Raw_xfer(origin_addr, origin_count, origin_datatype,
>> target_mem, target_disp, target_count , target_rank, request)
>>
>> which is exclusively for transferring contiguous bytes from one place
>> to another, i.e. does raw put/get only, and a second, already
>> described, which handles the general case, including accumulation,
>> non-contiguous transfers and other complex operations.
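A call to the raw flavor might look like the following sketch, in
which mock_raw_xfer and rma_mem_t are hypothetical stand-ins (a local
memcpy models the wire) and the datatype arguments are collapsed to a
single contiguous byte count, one variant discussed in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a registered target memory region. */
typedef struct { char *base; } rma_mem_t;

/* Hypothetical stand-in for the proposed raw transfer call: no
 * datatypes, no per-side counts -- just contiguous bytes moved to a
 * displacement in the target region (memcpy models the network put). */
void mock_raw_xfer(const void *origin_addr, rma_mem_t target_mem,
                   size_t target_disp, size_t nbytes)
{
    memcpy(target_mem.base + target_disp, origin_addr, nbytes);
}

/* User code collapses typed counts to byte counts itself: ship n
 * doubles starting at element offset elem_disp of the window. */
void put_doubles(const double *buf, size_t n, rma_mem_t win,
                 size_t elem_disp)
{
    mock_raw_xfer(buf, win, elem_disp * sizeof(double),
                  n * sizeof(double));
}
```

The point is only that the caller, not the transfer layer, does the
count arithmetic, so no datatype machinery is needed on either side.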
>>
>> The distinction over remote agency is extremely important from an
>> implementation perspective since contiguous put/get operations can be
>> performed in a fully asynchronous non-interrupting way with a variety
>> of interconnects, and thus exposing this procedure in the MPI standard
>> will allow for very efficient implementations on some systems.  It
>> should also encourage MPI users to think about their RMA needs and how
>> they might restructure their code to take advantage of the faster
>> flavor of xfer when doing so requires little modification.
>>
>> Jeff
>>
>> On Wed, Sep 16, 2009 at 1:49 PM, Vinod tipparaju
>> <tipparajuv at hotmail.com> wrote:
>> >> My argument is that any RMA depends on a call at the origin being
>> >> able to trigger activity at the target. Modern RMA hardware has the
>> >> hooks to do the remote side of MPI_Fast_RMA_xfer() efficiently based
>> >> on a call at the origin. Because these hooks are in the hardware they
>> >> are simply there. They do not use the CPU or hurt performance of
>> >> things that do use the CPU.
>> >
>> > I read this as an argument that says two interfaces are not necessary.
>> > Having the application author promise (during init) that it will not
>> > do anything that needs an agent is certainly useful. Particularly
>> > when, as you state, "having this agent standing by hurts general
>> > performance".
>> > The things that potentially cannot be done without an agent
>> > (technically, everything but atomics could be done without need for
>> > any agents) are the user's choice through explicit usage. Users choose
>> > these attributes aware of their cost, hence they can indicate ahead of
>> > time that they will not use them.
>> > I have repeatedly considered dropping the atomicity attribute; I am
>> > unable to because it makes programming (and thinking) so much easier
>> > for many applications.
>> > Vinod.
>> >
>> >
>> > ________________________________
>> > To: mpi3-rma at lists.mpi-forum.org
>> > From: treumann at us.ibm.com
>> > Date: Wed, 16 Sep 2009 14:18:15 -0400
>> > Subject: Re: [Mpi3-rma] non-contiguous support in RMA & one-sided
>> > pack/unpack (?)
>> >
>> > The assertion could then be: MPI_NO_SLOW_RMA (also a bit tongue in
>> > cheek)
>> >
>> > My argument is that any RMA depends on a call at the origin being able
>> > to
>> > trigger activity at the target. Modern RMA hardware has the hooks to do
>> > the
>> > remote side of MPI_Fast_RMA_xfer() efficiently based on a call at the
>> > origin. Because these hooks are in the hardware they are simply there.
>> > They
>> > do not use the CPU or hurt performance of things that do use the CPU.
>> >
>> > RMA hardware may not have the hooks to do the target side of any
>> > arbitrary MPI_Slow_RMA_xfer().  As a result, support for the more
>> > complex RMA_xfer may require a wake-able software agent (a thread,
>> > maybe) to be standing by at all tasks just because they may become the
>> > target of a Slow_RMA_xfer.
>> >
>> > If having this agent standing by hurts general performance of MPI
>> > applications that will never make a call to Slow_RMA_xfer, why not let
>> > the application's author promise up front "I have no need of this
>> > agent."
>> >
>> > An MPI implementation that can support Slow_RMA_xfer with no extra costs
>> > (send/recv latency, memory, packet interrupts, CPU contention) will
>> > simply
>> > ignore the assertion.
>> >
>> > BTW - I just took a look at the broad proposal and it may contain
>> > several things that cannot be done without a wake-able remote software
>> > agent.  That argues for Keith's idea of an RMA operation which closely
>> > matches what RMA hardware does and a second one that brings along all
>> > the bells and whistles.
>> > Maybe the assertion for an application that only uses the basic RMA call
>> > or
>> > uses no RMA at all could be MPI_NO_KITCHEN_SINK (even more tongue in
>> > cheek).
>> >
>> >            Dick
>> >
>> >
>> > Dick Treumann - MPI Team
>> > IBM Systems & Technology Group
>> > Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
>> > Tele (845) 433-7846 Fax (845) 433-8363
>> >
>> >
>> > mpi3-rma-bounces at lists.mpi-forum.org wrote on 09/16/2009 01:08:51 PM:
>> >
>> >> Re: [Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack
>> >> (?)
>> >>
>> >> Underwood, Keith D
>> >>
>> >> to:
>> >>
>> >> MPI 3.0 Remote Memory Access working group
>> >>
>> >> 09/16/2009 01:09 PM
>> >>
>> >> Sent by:
>> >>
>> >> mpi3-rma-bounces at lists.mpi-forum.org
>> >>
>> >> Please respond to "MPI 3.0 Remote Memory Access working group"
>> >>
>> >> But, going back to Bill’s point:  performance across a range of
>> >> platforms is key.  While you can’t have a function for every usage
>> >> (well, you can, but it would get cumbersome at some point), it may
>> >> be important to have a few levels of specialization in the API.
>> >> E.g. you could have two variants:
>> >>
>> >> MPI_Fast_RMA_xfer():  no data types, no communicators, etc.
>> >> MPI_Slow_RMA_xfer(): include the kitchen sink.
>> >>
>> >> Yes, the naming is a little tongue in cheek ;-)
>> >>
>> >> Keith
>> >>
>> >> <snip>
>> >
>> > _______________________________________________
>> > mpi3-rma mailing list
>> > mpi3-rma at lists.mpi-forum.org
>> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>> >
>> >
>>
>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> jhammond at mcs.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> http://home.uchicago.edu/~jhammond/
>>
>
>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
jhammond at mcs.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
http://home.uchicago.edu/~jhammond/



