[Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack (?)

Vinod tipparaju tipparajuv at hotmail.com
Wed Sep 16 12:18:31 CDT 2009



I am OK with having two interfaces: one that is WRT comm_world and supports only MPI_BYTE, and the other a more general one.


Thanks,
Vinod.



From: keith.d.underwood at intel.com
To: mpi3-rma at lists.mpi-forum.org
Date: Wed, 16 Sep 2009 11:08:51 -0600
Subject: Re: [Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack (?)

But, going back to Bill’s point: performance across a range of platforms is
key. While you can’t have a function for every usage (well, you can, but it
would get cumbersome at some point), it may be important to have a few levels
of specialization in the API. E.g. you could have two variants:

MPI_Fast_RMA_xfer(): no data types, no communicators, etc.
MPI_Slow_RMA_xfer(): include the kitchen sink.

Yes, the naming is a little tongue in cheek ;-)
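
[Editor's note: to make the two levels of specialization concrete, here is a
minimal sketch of what such a pair of signatures might look like. The MPIX_
names, the argument lists, and the presence of a window argument are
illustrative assumptions only; nothing of this form has been proposed or
standardized.]

======================================
#include <mpi.h>
#include <stddef.h>

/* Hypothetical "fast" variant: contiguous bytes only, ranks implicitly
 * relative to MPI_COMM_WORLD, no datatypes, no per-call communicator. */
int MPIX_Fast_rma_xfer(const void *origin, size_t nbytes,
                       int target_rank, MPI_Aint target_disp,
                       MPI_Win win);

/* Hypothetical "slow" (general) variant: full origin/target datatypes,
 * an explicit communicator, and room for per-transfer hints. */
int MPIX_Slow_rma_xfer(const void *origin, int origin_count,
                       MPI_Datatype origin_type,
                       int target_rank, MPI_Aint target_disp,
                       int target_count, MPI_Datatype target_type,
                       MPI_Comm comm, MPI_Win win, MPI_Info hints);
======================================

The point of the split is only that the first call could be dispatched
without ever touching the datatype engine or communicator machinery.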

 

Keith

From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Vinod tipparaju
Sent: Wednesday, September 16, 2009 10:58 AM
To: MPI 3.0 Remote Memory Access working group
Subject: Re: [Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack (?)

> "else" condition is non-trivial when
> one considers the latency of the BlueGene/P interconnect, the memory
> latency, the clock rate and the absence of dynamic communication
> threads. Forgive me for being specific to one machine, but it is the
> benchmark data I have obtained that inspires me to comment on these
> issues.

You have a valid point, but, as you already note, BG is an exception. It is a
"slowed down" exception. In most cases, the latency associated with branching
will have to remain insignificant (by two orders of magnitude) relative to
network latency. Hence, what needs to be done for non-contiguous data with
respect to, and in, the network matters more than a mere branch. What the
network does or doesn't do to support non-contiguous data transfers is
critical.

The input/guidance the forum gives is therefore most valuable when it is taken
into account during network design -- many things can be fixed in an
implementation, but one cannot fix what a network lacks here.

Basically, you believe having two calls instead of one will help because you
believe you can help the implementation by giving hints on what you need from
this particular RMA transfer. I added attributes for a similar reason. Usages
are many; we cannot have an interface for each usage. We can, however, use
attributes to give hints/orders that may help implementations utilize networks
better or know when to give up. We can also support features such as Binding
(discussed in the last meeting) to help reduce latency.
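
[Editor's note: as a strawman of how such per-transfer attributes could be
spelled, here is a minimal sketch that reuses MPI_Info as the hint carrier.
The MPIX_Xfer_get prototype, the hint keys, and the argument list are
hypothetical assumptions for illustration; they do not correspond to any
existing or proposed MPI call.]

======================================
#include <mpi.h>

/* Hypothetical prototype for an xfer-style get that accepts per-transfer
 * hints; no such call exists in MPI. */
int MPIX_Xfer_get(void *origin, int origin_count, MPI_Datatype origin_type,
                  int target_rank, MPI_Aint target_disp,
                  int target_count, MPI_Datatype target_type,
                  MPI_Win win, MPI_Info hints);

/* Fetch one row of a remote matrix of floats, telling the implementation
 * up front that the transfer is contiguous and unordered so it can take
 * its fastest path without any datatype analysis. */
int get_row(float *row, int ncols, int target_rank,
            MPI_Aint row_disp, MPI_Win win)
{
    MPI_Info hints;
    MPI_Info_create(&hints);
    MPI_Info_set(hints, "contiguous", "true");  /* hypothetical hint key */
    MPI_Info_set(hints, "ordering", "none");    /* hypothetical hint key */

    int rc = MPIX_Xfer_get(row, ncols, MPI_FLOAT,
                           target_rank, row_disp,
                           ncols, MPI_FLOAT, win, hints);

    MPI_Info_free(&hints);
    return rc;
}
======================================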

Thanks,
Vinod.

> Date: Wed, 16 Sep 2009 09:28:31 -0500
> From: jeff.science at gmail.com
> To: mpi3-rma at lists.mpi-forum.org
> Subject: Re: [Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack (?)
> 
> > 1) MPI is implemented as a library. There is no connection between the
> > analysis the compiler can do and the MPI implementation so what can be
> > "known" by the compiler cannot be communicated to the implementation.
> 
> Yes, I am well aware of this. What I mean is that I, the programmer,
> can choose to use in my source code the function or function arguments
> which invoke the fastest correct call, thus preventing MPI from
> proceeding through all the logic necessary to decipher the most
> general case.
> 
> For example, if "xfer" contained a flag just for
> contiguous/non-contiguous, I could do the following in my code:
> 

> ======================================
> /* get row */
> ( invoke xfer_get with contig flag set and using the simple datatype
>   for a contiguous vector of floats )
> 
> /* get column */
> ( invoke xfer_get with noncontig flag set and using derived datatype
>   for the column vector of a matrix of floats )
> ======================================

> 
> knowing that MPI can implement "xfer" like this:
> 
> ======================================
> xfer_get ( ... ) {
>     if (contig) {
>         low_level_comm_api_get(...)
>     } else {
>         process_datatype(...)  /* figure out how many contiguous segments
>                                   are in this */
> #if defined(FAST_INJECTION_RATE)
>         (for all contig segments in datatype) {
>             low_level_comm_api_get(...,datatype,...)
>         }
> #elif defined(SLOW_INJECTION_RATE)
>         active_message(invoke_remote_pack(datatype,packed_buffer))
>         low_level_comm_api_get(...,packed_buffer,...)
>         local_unpack(packed_buffer,datatype)
> #endif
>     }
> }
> ======================================

> 
> If you look at ARMCI or the MPI-2 one-sided implementation over DCMF,
> the overhead associated with the "else" condition is non-trivial when
> one considers the latency of the BlueGene/P interconnect, the memory
> latency, the clock rate and the absence of dynamic communication
> threads. Forgive me for being specific to one machine, but it is the
> benchmark data I have obtained that inspires me to comment on these
> issues.
> 
> On machines with capped message injection rates, implementing the
> "else" condition might be best done in a completely different manner
> with one-sided packing, which is why I brought that up in the first
> place. Of course, the overhead of packing is large and thus should be
> reserved exclusively for the case where hardware prefers it, and then
> only for "xfer" calls where it is absolutely necessary, otherwise it
> should be ignored completely.

> 
> > 2) There is no requirement in MPI that all tasks of an application run
> > the same executable so even if one piece of source code that will be
> > used in an application is free of complex datatypes, only the human
> > being who knows if other pieces of code will be used by some tasks can
> > make the judgement that there will be no calls with complex datatypes
> > by any participating task.
> 
> While this may be true for MPI ("no requirement that all tasks of an
> application run the same executable"), it may not be true for all
> systems on which MPI is to be run. Hence, I do not think this is a
> useful solution.
> 
> It is exactly my point that the human knows which calls require
> complex datatypes and should be able to make calls to "xfer" in light
> of this knowledge. Currently the "xfer" API does not support
> bypassing complex datatypes and it is my contention that, in some
> cases, doing so will improve performance markedly.

> 
> Best,
> 
> Jeff
> 
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> jhammond at mcs.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> http://home.uchicago.edu/~jhammond/
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma




