[Mpi3-rma] non-contiguous support in RMA & one-sided pack/unpack (?)

Jeff Hammond jeff.science at gmail.com
Wed Sep 16 09:28:31 CDT 2009


> 1) MPI is implemented as a library. There is no connection between the
> analysis the compiler can do and the MPI implementation so what can be
> "known" by the compiler cannot be communicated to the implementation.

Yes, I am well aware of this.  What I mean is that I, the programmer,
can choose the function or function arguments in my source code that
invoke the fastest correct path, sparing MPI the logic it would
otherwise need to decipher the most general case.

For example, if "xfer" took a flag distinguishing contiguous from
non-contiguous transfers, I could do the following in my code:

======================================
/* get row: invoke xfer_get with the contig flag set, using the simple
   datatype for a contiguous vector of floats */

/* get column: invoke xfer_get with the noncontig flag set, using a
   derived datatype for the column vector of a matrix of floats */
======================================

knowing that MPI can implement "xfer" like this:

======================================
xfer_get(...) {
  if (contig) {
    low_level_comm_api_get(...);
  } else {
    process_datatype(...); /* figure out how many contiguous
                              segments are in this datatype */
#if defined(FAST_INJECTION_RATE)
    for (each contiguous segment in the datatype) {
      low_level_comm_api_get(..., segment, ...);
    }
#elif defined(SLOW_INJECTION_RATE)
    active_message(invoke_remote_pack(datatype, packed_buffer));
    low_level_comm_api_get(..., packed_buffer, ...);
    local_unpack(packed_buffer, datatype);
#endif
  }
}
======================================
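
To be clear about what the "if (contig)" branch has to discover when
the caller provides no hint: the usual heuristic is to compare the
datatype's size against its extent.  A minimal sketch using standard
datatype introspection (the helper name is mine, not part of any
proposal, and the test can be fooled by exotic types):

======================================
#include <mpi.h>

/* Heuristic runtime contiguity check: a datatype generally behaves
   contiguously if its lower bound is zero and its size equals its
   extent.  This is the decision a caller-supplied flag would let the
   implementation skip. */
static int datatype_looks_contiguous(MPI_Datatype dtype)
{
    int size;
    MPI_Aint lb, extent;
    MPI_Type_size(dtype, &size);
    MPI_Type_get_extent(dtype, &lb, &extent);
    return (lb == 0) && ((MPI_Aint)size == extent);
}
======================================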

If you look at ARMCI or the MPI-2 one-sided implementation over DCMF,
the overhead associated with the "else" condition is non-trivial once
one considers the latency of the BlueGene/P interconnect, the memory
latency, the clock rate, and the absence of dynamic communication
threads.  Forgive me for being specific to one machine, but it is the
benchmark data I have obtained there that motivates these comments.

On machines with capped message injection rates, the "else" condition
might be best implemented in a completely different manner, using
one-sided packing, which is why I brought that up in the first place.
Of course, the overhead of packing is large, so it should be reserved
for the cases where the hardware prefers it, and then only for "xfer"
calls where it is absolutely necessary; otherwise it should be skipped
entirely.
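
To make the packing cost concrete, here is roughly what the pack step
looks like locally, assuming the non-contiguous object is one column
of a row-major nrows x ncols matrix of floats described by
MPI_Type_vector (the function name is mine; in the one-sided case a
remote agent would additionally have to run the equivalent of this on
the target):

======================================
#include <mpi.h>
#include <stdlib.h>

/* Sketch of the pack step a pack-based path pays for: gather one
   column of a row-major nrows x ncols float matrix into a contiguous
   buffer.  The caller frees *packed. */
void pack_column(float *matrix, int nrows, int ncols, int col,
                 void **packed, int *packed_bytes)
{
    MPI_Datatype column;
    MPI_Type_vector(nrows, 1, ncols, MPI_FLOAT, &column);
    MPI_Type_commit(&column);

    MPI_Pack_size(1, column, MPI_COMM_WORLD, packed_bytes);
    *packed = malloc(*packed_bytes);

    int position = 0;
    MPI_Pack(matrix + col, 1, column, *packed, *packed_bytes,
             &position, MPI_COMM_WORLD);

    MPI_Type_free(&column);
}
======================================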

> 2) There is no requirement in MPI that all tasks of an application run the
> same executable so even if one piece of source code that will be used in an
> application is free of complex datatypes, only the human being who knows if
> other pieces of code will be used by some tasks can make the judgement that
> there will be no calls with complex datatypes by any participating task.

While this may be true for MPI ("no requirement that all tasks of an
application run the same executable"), it may not be true for all
systems on which MPI is to be run.  Hence, I do not think this is a
useful solution.

It is exactly my point that the human knows which calls require
complex datatypes and should be able to make calls to "xfer" in light
of that knowledge.  Currently the "xfer" API does not support
bypassing complex-datatype processing, and it is my contention that,
in some cases, doing so would improve performance markedly.
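
For concreteness, here is what the two accesses from my earlier
example look like in today's MPI-2 interface, where only the derived
datatype tells the implementation what the programmer already knew.
A minimal sketch, assuming the target exposes a row-major
nrows x ncols matrix of floats in a window created with disp_unit
equal to sizeof(float):

======================================
#include <mpi.h>

/* Fetch one row (contiguous) and one column (strided) of a row-major
   nrows x ncols float matrix exposed in "win" on rank "target". */
void get_row_and_column(float *row_buf, float *col_buf,
                        int nrows, int ncols, int row, int col,
                        int target, MPI_Win win)
{
    MPI_Datatype column;
    MPI_Type_vector(nrows, 1, ncols, MPI_FLOAT, &column);
    MPI_Type_commit(&column);

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);

    /* contiguous: the implementation could go straight to the network */
    MPI_Get(row_buf, ncols, MPI_FLOAT,
            target, (MPI_Aint)row * ncols, ncols, MPI_FLOAT, win);

    /* non-contiguous: contiguity is only discoverable by decoding "column" */
    MPI_Get(col_buf, nrows, MPI_FLOAT,
            target, (MPI_Aint)col, 1, column, win);

    MPI_Win_unlock(target, win);
    MPI_Type_free(&column);
}
======================================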

Best,

Jeff

-- 
Jeff Hammond
Argonne Leadership Computing Facility
jhammond at mcs.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
http://home.uchicago.edu/~jhammond/


