[MPI3 Fortran] MPI non-blocking transfers

Wed Jan 21 07:51:22 CST 2009

On Jan 21, 2009, at 6:04 AM, N.M. Maclaren wrote:

>   1) Most people seem to agree that the semantics of the buffers used
> for MPI non-blocking transfers and pending input/output storage
> affectors are essentially identical, with READ, WRITE and WAIT
> corresponding to MPI_Isend, MPI_IRecv and MPI_Wait (and variations).
>
> Do you agree with this and, if not, why not?

I'm an MPI implementor; I don't know enough about Fortran to answer  
your questions definitively, but I can state what the MPI non-blocking  
send/receive buffer semantics are.

There are several different flavors of non-blocking sends/receives in  
MPI; I'll use MPI_ISEND and MPI_IRECV as token examples ("I" =  
"immediate", meaning that the function returns "immediately",  
potentially before the message has actually been sent or received).

1. When an application invokes MPI_ISEND / MPI_IRECV, it essentially  
hands off the buffer to the MPI implementation and promises not to  
write to the buffer (or, for a receive, rely on its contents) until the  
operation has completed.  Until then, the MPI implementation "owns"  
the buffer.

2. A rule is just about to be passed in MPI-2.2 such that the  
application can still *read* the buffer of a pending *send* (e.g.,  
MPI_ISEND) while the send is ongoing (writing to the buffer while the  
send is ongoing is nonsense, of course).

3. The buffer is specified by a triple of arguments (I'll explain in  
terms of C because of my inexperience with Fortran):

   - void *buffer: a pointer representing the base of the buffer  
(NOTE: it may not actually point to the first byte of the message!)
   - int count: number of elements of the given datatype in the  
message (see the next argument)
   - MPI_Datatype type: the datatype of the message, implying both the  
size and the interpretation of the bytes

MPI has a number of intrinsic datatypes (such as MPI_INTEGER,  
representing a single Fortran INTEGER).  The intrinsic MPI datatypes  
can be combined in several ways to represent complex data structures.   
Hence, it is possible to build up a user-defined MPI_Datatype that  
represents a C struct -- even if the struct has memory "holes" in it.   
As such, MPI_Datatypes can be considered a memory map of (relative  
offset, type) tuples, where the "relative offset" part is relative to  
the (buffer) argument in MPI_ISEND/MPI_IRECV/etc.  MPI_INTEGER could  
therefore be considered a single (0, N-byte integer) tuple (where N is  
whatever is correct for your platform).
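
To make the "memory map of (relative offset, type) tuples" idea  
concrete, here is a minimal sketch in plain C that does not call MPI  
at all.  The names typemap_entry, foo_map, typemap_payload_bytes and  
typemap_resolve are hypothetical and exist only for this illustration;  
a real MPI implementation's internal representation may look quite  
different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration only: none of these names are part of MPI. */
struct foo { int a; double b; char c; };

/* One entry of a type map: where a field sits relative to the buffer
   argument, and how many bytes the basic type there occupies. */
struct typemap_entry {
    size_t offset;  /* relative offset from the buffer argument */
    size_t size;    /* size in bytes of the basic type at that offset */
};

/* Type map for struct foo; the compiler supplies the offsets. */
const struct typemap_entry foo_map[3] = {
    { offsetof(struct foo, a), sizeof(int)    },
    { offsetof(struct foo, b), sizeof(double) },
    { offsetof(struct foo, c), sizeof(char)   },
};

/* Bytes that actually carry data; padding "holes" are not counted. */
size_t typemap_payload_bytes(const struct typemap_entry *map, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += map[i].size;
    return total;
}

/* Turn one entry into an address, given the buffer argument. */
const void *typemap_resolve(const void *buffer,
                            const struct typemap_entry *entry)
{
    assert(buffer != NULL);
    return (const unsigned char *)buffer + entry->offset;
}
```

For a struct foo instance f, typemap_resolve(&f, &foo_map[1]) yields  
the address of f.b, and typemap_payload_bytes(foo_map, 3) is never  
larger than sizeof(struct foo); the difference between the two is the  
padding that a datatype-aware send would not need to transmit.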

A special buffer, denoted by MPI_BOTTOM, is an arbitrarily-fixed place  
in memory (usually 0, but it doesn't have to be).  Since MPI_Datatypes  
are composed of relative offsets, applications can build datatypes  
relative to MPI_BOTTOM for [effectively] direct placement into memory.
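
The arithmetic behind MPI_BOTTOM can be sketched in plain C as "target  
address = base + displacement".  The sketch below assumes a base of  
address 0 as a stand-in for MPI_BOTTOM (which, as noted, need not be 0  
in a real implementation); SKETCH_BOTTOM, displacement_from and  
resolve are hypothetical names.  In real MPI code the absolute  
displacement would normally be obtained with MPI_Get_address rather  
than by casting pointers by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for MPI_BOTTOM, assumed here to be address 0. */
#define SKETCH_BOTTOM ((uintptr_t)0)

/* Displacement of a memory location relative to a base address. */
intptr_t displacement_from(uintptr_t base, const void *location)
{
    return (intptr_t)((uintptr_t)location - base);
}

/* Address that a (base, displacement) pair refers to. */
uintptr_t resolve(uintptr_t base, intptr_t displacement)
{
    return base + (uintptr_t)displacement;
}
```

With int iarray[10], a displacement taken relative to SKETCH_BOTTOM  
resolves back to &iarray[0] when the base is SKETCH_BOTTOM, whereas a  
displacement taken relative to iarray itself is simply the byte offset  
inside the array (3 * sizeof(int) for iarray[3]).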

Some Fortran examples:

     INTEGER i
     CALL MPI_ISEND(i, 1, MPI_INTEGER, ...)
   Sends a single INTEGER starting at the buffer pointed to by i

     INTEGER iarray(10)
     CALL MPI_ISEND(iarray, 10, MPI_INTEGER, ...)
   Sends 10 INTEGERs starting at the buffer pointed to by iarray

     INTEGER iarray(9999)
     CALL MPI_ISEND(iarray, 10, MPI_INTEGER, ...)
   Same as above -- sends the first 10 INTEGERs starting at the buffer  
pointed to by iarray

     INTEGER iarray(9999)
     CALL MPI_ISEND(iarray(37), 10, MPI_INTEGER, ...)
   Sends iarray(37) through iarray(46)

     INTEGER iarray(9999)
    C ..build up a datatype relative to MPI_BOTTOM that points to iarray..
     CALL MPI_ISEND(MPI_BOTTOM, 10, my_datatype, ...)
   Sends the first 10 elements of iarray

Some C examples:

     int i;
     MPI_Isend(&i, 1, MPI_INT, ...);
   Sends 1 int starting at the buffer pointed to by &i

     int i[9999];
     MPI_Isend(&i[37], 10, MPI_INT, ...);
   Sends i[37] through i[46]

     int i[9999];
     /* ..build up MPI_Datatype relative to MPI_BOTTOM that points to &i[0].. */
     MPI_Isend(MPI_BOTTOM, 1, my_datatype, ...);
   Sends i[0]

     struct foo { int a; double b; char c; } foo_instance;
     /* ..build up MPI_Datatype to represent struct foo.. */
     MPI_Isend(&foo_instance, 1, foo_datatype, ...);
   Sends the foo struct (likely only transmitting the data, not the  
"holes")

4. A returned value from MPI_ISEND and MPI_IRECV is a handle that can  
be passed to MPI later to check and see if the communication  
associated with that handle has completed.  There are essentially two  
flavors of the check-for-completion semantic: polling and blocking.

   - MPI_TEST accepts a single request handle and polls to see if the  
associated communication has completed, and essentially returns  
"true" (the communication has completed; the application now owns the  
buffer) or "false" (the communication has not yet completed; MPI still  
owns the buffer).

   - MPI_WAIT accepts a single request handle and blocks until the  
associated communication has completed.  When MPI_WAIT returns, the  
application owns the buffer associated with the communication.

   - There are array versions of MPI_TEST and MPI_WAIT as well; you  
can pass an array of requests to the array flavors of MPI_TEST (e.g.,  
MPI_TESTSOME, where some may complete and some may not) or MPI_WAIT  
(e.g., MPI_WAITALL, where all requests will complete before returning).
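
The difference between the polling and blocking flavors can be shown  
with a toy model in plain C.  This is NOT MPI and says nothing about  
how a real implementation makes progress: toy_request, toy_test,  
toy_wait and toy_waitall are hypothetical names, and a "transfer" is  
assumed to finish after a fixed number of polling steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not MPI: a transfer that finishes after a fixed number
   of progress steps. */
struct toy_request {
    int steps_remaining;  /* progress steps left until completion */
};

/* Polling check in the spirit of MPI_TEST: make one step of progress
   and report whether the transfer is done.  Never blocks.
   true  -> the application owns the buffer again
   false -> the library still owns the buffer */
bool toy_test(struct toy_request *req)
{
    if (req->steps_remaining > 0)
        req->steps_remaining--;
    return req->steps_remaining <= 0;
}

/* Blocking check in the spirit of MPI_WAIT: does not return until the
   transfer is done. */
void toy_wait(struct toy_request *req)
{
    while (!toy_test(req)) {
        /* keep making progress */
    }
}

/* Array flavor in the spirit of "wait for all": every request in the
   array is complete on return. */
void toy_waitall(struct toy_request reqs[], int count)
{
    for (int i = 0; i < count; i++)
        toy_wait(&reqs[i]);
}
```

In this model a request created with three steps remaining reports  
"false" for the first two calls to toy_test and "true" on the third,  
while a single call to toy_wait returns only once the request is  
complete.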

5. All Fortran MPI handles are [currently] expressed as INTEGERs.  The  
MPI implementation takes these integers and converts them to a  
back-end C pointer.  We are contemplating changing this for the  
upcoming F03 MPI bindings to avoid this translation; Fortran handles  
would then likely have the same representation as C MPI handles (i.e.,  
pointers -- or, thought of differently, "very large address-sized  
integers").
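
One way to picture the integer-to-pointer translation is a lookup  
table indexed by the Fortran INTEGER handle.  The sketch below is only  
an assumption about how such a translation could work, not a  
description of any particular MPI implementation; toy_object,  
toy_c2f and toy_f2c are hypothetical names (MPI itself exposes  
conversion functions such as MPI_Request_f2c and MPI_Request_c2f).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_HANDLES 16

/* Hypothetical back-end object that a handle refers to. */
struct toy_object { int payload; };

/* Translation table: integer handle -> back-end C pointer. */
static struct toy_object *toy_table[TOY_MAX_HANDLES];
static int toy_next_free = 0;

/* Register a C object and return its integer handle, or -1 if the
   table is full. */
int toy_c2f(struct toy_object *obj)
{
    if (toy_next_free >= TOY_MAX_HANDLES)
        return -1;
    toy_table[toy_next_free] = obj;
    return toy_next_free++;
}

/* Convert an integer handle back to the C pointer, or NULL if the
   handle was never issued. */
struct toy_object *toy_f2c(int handle)
{
    if (handle < 0 || handle >= toy_next_free)
        return NULL;
    return toy_table[handle];
}
```

Every call that receives an integer handle has to go through a lookup  
like toy_f2c before it can touch the underlying object; that lookup is  
the translation step that pointer-sized handles would avoid.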

Hope that made sense!

-- 
Jeff Squyres
Cisco Systems