[Mpi3-rma] RMA communication with single network messages

Oliver Mangold mangold at hlrs.de
Mon Jul 11 10:59:57 CDT 2011


What I would like to do with MPI is point-to-point communication with 
minimal overhead/latency, which means the communication

(a) does not need storage of data in intermediate buffers at the destination
(b) does not need more than 1 data transfer (network message) per 
communication

Using MPI point-to-point has the problem that it cannot satisfy both (a) 
and (b) at the same time, as the sent data might reach the destination 
before MPI_Recv is called.

So I am wondering if this will be possible with the improved RMA 
features of MPI-3. The draft paper (as the 2.2 standard also does) says:
> RMA communications fall in two categories:
> * active target communication, where data is moved from the memory of 
> one process
> to the memory of another, and both are explicitly involved in the 
> communication. This
> communication pattern is similar to message passing, except that all 
> the data transfer
> arguments are provided by one process, and the second process only 
> participates in
> the synchronization.
Actually this would be exactly what I wanted. The sender should provide 
all the transfer arguments (including destination memory location) and 
the receiver only waits for the message to arrive.

As I understand MPI-2.2 RMA, the sender has to do:

MPI_Win_start(group, flag, win);
MPI_Put(...,win);
MPI_Win_complete(win);

While the receiver does:

MPI_Win_post(group, flag, win);
MPI_Win_wait(win);

As I understand it, the semantics of MPI RMA require that the 
destination memory is not written before the call to MPI_Win_post(). 
This means that either the receiver must signal the sender that it has 
reached MPI_Win_post(), or the data must be buffered at the receiver, 
resulting in the same problem as with MPI point-to-point. Please correct 
me if I'm wrong.

The problem could be solved if there were windows that are always 
'open', meaning that MPI_Win_start() and MPI_Win_post() would not be 
necessary (only MPI_Win_complete() and MPI_Win_wait(), to inform the 
receiver that the data has arrived). If the implementation merged the 
data transfers needed for MPI_Put and MPI_Win_complete, a single message 
would be sufficient. But I couldn't find a feature in the draft standard 
that helps here. So the question is: is there a way to do what I want 
with MPI-3?

Maybe I should note that windows which are always open would be useful in 
practice, because with pairwise communication (with data always going 
both ways) double buffering would prevent race conditions.

Example (assuming MPI_Win_start() and MPI_Win_post() are not needed):

Process 0:
while () {
   MPI_Put(...,win1a);
   MPI_Win_complete(win1a);
   MPI_Win_wait(win0a);
   ... do computation on data from win0a and write data for win1b ...
   MPI_Put(...,win1b);
   MPI_Win_complete(win1b);
   MPI_Win_wait(win0b);
   ... do computation on data from win0b and write data for win1a ...
}

Process 1:
while () {
   MPI_Put(...,win0a);
   MPI_Win_complete(win0a);
   MPI_Win_wait(win1a);
   ... do computation on data from win1a and write data for win0b ...
   MPI_Put(...,win0b);
   MPI_Win_complete(win0b);
   MPI_Win_wait(win1b);
   ... do computation on data from win1b and write data for win0a ...
}



