[Mpi3-rma] Remote Completion Question/Comment/Contribution

Lars Schneidenbach lschneid at cs.uni-potsdam.de
Wed Jan 28 11:23:36 CST 2009


Hi All,

At the EuroPVM/MPI 2008 conference I was invited to comment on the RMA 
proposal (sorry for the delay). I presented a paper there on synchronization 
issues in the MPI-2 RMA standard, so my comments will focus on 
synchronization in the MPI-3 RMA proposal.

I was especially looking at the Remote Completion parts. If I understand this 
right, there are two mechanisms to control Remote Completion:
1) The Remote Completion attribute requires remote completion before the 
operation can complete ("When Remote Completion attribute is set ..., 
completion guarantees that the operation has been completed on the 
Target...").

2) The MPI_RMA_Complete call explicitly forces remote completion of an 
operation. The proposal says "When MPI_RMA_Complete returns, all previously 
issued RMA operations are completed at the specified targets". Does this make 
MPI_RMA_Complete a blocking call? Is there a way to check for completion in a 
non-blocking fashion, similar to MPI_Test for send/recv?
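
To make sure we are talking about the same thing, here is my reading of the 
two mechanisms in C-like pseudo-code. All names and signatures below (the 
put call, the attribute constant, the arguments of MPI_RMA_Complete) are my 
guesses, not the proposal's actual interface:

    /* Sketch only; all signatures are assumptions on my part.        */

    /* 1) Per-operation attribute: the operation does not complete    */
    /*    before the data is visible at the target.                   */
    RMA_put(src, len, target, disp, win,
            RMA_REMOTE_COMPLETION /* hypothetical attribute */);

    /* 2) Explicit call: issue operations normally, then force remote */
    /*    completion of everything issued so far at the target.       */
    RMA_put(src, len, target, disp, win, 0);
    MPI_RMA_Complete(target, win);  /* blocking? -- see my question   */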




If I'm wrong about Remote Completion, you may stop reading here ;-). 
Otherwise, the following may be a suggestion for improvement:

I found that it can be beneficial for one-sided communication to have 
separate API calls for completion and NOTIFICATION. What do I mean by 
NOTIFICATION? For example, if an application knows its "last" operation on 
the target memory (e.g. the last one of an iteration), it can tell the target 
about that fact with a NOTIFICATION. This allows remote completion to start 
as early as possible, while it does not force completion (local or remote) at 
this point. The operation is still non-blocking, so the library or network 
can decide when to perform the operation and its completion, up until a 
blocking fence or completion call is issued.
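
As a sketch of the pattern I have in mind (all names are placeholders of 
mine, not proposed API):

    /* Hypothetical sketch of a separate, non-blocking notification.   */
    for (i = 0; i < n - 1; i++)
        RMA_put(buf[i], len, target, disp + i * len, win, 0);

    RMA_put(buf[n - 1], len, target, disp + (n - 1) * len, win, 0);
    RMA_notify(target, win);  /* non-blocking: "no further operations  */
                              /* of this iteration follow"; remote     */
                              /* completion may start now              */
    compute();                /* nothing is forced yet, the library or */
                              /* network schedules freely              */
    RMA_fence(win);           /* blocking completion happens only here */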

Concerning the interface for such a notification: it is more portable to 
include the NOTIFICATION in the PUT/GET/ACC calls. Some networks perform 
better when sending a separate synchronization message (with InfiniBand it 
is better to write a small message for synchronization than to use RDMA with 
notification), while other networks perform better if the synchronization 
can be included in the header of a data message. In the latter case, a 
separate notification call would require the library to either send an extra 
message or defer the data transfer (which prevents or reduces the overlap of 
computation and communication that is one of the intentions of non-blocking 
communication). Thus, an additional attribute could be a nice way to signal 
a "last" message: it would allow the implementation to decide whether to 
inline the notification or to send an extra message.
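
Concretely, the attribute variant might look like this (the attribute name 
is again a placeholder of mine):

    /* Hypothetical: the notification is piggybacked on the last data */
    /* message via a per-operation attribute.  The library is free to */
    /* inline it into the message header or to send a separate small  */
    /* message, whichever is cheaper on the given network.            */
    RMA_put(src1, len1, target, disp1, win, 0);              /* not last */
    RMA_put(src2, len2, target, disp2, win, RMA_ATTR_LAST);  /* the last */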

If an application does not know its last operation in advance, a separate 
call like MPI_RMA_Complete comes into play (it may make sense to have a 
non-blocking version as well).
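
Such a non-blocking version could mirror the MPI_Isend/MPI_Test pattern 
(placeholder names once more):

    /* Hypothetical non-blocking remote completion with polling.      */
    RMA_request req;
    int flag = 0;
    RMA_icomplete(target, win, &req); /* start remote completion      */
    while (!flag) {
        compute_some_more();          /* overlap while it is pending  */
        RMA_test(&req, &flag);        /* analogous to MPI_Test        */
    }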

This would also make it possible to overlap the "synchronization message" 
with computation (depending on the application), just as it can be done with 
data messages. However, this may yield only a tiny improvement, since the 
synchronization message is rather short.

Benefits: Performance improvements.
Drawbacks: Somewhat more complexity in the API (attribute + non-blocking 
notification call).


I'm not sure if implementation issues are considered here. If so: 
unfortunately, this NOTIFICATION signal depends on the ordering of messages 
(for the last message). The notification has to be delivered as the last 
message, regardless of whether it is sent separately or integrated into the 
data message. However, this is a "problem" of implementations on top of 
non-ordered or multi-path networks. I have not dug into the details of this 
yet, but MPI_RMA_Complete and MPI_RMA_Fence, as well as MPI-2's 
MPI_Win_complete/fence, face a similar problem with completion in these 
environments.

Is there something I am missing?

Best regards
Lars

PS:
For more details on this, see the paper:
"Synchronization issues in the MPI-2 one-sided communication API", 
Schneidenbach, Böhme, Schnor.



