[Mpi3-rma] Remote Completion Question/Comment/Contribution
Lars Schneidenbach
lschneid at cs.uni-potsdam.de
Wed Jan 28 11:23:36 CST 2009
Hi All,
At the EuroPVM/MPI08 conference I was invited to comment on the proposal for
RMA (sorry for the delay). I presented a paper there on synchronization issues
in the MPI-2 RMA standard, so my comments will focus on synchronization in the
MPI-3 RMA proposal.
I was especially looking at the Remote Completion parts. If I understand this
correctly, there are two mechanisms to control Remote Completion:
1) The Remote Completion attribute requires remote completion before the
operation can complete ("When Remote Completion attribute is set ...,
completion guarantees that the operation has been completed on the
Target...").
2) The MPI_RMA_Complete call explicitly forces remote completion of an
operation. The proposal says: "When MPI_RMA_Complete returns, all previously
issued RMA operations are completed at the specified targets." Does this make
MPI_RMA_Complete a blocking call? Is there a way to do non-blocking
completion checks, similar to MPI_Test for send/recv?
If I'm wrong about the Remote Completion, you may stop reading here ;-).
Otherwise, the following may be a contribution for improvement:
I found that it can be beneficial for one-sided communication to have
separate API calls for completion and NOTIFICATION. What do I mean by
NOTIFICATION? For example, if an application knows its "last" operation (e.g.
of an iteration) on the target memory, it can tell the target about that fact
with a NOTIFICATION. This allows remote completion to start as early as
possible, while not forcing completion (local or remote) at that point. The
operation remains non-blocking, so the library or the network can decide when
to perform the operation and its completion, up until a blocking fence or
completion call is issued.
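As a sketch of the idea (the names below, in particular MPI_RMA_Notify, are
purely hypothetical and not part of any proposal or standard):

```c
/* Sketch only: MPI_RMA_Notify is a hypothetical call illustrating the
 * NOTIFICATION idea; it does not exist in MPI or in the MPI-3 RMA proposal. */
for (int i = 0; i < n; i++) {
    /* non-blocking one-sided updates to the target window */
    MPI_Put(&buf[i], 1, MPI_DOUBLE, target_rank,
            i /* target displacement */, 1, MPI_DOUBLE, win);
}

/* Tell the target that the last operation of this iteration has been
 * issued.  Remote completion may start as early as possible, but the
 * call itself does not block and forces no completion. */
MPI_RMA_Notify(target_rank, win);

do_local_computation();   /* overlap: transfers may still be in flight */

/* Blocking: all previously issued operations are now remotely complete. */
MPI_RMA_Complete(win);
```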
Concerning the interface for such a notification, it is more portable to
include the NOTIFICATION in the PUT/GET/ACC calls. Some networks perform
better when sending a separate synchronization message (with InfiniBand, it is
better to write a small message for synchronization than to use RDMA with
notification), while other networks perform better if the synchronization can
be included in the header of a data message. In the latter case, a separate
notification call would force the library either to send an extra message or
to defer the data transfer (which prevents or reduces the overlap of
computation and communication that is one of the intentions of non-blocking
communication). Thus, an additional attribute could be a nice way to signal a
"last" message: it would allow the implementation to decide whether to inline
the notification or to send an extra message.
If an application does not know which operation is the last one, a separate
call like MPI_RMA_Complete comes into play (it may make sense to have a
non-blocking version as well).
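The attribute-based variant could look roughly like this (again with purely
hypothetical names; the proposal defines neither this flag nor this parameter):

```c
/* Hypothetical attribute-based notification: the flag MPI_RMA_ATTR_NOTIFY
 * and the extra attribute parameter are illustrative only. */
for (int i = 0; i < n - 1; i++)
    MPI_RMA_Put(&buf[i], 1, MPI_DOUBLE, target_rank, i, 1, MPI_DOUBLE,
                0 /* no attributes */, win);

/* The last operation carries the notification.  The implementation may
 * inline it in the message header or emit a separate small message,
 * whichever the underlying network prefers. */
MPI_RMA_Put(&buf[n-1], 1, MPI_DOUBLE, target_rank, n - 1, 1, MPI_DOUBLE,
            MPI_RMA_ATTR_NOTIFY, win);
```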
This would also make it possible to overlap the "synchronization message" with
computation (depending on the application), just as with data messages.
However, this may be only a tiny improvement, since the synchronization
message is rather short.
Benefits: Performance improvements
Drawbacks: Some more complexity in the API (attribute + non-blocking
notification call)
I'm not sure whether implementation issues are considered here. If so:
unfortunately, this NOTIFICATION signal depends on message ordering (for the
last message). The notification has to be delivered last, regardless of
whether it is a separate message or integrated into the data message.
However, this is a "problem" of implementing on top of non-ordered or
multi-path networks. I have not dug into the details yet, but
MPI_RMA_Complete and MPI_RMA_Fence, as well as MPI-2 MPI_Win_complete/fence,
suffer a similar fate regarding completion in these environments.
Is there something I'm missing?
Best regards
Lars
PS:
For more details on this, the name of the paper was:
"Synchronization issues in the MPI-2 one-sided communication API."
Schneidenbach, Böhme, Schnor