[Mpi-forum] MPI 2.2 proposal: resolving MPI_Request_free issues
erezh at MICROSOFT.com
Mon Jul 14 16:50:16 CDT 2008
A set of issues related to MPI_Request_free came up at the last MPI Forum meeting. I've added an MPI 2.2 proposal for resolving those issues.
Author: Erez Haba
The MPI_Request_free mechanism was provided for reasons of performance and convenience on the sending side. However, there are a number of issues with its definition, and the advice-to-users text conflicts with many implementations.
Advice to users quote:
"Once a request is freed by a call to MPI_REQUEST_FREE, it is not possible to check for the successful completion of the associated communication with calls to MPI_WAIT or MPI_TEST. Also, if an error occurs subsequently during the communication, an error code cannot be returned to the user - such an error must be treated as fatal."
This is the only place in the MPI standard that mandates that an error be fatal, regardless of the user's settings. The error is truly unrecoverable: once MPI_Request_free has been called, the user cannot associate the error with the failed send and cannot recover from it. This poses a problem for a fault-tolerant implementation, which must handle the failure without being able to notify the user of the specific error, because the context is gone.
Advice to users quote:
"Questions arise as to how one knows when the operations have completed when using MPI_REQUEST_FREE. Depending on the program logic, there may be other ways in which the program knows that certain operations have completed and this makes usage of MPI_REQUEST_FREE practical. For example, an active send request could be freed when the logic of the program is such that the receiver sends a reply to the message sent - the arrival of the reply informs the sender that the send has completed and the send buffer can be reused."
The suggestion to reuse (free) the buffer once a reply has arrived seems straightforward; however, it is naïve and might lead to an access violation at best or, worse, data corruption. When zero copy is in use, the local interconnect resource manager might still be using the send buffer even though a reply message has been received. One alternative that enables this programming paradigm is to always copy the buffer (or the tail of the buffer) when using MPI_Isend; this prevents any misuse of the user buffer. Consider the following examples:
Example 1: TCP interconnect
Rank 0 on node A sends message x to rank 1 on node B using the following sequence,
Upon receiving message x, rank 1 sends message y back to rank 0; when rank 0 receives message y, it frees the buffer with the following sequence
This would result in an access violation (segfault) because the TCP stack still tries to touch the buffer after it has been freed. This can happen because, although node B sends back message y, it does not piggyback the TCP acknowledgment sequence numbers for message x on it. As a result, message y is consumed by the application and the buffer is freed before the acknowledgment arrives. If the TCP stack on node A then tries to resend the tail of the buffer, the result is an access violation. (Note that node A sent message x using zero copy.)
Example 2: TCP interconnect (2 connections)
To make this easier to understand, consider the same problem as above, but with two TCP connections, each delivering messages in only one direction and TCP acknowledgments in the other. This setup decouples the reply message from the TCP acknowledgment and makes the previous example easier to comprehend.
Example 3: RMA interconnect (using RMA write)
In this case rank 0 on node A issues its MPI_Isend using RDMA Write. The receiver polls on memory, detects that the write has completed, and sends a reply back. The reply message bypasses the hardware acknowledgment that the write completed successfully. Rank 0 processes the reply and the application frees the memory, which disrupts the DMA on node A and causes the send to fail.
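The three examples share one ordering problem: the application-level reply and the transport-level completion are independent events, and nothing forces the second to precede the first. The following is a minimal sketch of that ordering only, under stated assumptions. It is not MPI code; `event_t`, `EV_REPLY_ARRIVED`, `EV_TRANSPORT_DONE`, and `first_unsafe_free` are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model (not an MPI API): events observed on the sending node. */
typedef enum {
    EV_REPLY_ARRIVED,   /* application-level reply (message y) was received */
    EV_TRANSPORT_DONE   /* transport is finished with the send buffer,
                           e.g. TCP ACK or RDMA write completion for x */
} event_t;

/*
 * Walk the events in order and return the index of the first event after
 * which the reply has arrived while the transport may still be using the
 * send buffer. Freeing the buffer at that point is the hazard described
 * in the examples. Returns -1 if no such window exists.
 */
int first_unsafe_free(const event_t *events, size_t n)
{
    bool reply_seen = false;
    bool transport_done = false;

    for (size_t i = 0; i < n; i++) {
        if (events[i] == EV_REPLY_ARRIVED)
            reply_seen = true;
        else
            transport_done = true;

        if (reply_seen && !transport_done)
            return (int)i;
    }
    return -1;
}
```

In this model, the order reply-then-completion yields an unsafe window starting at event 0, while completion-then-reply yields none. The sender cannot tell the two orders apart from the reply alone, which is why the reply is not a safe signal for freeing the buffer.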
Three proposals, from the least restrictive to the most restrictive:
1. Remove the advice to users to reuse the buffer once a reply has arrived. There is no safe way to reuse (free) the buffer; overwriting it is somewhat safer.
2. Remove the advice to users altogether and disallow the usage pattern of freeing active requests. Only inactive requests (i.e., not started) may be freed.
3. Deprecate MPI_Request_free. Users can always use MPI_Wait to complete the request.
Use solution #2, as users still need to free requests that are never used; e.g., the application called MPI_Send_init but never got to start that request, so the request still needs to be freed.
MPI_Request_free was added for convenience (the user does not have to remember to free the request later) and performance (freeing the request is not on the critical path when the reply message arrives and requires processing).
Impact on Implementations
(assuming solution 2)
High quality implementations would detect a call to free an active request and return an error.
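One way to picture the check is as a small state machine over the request. The sketch below is a toy model under stated assumptions, not the code of any MPI implementation; the `model_` functions, `req_state_t`, and the `MODEL_` return codes are hypothetical names chosen for illustration.

```c
/* Hypothetical model of a request handle; names are illustrative, not MPI. */
typedef enum { REQ_INACTIVE, REQ_ACTIVE, REQ_FREED } req_state_t;

typedef struct {
    req_state_t state;
} model_request_t;

enum { MODEL_SUCCESS = 0, MODEL_ERR_REQUEST = 1 };

/* Comparable to creating a persistent request: it exists but is not started. */
void model_request_init(model_request_t *r)
{
    r->state = REQ_INACTIVE;
}

/* Start the communication; only an inactive request can be started. */
int model_request_start(model_request_t *r)
{
    if (r->state != REQ_INACTIVE)
        return MODEL_ERR_REQUEST;
    r->state = REQ_ACTIVE;
    return MODEL_SUCCESS;
}

/* Complete the communication; the request becomes inactive again. */
int model_request_wait(model_request_t *r)
{
    if (r->state == REQ_FREED)
        return MODEL_ERR_REQUEST;
    r->state = REQ_INACTIVE;
    return MODEL_SUCCESS;
}

/* Under solution 2, freeing succeeds only for an inactive request. */
int model_request_free(model_request_t *r)
{
    if (r->state != REQ_INACTIVE)
        return MODEL_ERR_REQUEST;
    r->state = REQ_FREED;
    return MODEL_SUCCESS;
}
```

In this model, a request that was initialized but never started can be freed directly, which covers the MPI_Send_init case mentioned above. A started request must first be completed before the free succeeds.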
Impact on Applications / Users
Applications must refrain from freeing a request before it is complete, which requires the application to track the request handle to completion. (This potentially breaks existing applications.)
Entry for the Change Log
Using MPI_Request_free to free an active request has been disallowed.