[Mpi-22] [Mpi-forum] MPI 2.2 proposal:resolving MPI_Request_free issues

Wed Jul 16 18:31:42 CDT 2008

>> I don't quite understand examples 1 and 2 (how would they cause segv's
>> in the TCP stack).  It is permissible to (pseudocode):
>>
>>   while (bytes_to_send > 0) {
>>      rc = write(fd, buffer, bytes_to_send);
>>      if (rc > 0) {
>>         buffer += rc;
>>         bytes_to_send -= rc;
>>      } else {
>>         ...error...
>>      }
>>   }
>>   free(buffer);
>>
>> regardless of what the receiver does.  I'm not a kernel guy; does
>> updating TCP sequence numbers also interact with the payload buffer?
>
>[erezh] It will never happen with your code above, but you are not using
>async zcopy.
>The pattern in Windows is to use overlapped send (write), which is still
>active when the function returns and is the most efficient way to send
>your buffer. I know it's possible with Linux but I don't have the exact
>pattern.
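
For reference, a compilable C rendering of the blocking loop quoted above
might look like the sketch below. The names write_all and demo are
hypothetical, and a pipe stands in for the TCP socket so the snippet needs
no network setup. It only illustrates the point already agreed on: with a
plain blocking write(), the kernel has taken its own copy of the bytes by
the time the call returns, so the free() is safe regardless of what the
reader does. It says nothing about overlapped or zero-copy sends.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: push all len bytes of buf through fd using
 * blocking write(2). Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t rc = write(fd, buf, len);
        if (rc > 0) {
            buf += rc;              /* skip what the kernel accepted */
            len -= (size_t)rc;
        } else if (rc < 0 && errno == EINTR) {
            continue;               /* interrupted by a signal: retry */
        } else {
            return -1;              /* real error (or unexpected 0) */
        }
    }
    return 0;
}

/* Hypothetical demo: write a heap buffer into a pipe, free it, and only
 * then read the data back. Returns 0 if the bytes arrive intact. */
static int demo(void)
{
    static const char msg[] = "payload";
    char out[sizeof msg] = {0};
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    char *buf = malloc(sizeof msg);
    if (buf == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    memcpy(buf, msg, sizeof msg);

    int rc = write_all(fds[1], buf, sizeof msg);
    free(buf);                      /* safe: the kernel holds its own copy */
    close(fds[1]);

    ssize_t n = read(fds[0], out, sizeof out);
    close(fds[0]);

    if (rc != 0 || n != (ssize_t)sizeof msg)
        return -1;
    return strcmp(out, msg) == 0 ? 0 : -1;
}
```

Calling demo() from a main() should return 0: the reader still sees the
full message even though the sender's buffer was freed before the read.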

So, how does this work for other users of zero copy?  How do they know
when the send buffer is truly free?

>> FWIW: I can see the RMA interconnect example much easier.  You can
>> imagine a scenario where a sender successfully sends and the receiver
>> successfully receives, but the hardware ACK from the receiver gets
>> lost.  The receiver then sends an MPI message back to the sender, but
>> the sender is still in the middle of a retransmit timeout (while
>> waiting for the hardware ACK that was lost).  In this case, the user
>> app may free the buffer too soon, resulting in a segv (or some other
>> lion, tiger, or bear) when the sending hardware tries to retransmit.
>>
>
>[erezh] Correct; this is the scenario I was describing with the RDMA
>write.

It would be interesting to see exactly what the error mode here is.
Retransmitting corrupted data should be ok, since a correctly delivered
message means that the retransmit must be dropped.  I suppose that if
the NIC speaks virtual addresses and the free actually results in a trap
to the kernel that unmaps the pages, then the NIC could retransmit and
find that there isn't a valid page table entry...  

>> MPI_Request_free() is used in application programs. For example, it is
>> the easiest (and portable) way to send a non-blocking acknowledge
>> to a destination process.
>
>[erezh] See Adam Moody's proposal for MPI_REQUEST_IGNORE; I think it's
>safer than the current pattern
>
>>
>>      MPI_Isend (buf, 0, MPI_INT, dest, TAG_ACK, comm, lrequest)
>>      MPI_Request_free (lrequest)

Hrm.  This example doesn't seem to trigger Erez's concern... That's a
zero length buffer, right?  

Solution #5:  Change the advice to users - "...the arrival of the reply
informs the sender that the send has completed and the send buffer can
be overwritten.  If the buffer will ever be freed, the application
should call MPI_Wait or MPI_Cancel instead of MPI_Request_free."  

If you have received your application level ack from the remote side,
MPI_Cancel should give you the equivalent functionality of
MPI_Request_free - allowing you to release the request without having to
wait for local completion.  I'm not entirely sure WHY you would want to
do this, but... ;-)

Keith