[Mpi3-ft] Problem with reusing rendezvous memory buffers

Richard Graham richardg at mellanox.com
Tue Jul 30 15:58:01 CDT 2013


This is an implementation issue, not a standards issue.  Network hardware often has protections in place to prevent such a scenario from happening.  Otherwise, much more than rendezvous protocols would run into issues under a host of failure scenarios.  The fact that two processes may have mapped the same physical pages does not mean that, if one process dies, the others can use the access rights that were granted to that process.

In the case of InfiniBand, several things beyond just the mkey are needed for access.

Rich

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Wesley Bland
Sent: Tuesday, July 30, 2013 3:43 PM
To: MPI3-FT Working Group
Subject: [Mpi3-ft] Problem with reusing rendezvous memory buffers

Pavan pointed out a problem to me yesterday related to memory buffers used with rendezvous protocols. If a process passes a piece of memory to the library in an MPI_RECV and the library gives that memory to the hardware, where it is pinned, we can get into trouble if one of the processes that could write into that memory fails. The problem comes from a process sending a slow message and then dying. It is possible that the other processes could detect and handle the failure before the slow message arrives. Then when the message does arrive, it could corrupt the memory without the application having a way to handle this. My whiteboard example is attached as an image.
 [cid:image001.jpg at 01CE8D3E.7F98C450]

We can't just unmap memory from the NIC when a failure occurs because that memory is still being used by another process's message. Some hardware supports unmapping memory for specific senders which would solve this issue, but some don't, such as InfiniBand, where the memory region just has a key and unmapping it removes it for all senders.
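
To make the trade-off concrete, here is a minimal toy model of the two hardware behaviours described above. It is only a sketch and not real NIC or verbs code: the class `PinnedBuffer`, its methods, the function `run`, and the sender names "A" and "B" are all hypothetical, and the "single key" mode models only the simplified description given in this message.

```python
from dataclasses import dataclass, field


@dataclass
class PinnedBuffer:
    """Toy stand-in for a receive buffer registered with the NIC (hypothetical)."""
    data: list
    per_sender_revocation: bool   # True: hardware can unmap for one specific sender
    key_valid: bool = True        # False once the single shared key is invalidated
    revoked: set = field(default_factory=set)

    def revoke(self, sender):
        """Cut off a failed sender; without per-sender support this cuts off everyone."""
        if self.per_sender_revocation:
            self.revoked.add(sender)
        else:
            self.key_valid = False

    def remote_write(self, sender, offset, value):
        """Return True if the write landed in the buffer, False if it was rejected."""
        if not self.key_valid or sender in self.revoked:
            return False
        self.data[offset] = value
        return True


def run(per_sender, revoke_on_failure):
    """Sender A dies with a slow message in flight; sender B is alive and still sending."""
    buf = PinnedBuffer(data=[0, 0, 0, 0], per_sender_revocation=per_sender)
    if revoke_on_failure:
        buf.revoke("A")                    # failure of A detected and handled
    late = buf.remote_write("A", 0, 99)    # A's slow message finally arrives
    live = buf.remote_write("B", 1, 7)     # B's legitimate message arrives
    return late, live, buf.data


print(run(True, True))    # per-sender unmap: A rejected, B delivered
print(run(False, False))  # single key, left mapped: A's late write lands
print(run(False, True))   # single key, unmapped: B is rejected as well
```

In this model only the first case both rejects the late write and still delivers B's message, which is the dilemma described above.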

I haven't come up with a good solution to this problem, but I do have a workable one. We would need to introduce another error code (something like MPI_ERR_BUFFER_UNUSABLE) that tells the application that the buffer the library was using is no longer usable because it might be corrupted. On hardware that can unmap memory for a specific sender, this error would never have to be returned, but on hardware that cannot, the library could return it to the application to say that it needs a new buffer in order to complete the operation. On the sender side, the operation would probably complete successfully since, from its point of view, the memory was still available. That means some rollback will be necessary, but that's up to the application to figure out.
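
As a rough sketch of what handling this might look like from the application side, the toy code below retries a receive with a fresh buffer whenever the proposed error comes back. Nothing here is real MPI: MPI_ERR_BUFFER_UNUSABLE is only the name suggested above and its numeric value is arbitrary, and `toy_recv`, `recv_with_fresh_buffer`, and the `attempts` list (which stands in for the sender-side rollback and resend) are hypothetical.

```python
MPI_SUCCESS = 0
MPI_ERR_BUFFER_UNUSABLE = 1  # hypothetical: proposed name only, arbitrary value


def toy_recv(buf, message, sender_failed_midway):
    """Stand-in for a receive call (not a real MPI binding).

    Reports the buffer as unusable when a failed sender might still write
    into it; otherwise copies the message into the buffer.
    """
    if len(message) > len(buf):
        raise ValueError("message does not fit in the receive buffer")
    if sender_failed_midway:
        return MPI_ERR_BUFFER_UNUSABLE
    buf[:len(message)] = message
    return MPI_SUCCESS


def recv_with_fresh_buffer(size, attempts):
    """Abandon a suspect buffer and retry the receive with a new one.

    `attempts` is a list of (message, sender_failed_midway) pairs, one per
    try; each later entry models the message being sent again after rollback.
    Returns (buffer or None, number of buffers abandoned).
    """
    abandoned = 0
    for message, failed in attempts:
        buf = bytearray(size)          # never reuse a possibly corrupted buffer
        rc = toy_recv(buf, message, failed)
        if rc == MPI_SUCCESS:
            return buf, abandoned
        abandoned += 1                 # leave the old buffer alone and move on
    return None, abandoned


buf, abandoned = recv_with_fresh_buffer(4, [(b"old!", True), (b"new!", False)])
print(bytes(buf), abandoned)
```

The point of the sketch is only that the application, not the library, decides where the replacement buffer comes from and how the lost data gets sent again.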

I know this is an expensive and painful solution, but this is all I've come up with so far. Thoughts from the group?

Thanks,
Wesley
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.jpg
Type: image/jpeg
Size: 327925 bytes
Desc: image001.jpg
URL: <http://lists.mpi-forum.org/pipermail/mpiwg-ft/attachments/20130730/02ab0f89/attachment-0001.jpg>

