[Mpi3-ft] Problem with reusing rendezvous memory buffers

Pavan Balaji balaji at mcs.anl.gov
Tue Jul 30 15:47:53 CDT 2013

Hmm.  Yes, you are right.  Generating different per-process rkeys is an
option on IB, though it is obviously less scalable than a single rkey
and hurts performance too, because a new rkey has to be generated for
each process.  That is even more of a reason for FT to have a
requested/provided option, like the thread-support levels in
MPI_Init_thread.
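
To make the requested/provided idea concrete, here is a minimal sketch of
how such a negotiation might look. MPI_Init_thread really does take a
required level and return a provided level; everything below (the ft_level
names and ft_negotiate) is hypothetical and not part of MPI. It is also a
simplification: it never grants more than was requested.

```c
#include <assert.h>

/* Hypothetical fault-tolerance levels, ordered weakest to strongest.
   These names are illustrative only; they are not part of MPI. */
typedef enum {
    FT_NONE = 0,        /* no fault tolerance */
    FT_DETECT = 1,      /* failures are reported; buffers may be corrupted */
    FT_BUFFER_SAFE = 2  /* late writes from failed senders are blocked */
} ft_level;

/* Hypothetical negotiation: grant the requested level if the network
   can support it, otherwise the strongest level the network supports. */
ft_level ft_negotiate(ft_level requested, ft_level network_max)
{
    assert(requested <= FT_BUFFER_SAFE);
    assert(network_max <= FT_BUFFER_SAFE);
    return requested <= network_max ? requested : network_max;
}
```

An application would compare the returned level with what it asked for,
the same way it checks the provided thread level today, and fall back to
its own recovery scheme when the network cannot give it what it wanted.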

However, please also think about this problem for other networks that 
might not have such hardware protection capabilities (K Computer comes 
to mind).  Maybe they cannot provide MPI-specified FT, and that would be 

  -- Pavan

On 07/30/2013 02:59 PM, Sur, Sayantan wrote:
> Hi Wesley,
>
> Looks like your attachment didn’t make it through. Using IB, one can
> generate rkeys for each sender and just invalidate the key for the
> observed failed process. HW can just drop the “slow” message when it
> arrives. I’m assuming that generating keys should be fast in the future
> given that recently announced HW/firmware has support for on-demand
> registration. In any case, it is not a restriction of IB per se.
>
> Thanks,
> Sayantan
>
> From: mpi3-ft-bounces at lists.mpi-forum.org
> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Wesley Bland
> Sent: Tuesday, July 30, 2013 11:04 AM
> To: MPI3-FT Working Group
> Subject: [Mpi3-ft] Problem with reusing rendezvous memory buffers
>
> Pavan pointed out a problem to me yesterday related to memory buffers
> used with rendezvous protocols. If a process passes a piece of memory to
> the library in an MPI_RECV and the library gives that memory to the
> hardware, where it is pinned, we can get into trouble if one of the
> processes that could write into that memory fails. The problem comes
> from a process sending a slow message and then dying. It is possible
> that the other processes could detect and handle the failure before the
> slow message arrives. Then when the message does arrive, it could
> corrupt the memory without the application having a way to handle this.
> My whiteboard example is attached as an image.
>
> We can't just unmap memory from the NIC when a failure occurs because
> that memory is still being used by another process's message. Some
> hardware supports unmapping memory for specific senders which would
> solve this issue, but some don't, such as InfiniBand, where the memory
> region just has a key and unmapping it removes it for all senders.
>
> This problem doesn't have a good solution that I've found, but I did
> come up with a workable one. We would need to introduce another error
> code (something like MPI_ERR_BUFFER_UNUSABLE) that tells the application
> that the buffer the library was using is no longer usable because it
> might be corrupted. On hardware that can revoke access for a specific
> sender, this error would never have to be returned, but on hardware that
> cannot, the library could pass this error to the application to say that
> it needs a new buffer in order to complete the operation. On the sender
> side, the operation would probably complete successfully since, from its
> point of view, the memory was still available. That means that some
> rollback will be necessary, but that's up to the application to figure
> out.
>
> I know this is an expensive and painful solution, but this is all I've
> come up with so far. Thoughts from the group?
>
> Thanks,
> Wesley
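
The two approaches discussed in the quoted thread can be sketched as a
small simulation: per-sender keys where only the failed sender's key is
revoked, and the fallback where the receiver is told its buffer is
unusable. This is a model under stated assumptions, not real verbs code.
The buffer_reg structure, the reg_* functions, the key values, and the
recv_status codes are all hypothetical; RECV_BUFFER_UNUSABLE only stands in
for the proposed MPI_ERR_BUFFER_UNUSABLE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SENDERS 8

/* Simulated registration of one receive buffer (not a real verbs type). */
typedef struct {
    uint32_t rkey[MAX_SENDERS];   /* one key per sender rank */
    bool     valid[MAX_SENDERS];  /* false once that sender's key is revoked */
    int      nsenders;
} buffer_reg;

/* Hand out a distinct key to each sender; the values are arbitrary. */
void reg_init(buffer_reg *reg, int nsenders)
{
    assert(nsenders > 0 && nsenders <= MAX_SENDERS);
    reg->nsenders = nsenders;
    for (int i = 0; i < nsenders; i++) {
        reg->rkey[i] = 0x1000u + (uint32_t)i;
        reg->valid[i] = true;
    }
}

/* Called when `rank` is detected as failed: revoke only its key. */
void reg_invalidate(buffer_reg *reg, int rank)
{
    assert(rank >= 0 && rank < reg->nsenders);
    reg->valid[rank] = false;
}

/* Models the NIC check on an incoming remote write.
   Returns true if the write is accepted, false if it is dropped. */
bool reg_accept_write(const buffer_reg *reg, int rank, uint32_t key)
{
    if (rank < 0 || rank >= reg->nsenders)
        return false;
    return reg->valid[rank] && reg->rkey[rank] == key;
}

/* Hypothetical completion codes for the receive side. */
typedef enum { RECV_OK = 0, RECV_BUFFER_UNUSABLE = 1 } recv_status;

/* Without per-sender revocation a late write cannot be blocked, so the
   receiver has to be told that its buffer may be corrupted. */
recv_status recv_status_after_failure(bool per_sender_revocation)
{
    return per_sender_revocation ? RECV_OK : RECV_BUFFER_UNUSABLE;
}
```

With per-sender keys, a slow message from a failed rank is dropped while
the surviving senders keep writing. The cost is one key per sender per
buffer, which is the scalability concern raised in the reply above.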

