[Mpi3-ft] Problem with reusing rendezvous memory buffers

Wed Jul 31 12:25:22 CDT 2013

On Wednesday, July 31, 2013 at 11:50 AM, George Bosilca wrote:

> On Tue, Jul 30, 2013 at 10:47 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> >  
> > Hmm. Yes, you are right. Generating different per-process rkeys is an
> > option on IB. Though that's obviously less scalable than a single rkey and
> > hurts performance too because a new rkey has to be generated for each
> > process. Even more of a reason for FT to have a requested/provided option
> > like threads.
> >  
>  
>  
> I fail to understand the scenario that leads you to such a
> conclusion. You want to have an MPI_RECV with a buffer where multiple
> senders can do RMA operations. The only way this can be done in the
> context of the MPI standard is if each of the receives on this
> particular buffer is using a non-contiguous datatype. Thus, unlike
> what you suggest in your answer above, this is not hurting performance
> as you are already in a niche mode (I'm not even talking about the
> fact that non-contiguous datatypes usually conflict with RMA
> operations). Moreover, you suppose that the detection of a dead
> process and the re-posting of the receive buffer can happen faster
> than an RMA message can cross the network. The only potential case where
> such a scenario can happen is when multiple paths between the source
> and the destination exist, and the failure detection happens on one
> path while the RMA message takes another one. This is highly improbable
> in most cases.
>  
>  

This isn't necessarily about RMA. This can happen with send/receive as well. I don't disagree that this scenario is unlikely, but it could result in a case where the user ends up with bad data and can't do anything about it.  
>  
> There are too many ifs in this scenario to make it plausible. Even if
> we suppose that all those ifs will be true, as Rich said, this is an
> issue of [quality of] implementation, not of the MPI standard. A high quality
> MPI implementation will delay reporting the process failure error on
> that particular MPI_RECV until all possible RMA operations from the dead
> process have either been discarded by the network or written to memory.
>  
> >  
> > However, please also think about this problem for other networks that might
> > not have such hardware protection capabilities (K Computer comes to mind).
> >  
>  
>  
> K computer? My understanding is that there are such capabilities in
> the TOFU network. I might be wrong though, in which case I would
> definitely appreciate it if you could point me to a
> link/documentation that supports your point.
>  
> > Maybe they cannot provide MPI-specified FT, and that would be fine.
>  
> Not really, FT can be supported without overhead for the normal
> execution even for the types of network you mention. The solution I
> presented above uses the timeouts of the network layer to ensure no
> delivery can occur after the error reporting, by delaying the error
> reporting until all timeouts have expired. Trivial to implement, and
> without impact on the normal execution path.
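
To make the delayed-reporting idea above concrete, here is a minimal sketch in C. It is not taken from any MPI implementation: the struct, the field names, and the single per-receive timeout bound are all assumptions made for illustration. The point is only that the failure is surfaced to the application once the network layer's timeout has elapsed, so that no late delivery can land in the buffer after the error is returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one posted receive (not a real MPI structure). */
typedef struct {
    bool     peer_failed;      /* failure of the sender has been detected      */
    uint64_t failure_seen_us;  /* time at which the failure was detected       */
    uint64_t net_timeout_us;   /* assumed upper bound on how long a message    */
                               /* from the dead peer can stay in the network   */
} recv_state_t;

/* Earliest time at which the failure may be reported to the application. */
static uint64_t report_deadline_us(const recv_state_t *r)
{
    assert(r->peer_failed);
    return r->failure_seen_us + r->net_timeout_us;
}

/* Called from the progress loop: true once the receive can safely be
 * completed with a process-failure error. The normal (no failure) path
 * costs a single branch. */
static bool may_report_failure(const recv_state_t *r, uint64_t now_us)
{
    if (!r->peer_failed)
        return false;
    return now_us >= report_deadline_us(r);
}
```

With a failure detected at t = 1000 us and an assumed 500 us network timeout, the error would be held back until t = 1500 us; a receive whose peer has not failed is never affected.
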
>  
> Thanks,
> George.
>  
>  
> >  
> > -- Pavan
> >  
> >  
> > On 07/30/2013 02:59 PM, Sur, Sayantan wrote:
> > >  
> > > Hi Wesley,
> > >  
> > > Looks like your attachment didn’t make it through. Using IB, one can
> > > generate rkeys for each sender and just invalidate the key for the
> > > observed failed process. HW can just drop the “slow” message when it
> > > arrives. I’m assuming that generating keys should be fast in the future
> > > given that recently announced HW/firmware has support for on-demand
> > > registration. In any case, it is not a restriction of IB per se.
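
The per-sender key scheme described above can be modelled in a few lines of C. This is a toy model under stated assumptions, not InfiniBand verbs code: the type `keyed_buffer_t`, the functions, and the way keys are numbered are all hypothetical, and real rkeys are opaque values handed out by the hardware. It only shows the logic: one key per sender, and invalidating the failed sender's key makes a late write from that sender get dropped while the other senders are unaffected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SENDERS 8

/* Hypothetical model of one registered buffer with a distinct key per sender. */
typedef struct {
    uint32_t key[MAX_SENDERS];    /* key handed to each sender                */
    bool     valid[MAX_SENDERS];  /* false once that sender's key is revoked  */
} keyed_buffer_t;

/* Give every sender its own key (illustrative numbering only). */
static void kb_init(keyed_buffer_t *kb, uint32_t base_key)
{
    for (int s = 0; s < MAX_SENDERS; s++) {
        kb->key[s]   = base_key + (uint32_t)s;
        kb->valid[s] = true;
    }
}

/* Revoke the key of a sender that was observed to have failed. */
static void kb_invalidate(keyed_buffer_t *kb, int sender)
{
    assert(sender >= 0 && sender < MAX_SENDERS);
    kb->valid[sender] = false;
}

/* Target-side check: a write is accepted only if it carries the sender's
 * own key and that key has not been revoked. */
static bool kb_accept_write(const keyed_buffer_t *kb, int sender, uint32_t key)
{
    assert(sender >= 0 && sender < MAX_SENDERS);
    return kb->valid[sender] && kb->key[sender] == key;
}
```

In this model, after `kb_invalidate(&kb, 3)` a "slow" write from sender 3 is rejected, while sender 4 can still write with its own key. The cost Pavan mentions shows up here as the per-sender table that a single shared key would not need.
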
> > >  
> > > Thanks,
> > >  
> > > Sayantan
> > >  
> > > > *From:* mpi3-ft-bounces at lists.mpi-forum.org
> > > > *On Behalf Of* Wesley Bland
> > > *Sent:* Tuesday, July 30, 2013 11:04 AM
> > > *To:* MPI3-FT Working Group
> > > *Subject:* [Mpi3-ft] Problem with reusing rendezvous memory buffers
> > >  
> > >  
> > > Pavan pointed out a problem to me yesterday related to memory buffers
> > > used with rendezvous protocols. If a process passes a piece of memory to
> > > the library in an MPI_RECV and the library gives that memory to the
> > > hardware, where it is pinned, we can get into trouble if one of the
> > > processes that could write into that memory fails. The problem comes
> > > from a process sending a slow message and then dying. It is possible
> > > that the other processes could detect and handle the failure before the
> > > slow message arrives. Then when the message does arrive, it could
> > > corrupt the memory without the application having a way to handle this.
> > > My whiteboard example is attached as an image.
> > >  
> > > We can't just unmap memory from the NIC when a failure occurs because
> > > that memory is still being used by another process's message. Some
> > > hardware supports unmapping memory for specific senders which would
> > > solve this issue, but some don't, such as InfiniBand, where the memory
> > > region just has a key and unmapping it removes it for all senders.
> > >  
> > > > I have not come up with a good solution to this problem, but I did
> > > > come up with a workable one. We would need to introduce another error
> > > > code (something like MPI_ERR_BUFFER_UNUSABLE) that would tell the
> > > > application that the buffer the library was using is no longer usable
> > > > because it might be corrupted. For hardware that can unmap memory for
> > > > specific senders, this wouldn't have to be returned, but for hardware
> > > > where that isn't possible, the library could pass this error to the
> > > > application to say that it needs a new buffer in order to complete
> > > > this operation. On the sender side, the operation would probably
> > > > complete successfully since, from its point of view, the memory was
> > > > still available. That means that some rollback will be necessary,
> > > > but that's up to the application to figure out.
> > >  
> > > I know this is an expensive and painful solution, but this is all I've
> > > come up with so far. Thoughts from the group?
> > >  
> > > Thanks,
> > >  
> > > Wesley
> > >  
> > >  
> > >  
> > > _______________________________________________
> > > mpi3-ft mailing list
> > > > mpi3-ft at lists.mpi-forum.org
> > > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> > >  
> >  
> >  
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> >  
> >  
>  
>  
>  
>  
