[Mpi3-ft] Problem with reusing rendezvous memory buffers

Wesley Bland wbland at mcs.anl.gov
Fri Aug 2 08:15:54 CDT 2013


I think you're right. We just need to make sure we're careful. In that case, I withdraw my suggestion.

On Thursday, August 1, 2013 at 5:09 AM, George Bosilca wrote:

> On Wed, Jul 31, 2013 at 7:28 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> > George,
> >  
> > I'm not talking about RMA here. I'm just talking about send/recv. I'm also
> > not talking about simultaneous receives posted on the same buffer; they are
> > separated by failure notifications, COMM_AGREE, or whatever else you need. I
> > think Sayantan got the point Wesley mentioned, and his solution is correct
> > (assuming enough support from the network to do so), though I'm not
> > convinced it doesn't add overhead. Sayantan should comment on this.
> >  
>  
>  
> If you're not talking about RMA, then Sayantan is absolutely right: this is a
> non-issue, as any message from a process considered dead should be
> discarded. In fact, the current ULFM implementation already takes such
> cases into account, since they can arise in any multi-rail situation.
>  
> George.
>  
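
A minimal sketch of the discard rule George describes, assuming an
implementation-internal failure detector; the names below (frag_t,
proc_is_failed, match_and_deliver) are hypothetical placeholders, not ULFM's
actual internals:

#include <stdbool.h>
#include <stddef.h>

/* One incoming fragment as seen by the progress engine (hypothetical type). */
typedef struct {
    int         src_rank;
    const void *payload;
    size_t      len;
} frag_t;

extern bool proc_is_failed(int rank);              /* failure-detector state  */
extern void match_and_deliver(const frag_t *frag); /* normal matching path    */

/* Before matching, drop anything coming from a process already declared
 * dead -- e.g. a late fragment arriving on a second rail. */
void handle_incoming_fragment(const frag_t *frag)
{
    if (proc_is_failed(frag->src_rank))
        return;                 /* silently discard, never touch user buffers */
    match_and_deliver(frag);
}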
> > With respect to the K Computer, I don't have a link. This was my
> > understanding from what the Fujitsu folks mentioned (and was the reason why
> > they didn't want to release their low-level API: since the hardware has no
> > protection, anyone could write to anyone's memory). I'm not trying to argue
> > that the K computer is not good enough in any way; that was just an
> > example. My point was only that you should consider that not all networks
> > have such protection capabilities.
> >  
> > -- Pavan
> >  
> >  
> > On 07/31/2013 11:50 AM, George Bosilca wrote:
> > >  
> > > On Tue, Jul 30, 2013 at 10:47 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> > > >  
> > > >  
> > > > Hmm. Yes, you are right. Generating different per-process rkeys is an
> > > > option on IB, though that's obviously less scalable than a single rkey
> > > > and hurts performance too, because a new rkey has to be generated for
> > > > each process. Even more of a reason for FT to have a requested/provided
> > > > option like threads.
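
For context, the requested/provided pattern Pavan alludes to is the one
MPI_Init_thread already uses; a hypothetical FT capability negotiation would
presumably take the same shape. Only MPI_Init_thread below is real; the FT
query mentioned in the comment is invented purely for illustration:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Existing pattern: request a thread level, receive the level the
     * implementation can actually deliver. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "requested MPI_THREAD_MULTIPLE, got level %d\n", provided);

    /* A hypothetical FT analogue (not part of any MPI standard) might look
     * like: MPIX_Ft_init(FT_BUFFER_PROTECTION, &ft_provided);
     * letting an implementation on a network without per-sender memory
     * protection report a weaker guarantee instead of silently corrupting
     * buffers. */

    MPI_Finalize();
    return 0;
}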
> > > >  
> > >  
> > >  
> > >  
> > > I fail to understand the scenario allowing you to reach such a
> > > conclusion. You want to have an MPI_RECV with a buffer where multiple
> > > senders can do RMA operations. The only way this can be done in the
> > > context of the MPI standard is if each of the receives on this
> > > particular buffer uses non-contiguous datatypes. Thus, unlike what you
> > > suggest in your answer above, this does not hurt performance, as you are
> > > already in a niche mode (I'm not even talking about the fact that
> > > non-contiguous datatypes usually conflict with RMA operations).
> > > Moreover, you suppose that the detection of a dead process and the
> > > re-posting of the receive buffer can happen faster than an RMA message
> > > crosses the network. The only potential case where such a scenario can
> > > happen is when multiple paths between the source and the destination
> > > exist, and the failure detection happens on one path while the RMA
> > > message took another one. This is highly improbable in most cases.
> > >  
> > > There are too many ifs in this scenario to make it plausible. Even if
> > > we suppose that all those ifs hold, as Rich said, this is an issue of
> > > [quality of] implementation, not of the MPI standard. A high-quality
> > > MPI implementation will delay reporting the process-failure error on
> > > that particular MPI_RECV until all possible RMA operations from the
> > > dead process have either been discarded by the network or written to
> > > memory.
> > >  
> > > >  
> > > > However, please also think about this problem for other networks that
> > > > might
> > > > not have such hardware protection capabilities (K Computer comes to
> > > > mind).
> > > >  
> > >  
> > >  
> > >  
> > > The K computer? My understanding is that there are such capabilities in
> > > the TOFU network. I might be wrong, though, in which case I would
> > > definitely appreciate it if you could point me to a link or some
> > > documentation that proves your point.
> > >  
> > > > Maybe they cannot provide MPI-specified FT, and that would be fine.
> > >  
> > >  
> > > Not really. FT can be supported without overhead for the normal
> > > execution even for the types of network you mention. The solution I
> > > presented above uses the timeouts of the network layer to ensure no
> > > delivery can occur after the error reporting, by delaying the error
> > > reporting until all timeouts have occurred. Trivial to implement, and
> > > without impact on the normal execution path.
> > >  
> > > Thanks,
> > > George.
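
A rough sketch of the delayed-reporting idea George outlines, assuming the
transport exposes its retry/timeout state; every name here (pending_recv_t,
net_timeout_expired, now_seconds) is a placeholder, not an existing API:

#include <stdbool.h>

extern bool   net_timeout_expired(int rank);  /* transport retry/timeout state */
extern double now_seconds(void);

typedef struct {
    int    peer;             /* rank the receive was matched against          */
    double failure_time;     /* when the failure detector flagged the peer    */
    bool   error_reported;   /* has the process-failure error been delivered? */
} pending_recv_t;

/* Called from the progress loop: only complete the receive with an error
 * once no in-flight message from the dead peer can still reach the buffer. */
void progress_failed_recv(pending_recv_t *req, double max_delivery_timeout)
{
    if (req->error_reported)
        return;
    if (net_timeout_expired(req->peer) ||
        now_seconds() - req->failure_time > max_delivery_timeout) {
        req->error_reported = true;   /* buffer can now be handed back safely */
    }
}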
> > >  
> > >  
> > > >  
> > > > -- Pavan
> > > >  
> > > >  
> > > > On 07/30/2013 02:59 PM, Sur, Sayantan wrote:
> > > > >  
> > > > >  
> > > > > Hi Wesley,
> > > > >  
> > > > > Looks like your attachment didn’t make it through. Using IB, one can
> > > > > generate rkeys for each sender and invalidate only the key for the
> > > > > observed failed process; the HW can then drop the “slow” message when it
> > > > > arrives. I’m assuming that generating keys should be fast in the future,
> > > > > given that recently announced HW/firmware has support for on-demand
> > > > > registration. In any case, it is not a restriction of IB per se.
> > > > >  
> > > > > Thanks,
> > > > >  
> > > > > Sayantan
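
In ibverbs terms, Sayantan's suggestion could look roughly like the sketch
below: register the receive buffer once per sender so each gets its own rkey,
and deregister only the failed sender's region so the HCA rejects its late
write. Connection setup, error handling, and the MAX_SENDERS bound are all
simplifications:

#include <infiniband/verbs.h>
#include <stddef.h>

#define MAX_SENDERS 64                       /* illustration only */

static struct ibv_mr *per_sender_mr[MAX_SENDERS];

/* Register the same buffer once per sender; each registration has its own
 * rkey, which is what that sender uses for its rendezvous RDMA write. */
int register_per_sender(struct ibv_pd *pd, void *buf, size_t len, int nsenders)
{
    for (int i = 0; i < nsenders; i++) {
        per_sender_mr[i] = ibv_reg_mr(pd, buf, len,
                                      IBV_ACCESS_LOCAL_WRITE |
                                      IBV_ACCESS_REMOTE_WRITE);
        if (!per_sender_mr[i])
            return -1;
    }
    return 0;
}

/* On failure of one sender, invalidate only its rkey: a slow message using
 * that key is then dropped by the hardware, while the other senders'
 * registrations (and the buffer itself) remain usable. */
void invalidate_failed_sender(int rank)
{
    if (rank >= 0 && rank < MAX_SENDERS && per_sender_mr[rank]) {
        ibv_dereg_mr(per_sender_mr[rank]);
        per_sender_mr[rank] = NULL;
    }
}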
> > > > >  
> > > > > *From:* mpi3-ft-bounces at lists.mpi-forum.org *On Behalf Of* Wesley Bland
> > > > > *Sent:* Tuesday, July 30, 2013 11:04 AM
> > > > > *To:* MPI3-FT Working Group
> > > > > *Subject:* [Mpi3-ft] Problem with reusing rendezvous memory buffers
> > > > >  
> > > > >  
> > > > > Pavan pointed out a problem to me yesterday related to memory buffers
> > > > > used with rendezvous protocols. If a process passes a piece of memory to
> > > > > the library in an MPI_RECV and the library gives that memory to the
> > > > > hardware, where it is pinned, we can get into trouble if one of the
> > > > > processes that could write into that memory fails. The problem comes
> > > > > from a process sending a slow message and then dying. It is possible
> > > > > that the other processes could detect and handle the failure before the
> > > > > slow message arrives. Then when the message does arrive, it could
> > > > > corrupt the memory without the application having a way to handle this.
> > > > > My whiteboard example is attached as an image.
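
Since the whiteboard image did not survive the archive, here is the scenario
reconstructed from the prose as a two-rank sketch; the MPI calls are real,
and the numbered comments describe what a rendezvous/RDMA implementation
does underneath (the failure injection itself is not shown):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define LEN (1 << 20)   /* large enough that most implementations use rendezvous */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(LEN);

    if (rank == 0) {
        memset(buf, 1, LEN);
        /* Sender: in the failure scenario, this process dies after the
         * rendezvous handshake but before its RDMA write finishes. */
        MPI_Send(buf, LEN, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        /* Receiver: buf is pinned and its address handed to rank 0.
         * 1. rank 0 starts a slow RDMA write into buf, then dies;
         * 2. the failure is detected and handled before the write arrives;
         * 3. the application reuses buf for something else;
         * 4. the late write finally lands and silently corrupts buf. */
        MPI_Recv(buf, LEN, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}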
> > > > >  
> > > > > We can't just unmap memory from the NIC when a failure occurs because
> > > > > that memory is still being used by another process's message. Some
> > > > > hardware supports unmapping memory for specific senders which would
> > > > > solve this issue, but some don't, such as InfiniBand, where the memory
> > > > > region just has a key and unmapping it removes it for all senders.
> > > > >  
> > > > > I haven't come up with a good solution to this problem, but here is one
> > > > > option. We would need to introduce another error code (something like
> > > > > MPI_ERR_BUFFER_UNUSABLE) to tell the application that the buffer the
> > > > > library was using is no longer usable because it might be corrupted. For
> > > > > some hardware this would never have to be returned, but for hardware
> > > > > where protecting the buffer isn't possible, the library could return
> > > > > this error to the application to say that it needs a new buffer in order
> > > > > to complete the operation. On the sender side, the operation would
> > > > > probably complete successfully, since from its point of view the memory
> > > > > was still available. That means some rollback will be necessary, but
> > > > > that's up to the application to figure out.
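
A sketch of how an application might react to the error code proposed here.
MPI_ERR_BUFFER_UNUSABLE is the hypothetical code from this mail (defined
locally for illustration), and the communicator's error handler must be
MPI_ERRORS_RETURN for MPI_Recv to return it at all:

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical error code from this proposal; not part of any MPI standard. */
#ifndef MPI_ERR_BUFFER_UNUSABLE
#define MPI_ERR_BUFFER_UNUSABLE (MPI_ERR_LASTCODE + 1)
#endif

/* Receive into *bufp; if the library reports the buffer as unusable,
 * quarantine it and hand the caller a fresh one so the operation can be
 * reposted after whatever rollback the application decides on. */
int recv_with_buffer_recovery(void **bufp, int count, int src, int tag,
                              MPI_Comm comm)
{
    MPI_Status status;
    int err = MPI_Recv(*bufp, count, MPI_BYTE, src, tag, comm, &status);

    if (err == MPI_ERR_BUFFER_UNUSABLE) {
        /* The old buffer may still be hit by a stale rendezvous write, so it
         * must never be reused (intentionally "leaked"/quarantined here). */
        *bufp = malloc((size_t)count);
    }
    return err;   /* caller reposts the receive and performs its rollback */
}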
> > > > >  
> > > > > I know this is an expensive and painful solution, but this is all I've
> > > > > come up with so far. Thoughts from the group?
> > > > >  
> > > > > Thanks,
> > > > >  
> > > > > Wesley
> > > > >  
> > > > >  
> > > > >  
> > > > >  
> > > >  
> > > >  
> > > > --
> > > > Pavan Balaji
> > > > http://www.mcs.anl.gov/~balaji
> > > >  
> > > >  
> > >  
> > >  
> > >  
> > >  
> >  
> >  
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> >  
>  
>  
>  
>  



