[Mpi3-ft] Problem with reusing rendezvous memory buffers

Wed Jul 31 12:55:02 CDT 2013

> I'm not talking about RMA here.  I'm just talking about send/recv.  I'm also
> not talking about simultaneous receives posted on the same buffer.  They are
> separated by failure notifications, COMM_AGREE, whatever else you need.  I
> think Sayantan got the point Wesley mentioned and his solution is correct
> (assuming enough support from the network to do so).  Though I'm not
> convinced it's not adding overhead.  Sayantan should comment on this.
>

Repeating my email from yesterday in case it got lost.

" I think this problem doesn't exist (even without the per-process key).

The target process will destroy the connection to the "observed failed" process. That itself is sufficient for all networks that I'm aware of to be able to discard the late message when it arrives. Basically, the late message arrives on a dead connection/endpoint."

I also support Rich's view that this is a network implementation issue. As far as the SW is concerned, it should just destroy the connection/endpoint; unless the network is completely broken, it should then reject packets from that source once the connection is destroyed.
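
To make the argument concrete, here is a minimal sketch of the behaviour I am assuming from the network. It is a toy model in Python, not any real network or MPI API: the Receiver class and its connect, observe_failure, and deliver methods are hypothetical names invented for illustration. It only shows the ordering argument, namely that a write arriving after the endpoint has been destroyed is dropped, while writes from live senders into the same buffer still land.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Toy model of a target process; not a real network or MPI API."""
    buffer: bytearray
    # Peers whose connection/endpoint is still open (hypothetical bookkeeping).
    open_endpoints: set = field(default_factory=set)

    def connect(self, peer: int) -> None:
        self.open_endpoints.add(peer)

    def observe_failure(self, peer: int) -> None:
        # On an observed failure the software just destroys the endpoint.
        self.open_endpoints.discard(peer)

    def deliver(self, peer: int, offset: int, payload: bytes) -> bool:
        """Return True if the payload was written, False if it was dropped."""
        if peer not in self.open_endpoints:
            return False  # late message on a dead endpoint: rejected
        self.buffer[offset:offset + len(payload)] = payload
        return True


rx = Receiver(buffer=bytearray(8))
rx.connect(1)
rx.connect(2)

rx.observe_failure(1)             # rank 1 is observed as failed
late = rx.deliver(1, 0, b"OLD!")  # its slow message arrives afterwards
live = rx.deliver(2, 4, b"good")  # a live sender still gets through
print(late, live, bytes(rx.buffer))
```

The check in deliver stands in for whatever the hardware or transport does when a packet shows up on a torn-down connection. A network that lacks such a check is the case Pavan is worried about, and this sketch does not cover it.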

Thanks,
Sayantan.

> With respect to the K Computer, I don't have a link.  This was my
> understanding from what the Fujitsu folks mentioned (and was the reason
> why they didn't want to release their low-level API; since the hardware has
> no protection, anyone could write to anyone's memory).  And I'm not trying
> to prove that K computer is not good enough in any way.  That was just an
> example.  My point was only that you should consider the case that not all
> networks would have such protection capabilities.
> 
>   -- Pavan
> 
> On 07/31/2013 11:50 AM, George Bosilca wrote:
> > On Tue, Jul 30, 2013 at 10:47 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> >>
> >> Hmm.  Yes, you are right.  Generating different per-process rkeys is
> >> an option on IB.  Though that's obviously less scalable than a single
> >> rkey and hurts performance too because a new rkey has to be generated
> >> for each process.  Even more of a reason for FT to have a
> >> requested/provided option like threads.
> >
> > I fail to understand the scenario allowing you to reach such a
> > conclusion. You want to have an MPI_RECV with a buffer where multiple
> > senders can do RMA operations. The only way this can be done in the
> > context of the MPI standard is if each of the receives on this
> > particular buffer is using non-contiguous datatypes. Thus, unlike
> > what you suggest in your answer above, this is not hurting performance
> > as you are already in a niche mode (I'm not even talking about the
> > fact that non-contiguous datatypes usually conflict with RMA
> > operations). Moreover, you suppose that the detection of a dead
> > process and the re-posting of the receive buffer can happen faster
> > than an RMA message can cross the network. The only potential case where
> > such a scenario can happen is when multiple paths between the source
> > and the destination exist, and the failure detection happens on one
> > path while the RMA message takes another one. This is highly improbable
> > in most cases.
> >
> > There are too many ifs in this scenario to make it plausible. Even if
> > we suppose that all those ifs will be true, as Rich said, this is an
> > issue of [quality of] implementation, not of the MPI standard. A high
> > quality MPI implementation will delay reporting the process failure
> > error on that particular MPI_RECV until all possible RMA operations
> > from the dead process have either been discarded by the network or
> > written to the memory.
> >
> >>
> >> However, please also think about this problem for other networks that
> >> might not have such hardware protection capabilities (K Computer comes
> >> to mind).
> >
> > K computer? My understanding is that there are such capabilities in
> > the TOFU network. I might be wrong though, in which case I would
> > definitely appreciate it if you could point me to a
> > link/documentation that proves your point.
> >
> >> Maybe they cannot provide MPI-specified FT, and that would be fine.
> >
> > Not really, FT can be supported without overhead for the normal
> > execution even for the types of network you mention. The solution I
> > presented above uses the timeouts of the network layer to ensure no
> > delivery can occur after the error reporting, by delaying the error
> > reporting until all timeouts have expired. Trivial to implement, and
> > without impact on the normal execution path.
> >
> >    Thanks,
> >      George.
> >
> >
> >>
> >>   -- Pavan
> >>
> >>
> >> On 07/30/2013 02:59 PM, Sur, Sayantan wrote:
> >>>
> >>> Hi Wesley,
> >>>
> >>> Looks like your attachment didn't make it through. Using IB, one can
> >>> generate rkeys for each sender and just invalidate the key for the
> >>> observed failed process. HW can just drop the "slow" message when it
> >>> arrives. I'm assuming that generating keys should be fast in the
> >>> future given that recently announced HW/firmware has support for
> >>> on-demand registration. In any case, it is not a restriction of IB per se.
> >>>
> >>> Thanks,
> >>>
> >>> Sayantan
> >>>
> >>> *From:* mpi3-ft-bounces at lists.mpi-forum.org
> >>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] *On Behalf Of* Wesley Bland
> >>> *Sent:* Tuesday, July 30, 2013 11:04 AM
> >>> *To:* MPI3-FT Working Group
> >>> *Subject:* [Mpi3-ft] Problem with reusing rendezvous memory buffers
> >>>
> >>>
> >>> Pavan pointed out a problem to me yesterday related to memory
> >>> buffers used with rendezvous protocols. If a process passes a piece
> >>> of memory to the library in an MPI_RECV and the library gives that
> >>> memory to the hardware, where it is pinned, we can get into trouble
> >>> if one of the processes that could write into that memory fails. The
> >>> problem comes from a process sending a slow message and then dying.
> >>> It is possible that the other processes could detect and handle the
> >>> failure before the slow message arrives. Then when the message does
> >>> arrive, it could corrupt the memory without the application having a way
> >>> to handle this.
> >>> My whiteboard example is attached as an image.
> >>>
> >>> We can't just unmap memory from the NIC when a failure occurs
> >>> because that memory is still being used by another process's
> >>> message. Some hardware supports unmapping memory for specific
> >>> senders which would solve this issue, but some don't, such as
> >>> InfiniBand, where the memory region just has a key and unmapping it
> >>> removes it for all senders.
> >>>
> >>> This problem doesn't have a good solution (that I've come up with),
> >>> but I did come up with a solution. We would need to introduce
> >>> another error code (something like MPI_ERR_BUFFER_UNUSABLE) that
> >>> would be able to tell the application that the buffer that the
> >>> library was using is no longer usable because it might be corrupted.
> >>> For some hardware, this wouldn't have to be returned, but for
> >>> hardware that cannot reject the late write, the library could pass
> >>> this error to the application to say that it needs a new buffer in
> >>> order to complete this operation. On the sender side, the operation would
> >>> probably complete successfully since to it, the memory was still
> >>> available. That means that there will be some rollback necessary, but
> >>> that's up to the application to figure out.
> >>>
> >>> I know this is an expensive and painful solution, but this is all
> >>> I've come up with so far. Thoughts from the group?
> >>>
> >>> Thanks,
> >>>
> >>> Wesley
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> mpi3-ft mailing list
> >>> mpi3-ft at lists.mpi-forum.org
> >>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >>>
> >>
> >> --
> >> Pavan Balaji
> >> http://www.mcs.anl.gov/~balaji
> >>
> >
> >
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji