[Mpi3-ft] Problem with reusing rendezvous memory buffers

Wed Jul 31 19:25:04 CDT 2013

PSB

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Wesley Bland
Sent: Wednesday, July 31, 2013 1:31 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] Problem with reusing rendezvous memory buffers

On Wednesday, July 31, 2013 at 11:50 AM, George Bosilca wrote:
On Tue, Jul 30, 2013 at 10:47 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

Hmm. Yes, you are right. Generating different per-process rkeys is an
option on IB. Though that's obviously less scalable than a single rkey and
hurts performance too because a new rkey has to be generated for each
process. Even more of a reason for FT to have a requested/provided option
like threads.

I fail to understand the scenario allowing you to reach such a
conclusion. You want to have an MPI_RECV with a buffer where multiple
senders can do RMA operations. The only way this can be done in the
context of the MPI standard is if each of the receives on this
particular buffer is using non-contiguous datatypes. Thus, unlike
what you suggest in your answer above, this is not hurting performance
as you are already in a niche mode (I'm not even talking about the
fact that non-contiguous datatypes usually conflict with RMA
operations). Moreover, you suppose that the detection of a dead
process and the re-posting of the receive buffer can happen faster
than an RMA message can cross the network. The only potential case where
such a scenario can happen is when multiple paths between the source
and the destination exist, and the failure detection happens on one
path while the RMA message takes another one. This is highly improbable
in most cases.

This isn't necessarily about RMA. This can happen with send/receive as well. I don't disagree that this scenario is unlikely, but it could result in a case where the user ends up with bad data and can't do anything about it.

[rich] I think that what you are worried about protecting against is an OS that does not provide adequate process protection. I don’t know if we want to go there.

There are too many ifs in this scenario to make it plausible. Even if
we suppose that all those ifs are true, as Rich said, this is an
issue of [quality of] implementation, not of the MPI standard. A high-quality
MPI implementation will delay reporting the process failure error on
that particular MPI_RECV until all possible RMA operations from the dead
process have either been discarded by the network or written to memory.

However, please also think about this problem for other networks that might
not have such hardware protection capabilities (K Computer comes to mind).

K computer? My understanding is that there are such capabilities in
the TOFU network. I might be wrong though, in which case I would
definitely appreciate it if you could point me to a
link or documentation that supports your point.

Maybe they cannot provide MPI-specified FT, and that would be fine.

Not really, FT can be supported without overhead for the normal
execution even for the types of network you mention. The solution I
presented above uses the timeouts of the network layer to ensure no
delivery can occur after the error reporting, by delaying the error
reporting until all timeouts have expired. Trivial to implement, and
without impact on the normal execution path.
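
As a rough illustration of the delayed-reporting idea described above, the failure could be surfaced only after the latest network-layer timeout of any transfer still in flight from the dead peer. This is a toy sketch, not code from any MPI implementation; the function name and the time values are hypothetical.

```python
def failure_report_time(detected_at, pending_deadlines):
    """Earliest time at which the failure may be reported on the receive.

    detected_at: time the peer's death was noticed (arbitrary units).
    pending_deadlines: network-layer timeout expiry of each transfer
    that the dead peer might still have in flight (assumed to be known).
    """
    # Reporting waits for whichever comes last: detection, or the final timeout.
    return max([detected_at, *pending_deadlines])

# Hypothetical numbers, for illustration only.
detected_at = 10.0
pending_deadlines = [12.5, 11.0]
report_at = failure_report_time(detected_at, pending_deadlines)
print(report_at)
```

With no transfers in flight the report time equals the detection time, so the normal execution path is not affected in this model.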

Thanks,
George.

-- Pavan

On 07/30/2013 02:59 PM, Sur, Sayantan wrote:

Hi Wesley,

Looks like your attachment didn’t make it through. Using IB, one can
generate rkeys for each sender and just invalidate the key for the
observed failed process. HW can just drop the “slow” message when it
arrives. I’m assuming that generating keys should be fast in the future
given that recently announced HW/firmware has support for on-demand
registration. In any case, it is not a restriction of IB per se.
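
The per-sender key idea above can be pictured with a small toy model. This is only a sketch in plain Python and does not use the InfiniBand verbs API; the `PinnedRegion` class and its methods are hypothetical names made up for illustration.

```python
import secrets

class PinnedRegion:
    """Toy model of a registered buffer with one remote key per sender."""

    def __init__(self, size, senders):
        self.data = bytearray(size)
        # One independent key per sender (hypothetical stand-in for an rkey).
        self.keys = {s: secrets.token_hex(4) for s in senders}

    def invalidate(self, sender):
        # Revoke only the failed sender's key; other senders keep theirs.
        self.keys.pop(sender, None)

    def write(self, sender, key, payload):
        # The modeled hardware drops any write whose key is no longer valid.
        if self.keys.get(sender) != key:
            return False
        self.data[:len(payload)] = payload
        return True

region = PinnedRegion(4, ["A", "B"])
key_a, key_b = region.keys["A"], region.keys["B"]
region.invalidate("A")                      # A is observed as failed
late = region.write("A", key_a, b"AAAA")    # A's slow message arrives afterwards
ok = region.write("B", key_b, b"BBBB")      # B is unaffected
print(late, ok, bytes(region.data))
```

The cost mentioned elsewhere in the thread shows up here as the per-sender dictionary: one key has to be generated and tracked for every process that may write into the region.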

Thanks,

Sayantan

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Wesley Bland
Sent: Tuesday, July 30, 2013 11:04 AM
To: MPI3-FT Working Group
Subject: [Mpi3-ft] Problem with reusing rendezvous memory buffers

Pavan pointed out a problem to me yesterday related to memory buffers
used with rendezvous protocols. If a process passes a piece of memory to
the library in an MPI_RECV and the library gives that memory to the
hardware, where it is pinned, we can get into trouble if one of the
processes that could write into that memory fails. The problem comes
from a process sending a slow message and then dying. It is possible
that the other processes could detect and handle the failure before the
slow message arrives. Then when the message does arrive, it could
corrupt the memory without the application having a way to handle this.
My whiteboard example is attached as an image.
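
The ordering of events described above can be replayed in a few lines. This is a toy timeline in plain Python, not MPI code; `nic_write` is a hypothetical stand-in for the hardware writing directly into pinned memory, and the payloads are made up.

```python
buffer = bytearray(4)              # memory pinned for the receive

def nic_write(buf, payload):
    """Stand-in for the NIC writing into pinned memory."""
    buf[:len(payload)] = payload

# 1. Sender A issues a slow write and then dies.
# 2. The receiver detects and handles A's failure.
# 3. Sender B's data lands in the same buffer and is handed to the application.
nic_write(buffer, b"BBBB")
delivered = bytes(buffer)          # what the application believes it received

# 4. A's delayed write finally arrives and overwrites part of the buffer.
nic_write(buffer, b"AA")
corrupted = bytes(buffer) != delivered
print(delivered, bytes(buffer), corrupted)
```

Nothing in step 4 raises an error in this model, which is the point: the application has no event it could react to.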

We can't just unmap memory from the NIC when a failure occurs because
that memory is still being used by another process's message. Some
hardware supports unmapping memory for specific senders which would
solve this issue, but some don't, such as InfiniBand, where the memory
region just has a key and unmapping it removes it for all senders.

This problem doesn't have a good solution (that I've come up with), but
I did come up with a solution. We would need to introduce another error
code (something like MPI_ERR_BUFFER_UNUSABLE) that would be able to tell
the application that the buffer that the library was using is no longer
usable because it might be corrupted. For some hardware, this wouldn't
have to be returned, but for hardware where per-sender unmapping isn't
possible, the library could pass this error to the application to say that
it needs a new buffer in order to complete this operation. On the sender side, the
operation would probably complete successfully since to it, the memory
was still available. That means that there will be some rollback
necessary, but that's up to the application to figure out.
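
From the application's side, handling the proposed error might look roughly like the sketch below. It is plain Python rather than real MPI: `MPI_ERR_BUFFER_UNUSABLE` is only the error code suggested above and exists in no MPI standard, the numeric values are placeholders, and `recv_with_fresh_buffer` and `flaky_recv` are hypothetical helpers.

```python
MPI_SUCCESS = 0               # placeholder value for this sketch
MPI_ERR_BUFFER_UNUSABLE = 1   # proposed error code; placeholder value

def recv_with_fresh_buffer(try_recv, size, max_attempts=3):
    """Retry a receive with a newly allocated buffer whenever the old one
    is reported as possibly corrupted."""
    for _ in range(max_attempts):
        buf = bytearray(size)             # never reuse the suspect buffer
        code = try_recv(buf)
        if code != MPI_ERR_BUFFER_UNUSABLE:
            return code, buf
    return MPI_ERR_BUFFER_UNUSABLE, None

# Fake receive that reports an unusable buffer once, then succeeds.
attempts = []
def flaky_recv(buf):
    attempts.append(1)
    if len(attempts) == 1:
        return MPI_ERR_BUFFER_UNUSABLE
    buf[:] = b"data"
    return MPI_SUCCESS

code, buf = recv_with_fresh_buffer(flaky_recv, 4)
print(code, bytes(buf), len(attempts))
```

The sketch does not model the sender-side rollback: since the sender may already consider its operation complete, the application would still need its own way to get the data sent again.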

I know this is an expensive and painful solution, but this is all I've
come up with so far. Thoughts from the group?

Thanks,

Wesley

_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji

