[Mpi3-ft] Problem with reusing rendezvous memory buffers

Thu Aug 1 05:09:42 CDT 2013

On Wed, Jul 31, 2013 at 7:28 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> George,
>
> I'm not talking about RMA here.  I'm just talking about send/recv.  I'm also
> not talking about simultaneous receives posted on the same buffer.  They are
> separated by failure notifications, COMM_AGREE, whatever else you need.  I
> think Sayantan got the point Wesley mentioned and his solution is correct
> (assuming enough support from the network to do so).  Though I'm not
> convinced it's not adding overhead.  Sayantan should comment on this.

If you're not talking about RMA then Sayantan is absolutely right: this is a
non-issue, as any message from a process considered dead should be
discarded. In fact, the current ULFM implementation already takes such
cases into account, as such late messages can arrive in any multi-rail
situation.

  George.
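
[Illustration: a minimal sketch of the "discard anything from a process
considered dead" rule for the software send/recv path. It is not ULFM code;
peer_table, fragment, mark_dead and deliver are hypothetical names, and
MAX_PROCS is an arbitrary limit chosen for the example. It does not cover
the case where the NIC writes directly into user memory.]

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 64  /* illustrative limit, not from the thread */

/* Hypothetical bookkeeping: which peers this process considers dead. */
typedef struct {
    bool dead[MAX_PROCS];
} peer_table;

/* Hypothetical incoming fragment as seen by the receive path. */
typedef struct {
    int src_rank;
    int payload;
} fragment;

/* Record that a peer has been reported as failed. */
static void mark_dead(peer_table *t, int rank)
{
    assert(rank >= 0 && rank < MAX_PROCS);
    t->dead[rank] = true;
}

/* Returns true if the fragment was copied into user_buf, false if it was
 * discarded because its sender is already considered dead. */
static bool deliver(const peer_table *t, const fragment *f, int *user_buf)
{
    assert(f->src_rank >= 0 && f->src_rank < MAX_PROCS);
    if (t->dead[f->src_rank])
        return false;          /* late message from a dead peer: drop it */
    *user_buf = f->payload;    /* normal path: copy into the user buffer */
    return true;
}
```

[The check sits on the receive path only, so under these assumptions a
failure-free run pays for one table lookup per fragment and nothing else.]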

> With respect to the K Computer, I don't have a link.  This was my
> understanding from what the Fujitsu folks mentioned (and was the reason why
> they didn't want to release their low-level API; since the hardware has no
> protection, anyone could write to anyone's memory).  And I'm not trying to
> prove that K computer is not good enough in any way.  That was just an
> example.  My point was only that you should consider the case that not all
> networks would have such protection capabilities.
>
>  -- Pavan
>
>
> On 07/31/2013 11:50 AM, George Bosilca wrote:
>>
>> On Tue, Jul 30, 2013 at 10:47 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>>
>>>
>>> Hmm.  Yes, you are right.  Generating different per-process rkeys is an
>>> option on IB.  Though that's obviously less scalable than a single rkey and
>>> hurts performance too because a new rkey has to be generated for each
>>> process.  Even more of a reason for FT to have a requested/provided option
>>> like threads.
>>
>>
>> I fail to understand the scenario that leads you to such a
>> conclusion. You want to have an MPI_RECV with a buffer where multiple
>> senders can do RMA operations. The only way this can be done in the
>> context of the MPI standard is if each of the receives on this
>> particular buffer uses a non-contiguous datatype. Thus, unlike
>> what you suggest in your answer above, this does not hurt performance,
>> as you are already in a niche mode (I'm not even talking about the
>> fact that non-contiguous datatypes usually conflict with RMA
>> operations). Moreover, you suppose that the detection of a dead
>> process and the re-posting of the receive buffer can happen faster
>> than an RMA message can cross the network. The only potential case where
>> such a scenario can happen is when multiple paths between the source
>> and the destination exist, and the failure detection happens on one
>> path while the RMA message takes another one. This is highly improbable
>> in most cases.
>>
>> There are too many ifs in this scenario to make it plausible. Even if
>> we suppose that all those ifs hold, as Rich said, this is an
>> issue of [quality of] implementation, not of the MPI standard. A
>> high-quality MPI implementation will delay reporting the process failure
>> error on that particular MPI_RECV until all possible RMA operations from
>> the dead process have either been discarded by the network or written to
>> memory.
>>
>>>
>>> However, please also think about this problem for other networks that might
>>> not have such hardware protection capabilities (K Computer comes to mind).
>>
>>
>> K computer? My understanding is that there are such capabilities in
>> the TOFU network. I might be wrong though, in which case I would
>> definitely appreciate it if you could point me to a
>> link/documentation that supports your point.
>>
>>> Maybe they cannot provide MPI-specified FT, and that would be fine.
>>
>>
>> Not really, FT can be supported without overhead for the normal
>> execution even for the types of network you mention. The solution I
>> presented above uses the timeouts of the network layer to ensure no
>> delivery can occur after the error reporting, by delaying the error
>> reporting until all timeouts have expired. Trivial to implement, and
>> without impact on the normal execution path.
>>
>>    Thanks,
>>      George.
>>
>>
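
[Illustration: a minimal sketch of the delayed error reporting described
above, assuming the network layer guarantees that nothing is delivered more
than net_timeout seconds after it was sent. pending_recv, note_failure and
may_report_failure are hypothetical names, and time is passed in explicitly
so the logic can be followed without a real clock.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-receive state kept by the library. */
typedef struct {
    bool   failure_seen;   /* a failure of the sender has been detected */
    double detected_at;    /* time of first detection, in seconds */
    double net_timeout;    /* assumed upper bound on in-flight delivery */
} pending_recv;

/* Remember the first time the sender was seen as failed. */
static void note_failure(pending_recv *r, double now)
{
    if (!r->failure_seen) {
        r->failure_seen = true;
        r->detected_at = now;
    }
}

/* True once the failure may be surfaced to the application: every message
 * the dead sender could still have in flight has either landed or timed
 * out, so the buffer can no longer be written behind the user's back. */
static bool may_report_failure(const pending_recv *r, double now)
{
    assert(r->net_timeout >= 0.0);
    return r->failure_seen && (now - r->detected_at) >= r->net_timeout;
}
```

[Nothing here runs unless a failure has been noted, which is the sense in
which the normal execution path is untouched in this sketch.]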
>>>
>>>   -- Pavan
>>>
>>>
>>> On 07/30/2013 02:59 PM, Sur, Sayantan wrote:
>>>>
>>>>
>>>> Hi Wesley,
>>>>
>>>> Looks like your attachment didn’t make it through. Using IB, one can
>>>> generate rkeys for each sender and just invalidate the key for the
>>>> observed failed process. HW can just drop the “slow” message when it
>>>> arrives. I’m assuming that generating keys should be fast in the future
>>>> given that recently announced HW/firmware has support for on-demand
>>>> registration. In any case, it is not a restriction of IB per se.
>>>>
>>>> Thanks,
>>>>
>>>> Sayantan
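
[Illustration: a minimal simulation of the per-sender key idea above. This
is not the InfiniBand verbs API; region, region_init, invalidate_sender and
remote_write are hypothetical names, the key values are made up, and the
check in remote_write stands in for what the HW would do on an incoming
write.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_SENDERS 8   /* illustrative limit, not from the thread */

/* Simulated registration of one receive buffer with one key per sender. */
typedef struct {
    uint32_t key[MAX_SENDERS];
    bool     valid[MAX_SENDERS];
    int     *buf;
} region;

/* Hand out a distinct (made-up) key to every potential sender. */
static void region_init(region *r, int *buf)
{
    for (int i = 0; i < MAX_SENDERS; i++) {
        r->key[i] = 0x1000u + (uint32_t)i;
        r->valid[i] = true;
    }
    r->buf = buf;
}

/* Revoke only the failed sender's key; other senders are unaffected. */
static void invalidate_sender(region *r, int sender)
{
    assert(sender >= 0 && sender < MAX_SENDERS);
    r->valid[sender] = false;
}

/* Simulated incoming write: applied only if the presented key is still
 * valid for that sender, otherwise dropped. Returns true if applied. */
static bool remote_write(region *r, int sender, uint32_t key, int value)
{
    assert(sender >= 0 && sender < MAX_SENDERS);
    if (!r->valid[sender] || r->key[sender] != key)
        return false;   /* "slow" message from a failed sender: dropped */
    *r->buf = value;
    return true;
}
```

[The cost discussed in the thread shows up in region_init: one key per
sender instead of a single key for the whole region.]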
>>>>
>>>> From: mpi3-ft-bounces at lists.mpi-forum.org
>>>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Wesley Bland
>>>> Sent: Tuesday, July 30, 2013 11:04 AM
>>>> To: MPI3-FT Working Group
>>>> Subject: [Mpi3-ft] Problem with reusing rendezvous memory buffers
>>>>
>>>>
>>>> Pavan pointed out a problem to me yesterday related to memory buffers
>>>> used with rendezvous protocols. If a process passes a piece of memory to
>>>> the library in an MPI_RECV and the library gives that memory to the
>>>> hardware, where it is pinned, we can get into trouble if one of the
>>>> processes that could write into that memory fails. The problem comes
>>>> from a process sending a slow message and then dying. It is possible
>>>> that the other processes could detect and handle the failure before the
>>>> slow message arrives. Then when the message does arrive, it could
>>>> corrupt the memory without the application having a way to handle this.
>>>> My whiteboard example is attached as an image.
>>>>
>>>> We can't just unmap memory from the NIC when a failure occurs because
>>>> that memory is still being used by another process's message. Some
>>>> hardware supports unmapping memory for specific senders which would
>>>> solve this issue, but some don't, such as InfiniBand, where the memory
>>>> region just has a key and unmapping it removes it for all senders.
>>>>
>>>> This problem doesn't have a good solution (that I've come up with), but
>>>> I did come up with a solution. We would need to introduce another error
>>>> code (something like MPI_ERR_BUFFER_UNUSABLE) that would be able to tell
>>>> the application that the buffer that the library was using is no longer
>>>> usable because it might be corrupted. For some hardware, this wouldn't
>>>> have to be returned, but for hardware that cannot drop the late write,
>>>> the library could pass this error to the application to say that it
>>>> needs a new buffer in order to complete this operation. On the sender
>>>> side, the operation would probably complete successfully since, from its
>>>> point of view, the memory was still available. That means that there
>>>> will be some rollback necessary, but that's up to the application to
>>>> figure out.
>>>>
>>>> I know this is an expensive and painful solution, but this is all I've
>>>> come up with so far. Thoughts from the group?
>>>>
>>>> Thanks,
>>>>
>>>> Wesley
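
[Illustration: a minimal sketch of how an application might react to the
proposed error code. MPI_ERR_BUFFER_UNUSABLE does not exist in MPI, so the
sketch uses a stand-in constant and a mock receive function instead of real
MPI calls; every name below is hypothetical. The old buffer is retired
rather than freed, on the assumption that a late message may still be
written into it. Sender-side rollback is not shown.]

```c
#include <assert.h>
#include <stdlib.h>

enum { SKETCH_SUCCESS = 0, SKETCH_ERR_BUFFER_UNUSABLE = 1 };

#define MAX_RETIRED 4   /* illustrative bound on buffer replacements */

/* Buffers that must not be reused or freed until it is known to be safe. */
typedef struct {
    int *retired[MAX_RETIRED];
    int  count;
} retired_list;

/* Stand-in for a receive call into the library. */
typedef int (*recv_fn)(int *buf, void *ctx);

/* Mock receive: reports "buffer unusable" failures_left times, then
 * succeeds and stores value into the buffer. */
typedef struct { int failures_left; int value; } mock_ctx;

static int mock_recv(int *buf, void *ctx)
{
    mock_ctx *m = ctx;
    if (m->failures_left > 0) {
        m->failures_left--;
        return SKETCH_ERR_BUFFER_UNUSABLE;
    }
    *buf = m->value;
    return SKETCH_SUCCESS;
}

/* Retry the receive with a fresh buffer each time the library reports
 * that the current one may have been corrupted. */
static int recv_with_fresh_buffer(recv_fn do_recv, void *ctx,
                                  int **buf, retired_list *old)
{
    int rc = do_recv(*buf, ctx);
    while (rc == SKETCH_ERR_BUFFER_UNUSABLE && old->count < MAX_RETIRED) {
        int *fresh = malloc(sizeof *fresh);
        if (fresh == NULL)
            return rc;
        old->retired[old->count++] = *buf;  /* keep it alive, do not reuse */
        *buf = fresh;
        rc = do_recv(*buf, ctx);
    }
    return rc;
}
```

[When it is safe to release the retired buffers is left open here, as it is
in the proposal itself.]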
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> mpi3-ft mailing list
>>>> mpi3-ft at lists.mpi-forum.org
>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>
>>>
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>>
>>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji