[Mpi3-ft] MPI_ANY_SOURCE ... again...
Martin Schulz
schulzm at llnl.gov
Wed Jan 25 18:29:36 CST 2012
Hi Josh, all,
I agree, I think the current approach is fine. Blocking is likely to be more problematic in many cases, IMHO. However, I am still a bit worried about splitting the semantics for P2P and Collective routines. I don't see a reason why a communicator wouldn't support ANY_SOURCE after a collective call to validate. If the semantics are split, though, any intermediate layer trying to replace collectives with P2P solutions (and there are plenty of tuning frameworks out there that try exactly that) will have a hard time maintaining the same semantics in an error case.
Martin
On Jan 25, 2012, at 3:06 PM, Howard Pritchard wrote:
> Hi Josh,
>
> Cray is okay with the semantics described in the current
> FTWG proposal attached to the ticket.
>
> We plan to just leverage the out-of-band system fault
> detector software that currently kills jobs if
> a node goes down that the job was running on.
>
> Howard
>
> Josh Hursey wrote:
>> We really need to make a decision on semantics for MPI_ANY_SOURCE.
>>
>> During the plenary session the MPI Forum had a problem with the current
>> proposed semantics. The current proposal states (roughly) that a receive
>> on MPI_ANY_SOURCE returns when a failure emerges in the communicator. The
>> MPI Forum read this as a strong requirement for -progress- (something
>> the MPI standard tries to stay away from).
>>
>> The alternative proposal is that a receive on MPI_ANY_SOURCE will block
>> until completed with a message. This means that it will -not- return
>> when a new failure has been encountered (even if the calling process is
>> the only process left alive in the communicator). This does get around
>> the concern about progress, but puts a large burden on the end user.
>>
>>
>> There are a couple good use cases for MPI_ANY_SOURCE (grumble, grumble)
>> - Manager/Worker applications, and easy load balancing when
>> multiple incoming messages are expected. This blocking behavior makes
>> the use of MPI_ANY_SOURCE dangerous for fault tolerant applications, and
>> opens up another opportunity for deadlock.
>>
>> Applications that want to use MPI_ANY_SOURCE and be fault tolerant
>> will need to build their own failure detector on top of MPI using
>> directed point-to-point messages. A basic implementation might post
>> an MPI_Irecv() for each worker process with an unused tag, then poll on
>> MPI_Testany(). If any of these requests complete in error
>> (MPI_ERR_PROC_FAIL_STOP) then the target has failed and the application
>> can take action. This user-level failure detector can (should) be
>> implemented in a third-party library since failure detectors can be
>> difficult to implement in a scalable manner.
>>
>> In reality, the MPI library or the runtime system that supports MPI will
>> already be doing something similar. Even for MPI_ERRORS_ARE_FATAL on
>> MPI_COMM_WORLD, the underlying system must detect the process failure,
>> and terminate all other processes in MPI_COMM_WORLD. So this represents
>> a -detection- of the failure, and a -notification- of the failure
>> throughout the system (though the notification is an order to
>> terminate). For MPI_ERRORS_RETURN, the MPI library will use this
>> detection/notification functionality to reason about the state of the
>> message traffic in the system. So it seems silly to force the user to
>> duplicate this (nontrivial) detection/notification functionality on top
>> of MPI, just to avoid the progress discussion.
>>
>>
>> So that is a rough summary of the debate. If we are going to move
>> forward, we need to make a decision on MPI_ANY_SOURCE. I would like to
>> make such a decision before/during the next teleconf (Feb. 1).
>>
>> I'm torn on this one, so I look forward to your comments.
>>
>> -- Josh
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>>
>
>
> --
> Howard Pritchard
> Software Engineering
> Cray, Inc.
________________________________________________________________________
Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm
CASC @ Lawrence Livermore National Laboratory, Livermore, USA