If we were to change the MPI_ANY_SOURCE semantics so that a receive only returns when completed/matched (i.e., the suggested modification to the proposal), would this make life more difficult for an intermediate layer virtualizing the collective operations?<div>

<br></div><div>I am open to the idea that calling a collective validate() also re-enables the posting of MPI_ANY_SOURCE receives. The reason we did not do that in the current proposal is that we were uncertain whether someone would want to do a validate but not re-enable the posting of MPI_ANY_SOURCE receives. Since re-enabling ANY_SOURCE is a local/quick operation, there was no performance argument to be made. Though I think you pose an interesting programmability argument.</div>

<div><br></div><div>So can you elaborate a bit on your example (I just want to make sure I fully understand)? Would such an intermediate layer be using MPI_ANY_SOURCE in its collective operations and depend on the return-when-new-proc-failure semantic? If so, the intermediate library would have to re-enable ANY_SOURCE, then mask the semantics for user-initiated p2p operations over the same communicator. If we added the reenable_any_source semantics to the validate, then the intermediate library would not have to virtualize the p2p communication. Is that on the right track?</div>

<div><br></div><div>Thanks,</div><div>Josh<br><br><div class="gmail_quote">On Wed, Jan 25, 2012 at 7:29 PM, Martin Schulz <span dir="ltr">&lt;<a href="mailto:schulzm@llnl.gov">schulzm@llnl.gov</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi Josh, all,<br>

<br>

I agree, I think the current approach is fine. Blocking is likely to be more problematic in many cases, IMHO. However, I am still a bit worried about splitting the semantics for P2P and Collective routines. I don't see a reason why a communicator wouldn't support ANY_SOURCE after a collective call to validate. If it is split, though, any intermediate layer trying to replace collectives with P2P solutions (and there are plenty of tuning frameworks out there that try exactly that) has a hard time maintaining the same semantics in an error case.<br>


<br>

Martin<br>

<div><div class="h5"><br>

<br>

On Jan 25, 2012, at 3:06 PM, Howard Pritchard wrote:<br>

<br>

> Hi Josh,<br>

><br>

> Cray is okay with the semantics described in the current<br>

> FTWG proposal attached to the ticket.<br>

><br>

> We plan to just leverage the out-of-band system fault<br>

> detector software that currently kills jobs if<br>

> a node goes down that the job was running on.<br>

><br>

> Howard<br>

><br>

> Josh Hursey wrote:<br>

>> We really need to make a decision on semantics for MPI_ANY_SOURCE.<br>

>><br>

>> During the plenary session the MPI Forum had a problem with the current<br>

>> proposed semantics. The current proposal states (roughly) that<br>

>> MPI_ANY_SOURCE returns when a failure emerges in the communicator. The<br>

>> MPI Forum read this as a strong requirement for -progress- (something<br>

>> the MPI standard tries to stay away from).<br>

>><br>

>> The alternative proposal is that a receive on MPI_ANY_SOURCE will block<br>

>> until completed with a message. This means that it will -not- return<br>

>> when a new failure has been encountered (even if the calling process is<br>

>> the only process left alive in the communicator). This does get around<br>

>> the concern about progress, but puts a large burden on the end user.<br>

>><br>

>><br>

>> There are a couple of good use cases for MPI_ANY_SOURCE (grumble, grumble)<br>

>> - Manager/Worker applications, and easy load balancing when<br>

>> multiple incoming messages are expected. This blocking behavior makes<br>

>> the use of MPI_ANY_SOURCE dangerous for fault tolerant applications, and<br>

>> opens up another opportunity for deadlock.<br>

>><br>

>> Applications that want to use MPI_ANY_SOURCE and be fault tolerant<br>

>> will need to build their own failure detector on top of MPI using<br>

>> directed point-to-point messages. A basic implementation might post<br>

>> MPI_Irecv()'s to each worker process with an unused tag, then poll on<br>

>> MPI_Testany(). If any of these requests completes in error<br>

>> (MPI_ERR_PROC_FAIL_STOP) then the target has failed and the application<br>

>> can take action. This user-level failure detector can (should) be<br>

>> implemented in a third-party library since failure detectors can be<br>

>> difficult to implement in a scalable manner.<br>

>><br>

>> In reality, the MPI library or the runtime system that supports MPI will<br>

>> already be doing something similar. Even for MPI_ERRORS_ARE_FATAL on<br>

>> MPI_COMM_WORLD, the underlying system must detect the process failure,<br>

>> and terminate all other processes in MPI_COMM_WORLD. So this represents<br>

>> a -detection- of the failure, and a -notification- of the failure<br>

>> throughout the system (though the notification is an order to<br>

>> terminate). For MPI_ERRORS_RETURN, the MPI will use this<br>

>> detection/notification functionality to reason about the state of the<br>

>> message traffic in the system. So it seems silly to force the user to<br>

>> duplicate this (nontrivial) detection/notification functionality on top<br>

>> of MPI, just to avoid the progress discussion.<br>

>><br>

>><br>

>> So that is a rough summary of the debate. If we are going to move<br>

>> forward, we need to make a decision on MPI_ANY_SOURCE. I would like to<br>

>> make such a decision before/during the next teleconf (Feb. 1).<br>

>><br>

>> I'm torn on this one, so I look forward to your comments.<br>

>><br>

>> -- Josh<br>

>><br>

>> --<br>

>> Joshua Hursey<br>

>> Postdoctoral Research Associate<br>

>> Oak Ridge National Laboratory<br>

>> <a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><br>

>><br>

><br>

><br>

> --<br>

> Howard Pritchard<br>

> Software Engineering<br>

> Cray, Inc.<br>

> _______________________________________________<br>

> mpi3-ft mailing list<br>

> <a href="mailto:mpi3-ft@lists.mpi-forum.org">mpi3-ft@lists.mpi-forum.org</a><br>

> <a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft" target="_blank">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft</a><br>

<br>

</div></div>________________________________________________________________________<br>

Martin Schulz, <a href="mailto:schulzm@llnl.gov">schulzm@llnl.gov</a>, <a href="http://people.llnl.gov/schulzm" target="_blank">http://people.llnl.gov/schulzm</a><br>

CASC @ Lawrence Livermore National Laboratory, Livermore, USA<br>

<div class="HOEnZb"><div class="h5"><br>

<br>

<br>

<br>


<br>

</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>Joshua Hursey<br>Postdoctoral Research Associate<br>Oak Ridge National Laboratory<br><a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><br>


</div>