Maybe we should take a different tack on this thread. There are clearly two camps here: (1) MPI_ANY_SOURCE returns on process failure, (2) MPI_ANY_SOURCE blocks. We are all concerned about what can and should go into this round of the proposal so that we have a successful first reading during (hopefully) the next MPI Forum meeting. So our hearts are all in the right place, and we should not lose sight of that.<div>
<br></div><div>Most of the motivation for (2) is that concern has been expressed that the definition of (1) implies progress. I have yet to hear anyone argue that (2) is better for the user, so progress seems to be the central issue. As such, let us discuss the merits of the progress critique. If we stick with (1), we will need a well-reasoned argument against this critique when it comes up again during the next reading. If we go with (2), then we will have to explain the merits of this critique to those who think (1) is the better choice. So either way we need a firm assessment of this issue.</div>
<div><br></div><div>So would someone like to start the discussion by advocating for the progress critique?</div><div><br></div><div>Maybe it would be useful to set up a teleconference to discuss just this issue - and to try to include those who raised concerns during the meeting.</div>
<div><br></div><div>Thanks,</div><div>Josh</div><div><br></div><div><br><div class="gmail_quote">On Thu, Jan 26, 2012 at 4:59 PM, Graham, Richard L. <span dir="ltr"><<a href="mailto:rlgraham@ornl.gov">rlgraham@ornl.gov</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
Sayantan, for some reason you seem to be against what is being proposed for recovery when there is a failure and a posted blocking wild-card receive.<br>
<br>
Posting to self will require threads to be active, and would reimplement at the user level the same type of monitoring that is needed at the MPI level. I am also pretty sure that your current MPI implementation already has to check for failed processes (the check does not have to be active - it can be passive at the MPI level) so that it can terminate cleanly on failure, so there is no new requirement here.<br>
<br>
I have worked on apps that would hang (even 100% of the time) under the scenario you are proposing, where all their receives are wild-carded in the compute portion of the code. While this is not the way we might write apps today, it is how some are written. There are good reasons for apps to use wild card receives, and some users have simply written their codes that way - whether we like it or not. The subject of removing wild card receives has come up many times, but the discussion never lasts long, as removal is simply not practical.<br>
<br>
Rich<br>
<div class="HOEnZb"><div class="h5"><br>
On Jan 26, 2012, at 4:27 PM, Sur, Sayantan wrote:<br>
<br>
> Your point is well taken. But I do not agree with your assertion that the situation is "unrecoverable". For example, the app/lib using blocking recv could use another thread to post a self send and satisfy that blocking receive. I know this is not elegant, but still recoverable, right? :-)<br>
><br>
> I also disagree with your statement that this does not have an impact on implementation cost. It specifically requires the MPI library to have sustained interaction with the process management subsystem, polling/watching the entire system to see where a failure occurred. It needs to do so at some pre-determined frequency, which may or may not be at the granularity the application requires. In the case where Recv(ANY) is posted on a sub-communicator, you would also need to convey the participants of the sub-communicator to the out-of-band system, which means you need to have the layout available. Another alternative is to send info on process failures to -everyone- and then have the MPI library pick the process to deliver the error to. IMHO, this is significant implementation complexity - not saying it can't be done TODAY, but what about at scale a few years from now?<br>
><br>
> This is after all a compromise - if we (app writers + MPI community) can live with something less than what is ideal, should we go for it?<br>
><br>
> Thanks.<br>
><br>
> ===<br>
> Sayantan Sur, Ph.D.<br>
> Intel Corp.<br>
><br>
>> -----Original Message-----<br>
>> From: <a href="mailto:mpi3-ft-bounces@lists.mpi-forum.org">mpi3-ft-bounces@lists.mpi-forum.org</a> [mailto:<a href="mailto:mpi3-ft-">mpi3-ft-</a><br>
>> <a href="mailto:bounces@lists.mpi-forum.org">bounces@lists.mpi-forum.org</a>] On Behalf Of Aurélien Bouteiller<br>
>> Sent: Thursday, January 26, 2012 10:40 AM<br>
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group<br>
>> Subject: Re: [Mpi3-ft] MPI_ANY_SOURCE ... again...<br>
>><br>
>> In my opinion, a fault tolerant proposal in which a valid MPI program<br>
>> deadlocks, without a remedy, in case of failure, does not address the issue,<br>
>> and is worthless. I do not support the proposition that ANY_SOURCE should<br>
>> just not work. -Every- function of the MPI standard should have a well-defined<br>
>> behavior in case of failure, one that does not leave the application in an<br>
>> unrecoverable state.<br>
>><br>
>> Moreover, the cost of handling any-source is inconsequential for<br>
>> implementation performance. The objections are purely a matter of<br>
>> intellectual aesthetics, not implementation cost. This is not a good reason to<br>
>> drop functionality to the point where the proposal no longer even tolerates<br>
>> failures of valid MPI applications.<br>
>><br>
>><br>
>> Aurelien<br>
>><br>
>><br>
>> On Jan 26, 2012, at 11:24 AM, Bronevetsky, Greg wrote:<br>
>><br>
>>> Actually, I think that we can be backwards compatible. Define semantics to<br>
>> say that we don't guarantee that MPI_ANY_SOURCE will get unblocked due<br>
>> to failure but don't preclude this possibility.<br>
>>><br>
>>> I think I'm coming down on the blocks side as well, but for a less pessimistic<br>
>> reason than Josh. Standards bodies are conservative for a reason: mistakes in<br>
>> standards are expensive. As such, if there is any feature that can be<br>
>> evaluated outside the standard before being included in the standard, then<br>
>> this is the preferable path. MPI_ANY_SOURCE returns is exactly such a<br>
>> feature. Sure, users will be harmed in the short term, but if these turn out<br>
>> not to be the best semantics, they'll be harmed in the long term.<br>
>>><br>
>>> As such, let's go for the most barebones spec we can come up with, on top<br>
>> of which we can implement all the other functionality we think is important.<br>
>> This gives us the flexibility to try out several APIs and decide on which is best<br>
>> before we come back before the forum to standardize the full MPI-FT API. At<br>
>> that point in time we'll have done a significantly stronger evaluation, which<br>
>> will make it much more difficult for the forum to say no, even though the list<br>
>> of features will be significantly more extensive in that proposal.<br>
>>><br>
>>> Greg Bronevetsky<br>
>>> Lawrence Livermore National Lab<br>
>>> <a href="tel:%28925%29%20424-5756" value="+19254245756">(925) 424-5756</a><br>
>>> <a href="mailto:bronevetsky@llnl.gov">bronevetsky@llnl.gov</a><br>
>>> <a href="http://greg.bronevetsky.com" target="_blank">http://greg.bronevetsky.com</a><br>
>>><br>
>>> From: <a href="mailto:mpi3-ft-bounces@lists.mpi-forum.org">mpi3-ft-bounces@lists.mpi-forum.org</a><br>
>>> [mailto:<a href="mailto:mpi3-ft-bounces@lists.mpi-forum.org">mpi3-ft-bounces@lists.mpi-forum.org</a>] On Behalf Of Josh Hursey<br>
>>> Sent: Thursday, January 26, 2012 8:11 AM<br>
>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group<br>
>>> Subject: Re: [Mpi3-ft] MPI_ANY_SOURCE ... again...<br>
>>><br>
>>> I mentioned something similar to folks here yesterday. If we decide on the<br>
>> "MPI_ANY_SOURCE blocks" semantics then I would expose the failure<br>
>> notification through an Open MPI specific API (or message in a special<br>
>> context, as Greg suggests), and the third-party library could short circuit their<br>
>> implementation of the detector/notifier with this call. I would then strongly<br>
>> advocate that other MPI implementations do the same.<br>
>>><br>
>>> I think the distasteful aspect of the "MPI_ANY_SOURCE blocks" path is that<br>
>> if we are going to advocate that MPI implementations provide this type of<br>
>> interface anyway, then why can we not just expose these semantics in the<br>
>> standard and be done with it? With the proposed workaround, applications<br>
>> will be programming to the MPI standard and the semantics of this third-<br>
>> party library for MPI_ANY_SOURCE. So it just seems like we are working<br>
>> around a stalemate argument in the forum and the users end up suffering.<br>
>> However, if we stick with the "MPI_ANY_SOURCE returns on error"<br>
>> semantics we would have to thread the needle of the progress discussion<br>
>> (possible, but it may take quite a long time).<br>
>>><br>
>>> So this is where I'm torn. If there were a backwards compatible path<br>
>> from the "MPI_ANY_SOURCE blocks" to the "MPI_ANY_SOURCE returns<br>
>> on error" semantics, then it would be easier to go with the former and propose the<br>
>> latter in a separate ticket. Then have the progress discussion over the<br>
>> separate ticket, and not the general RTS proposal. Maybe that path is<br>
>> defining a new MPI_ANY_SOURCE_RETURN_ON_PROC_FAIL wildcard.<br>
>>><br>
>>> I am, unfortunately, starting to lean towards the "MPI_ANY_SOURCE<br>
>> blocks" camp. I say 'unfortunately' because it hurts users, and hurting users<br>
>> is not something we should be doing, in my opinion...<br>
>>><br>
>>> Good comments, keep them coming.<br>
>>><br>
>>> -- Josh<br>
>>><br>
>>> On Thu, Jan 26, 2012 at 10:48 AM, Bronevetsky, Greg<br>
>> <<a href="mailto:bronevetsky1@llnl.gov">bronevetsky1@llnl.gov</a>> wrote:<br>
>>> I agree that the "MPI_ANY_SOURCE returns on error" semantics is better<br>
>> for users. However, if this is going to be a sticking point for the rest of the<br>
>> forum, it is not actually that difficult to fake this functionality on top of the<br>
>> "MPI_ANY_SOURCE blocks" semantics. Josh, you correctly pointed out that<br>
>> the MPI implementation should be able to leverage its own out-of-band<br>
>> failure detectors to implement the "returns" functionality, but if that is the<br>
>> case, why can't the vendor provide an optional layer to the user that will do<br>
>> exactly the same thing but without messing with the MPI forum?<br>
>>><br>
>>> What I'm proposing is that vendors or any other software developers<br>
>> provide a failure notification layer that sends an MPI message on a pre-<br>
>> defined communicator to the process. They would also provide a PMPI layer<br>
>> that wraps MPI_Recv(MPI_ANY_SOURCE) so that it alternates between<br>
>> testing for the arrival of messages that match this operation and a failure<br>
>> notification message. If the former arrives first, the wrapper returns<br>
>> normally. If the latter arrives first, the original<br>
>> MPI_Recv(MPI_ANY_SOURCE) is cancelled and the call returns with an<br>
>> error. Conveniently, since the failure notifier and the PMPI layer are<br>
>> orthogonal, we can connect the application to any failure detector, making it<br>
>> possible to provide these for systems where the vendors are lazy.<br>
>>><br>
>>> Greg Bronevetsky<br>
>>> Lawrence Livermore National Lab<br>
>>> <a href="tel:%28925%29%20424-5756" value="+19254245756">(925) 424-5756</a><br>
>>> <a href="mailto:bronevetsky@llnl.gov">bronevetsky@llnl.gov</a><br>
>>> <a href="http://greg.bronevetsky.com" target="_blank">http://greg.bronevetsky.com</a><br>
>>><br>
>>><br>
>>>> -----Original Message-----<br>
>>>> From: <a href="mailto:mpi3-ft-bounces@lists.mpi-forum.org">mpi3-ft-bounces@lists.mpi-forum.org</a> [mailto:<a href="mailto:mpi3-ft-">mpi3-ft-</a><br>
>>>> <a href="mailto:bounces@lists.mpi-forum.org">bounces@lists.mpi-forum.org</a>] On Behalf Of Martin Schulz<br>
>>>> Sent: Wednesday, January 25, 2012 4:30 PM<br>
>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working<br>
>>>> Group<br>
>>>> Subject: Re: [Mpi3-ft] MPI_ANY_SOURCE ... again...<br>
>>>><br>
>>>> Hi Josh, all,<br>
>>>><br>
>>>> I agree, I think the current approach is fine. Blocking is likely to<br>
>>>> be more problematic in many cases, IMHO. However, I am still a bit<br>
>>>> worried about splitting the semantics for P2P and Collective<br>
>>>> routines. I don't see a reason why a communicator after a collective<br>
>>>> call to validate wouldn't support ANY_SOURCE. If it is split,<br>
>>>> though, any intermediate layer trying to replace collectives with<br>
>>>> P2P solutions (and there are plenty of tuning frameworks out there<br>
>>>> that try exactly that) have a hard time maintaining the same semantics in an<br>
>> error case.<br>
>>>><br>
>>>> Martin<br>
>>>><br>
>>>><br>
>>>> On Jan 25, 2012, at 3:06 PM, Howard Pritchard wrote:<br>
>>>><br>
>>>>> Hi Josh,<br>
>>>>><br>
>>>>> Cray is okay with the semantics described in the current FTWG<br>
>>>>> proposal attached to the ticket.<br>
>>>>><br>
>>>>> We plan to just leverage the out-of-band system fault detector<br>
>>>>> software that currently kills jobs if a node goes down that the<br>
>>>>> job was running on.<br>
>>>>><br>
>>>>> Howard<br>
>>>>><br>
>>>>> Josh Hursey wrote:<br>
>>>>>> We really need to make a decision on semantics for MPI_ANY_SOURCE.<br>
>>>>>><br>
>>>>>> During the plenary session the MPI Forum had a problem with the<br>
>>>>>> current proposed semantics. The current proposal states (roughly)<br>
>>>>>> that MPI_ANY_SOURCE returns when a failure emerges in the<br>
>>>>>> communicator. The MPI Forum read this as a strong requirement for<br>
>>>>>> -progress- (something the MPI standard tries to stay away from).<br>
>>>>>><br>
>>>>>> The alternative proposal is that a receive on MPI_ANY_SOURCE will<br>
>>>>>> block until completed with a message. This means that it will<br>
>>>>>> -not- return when a new failure has been encountered (even if the<br>
>>>>>> calling process is the only process left alive in the<br>
>>>>>> communicator). This does get around the concern about progress,<br>
>>>>>> but puts a large burden on the end user.<br>
>>>>>><br>
>>>>>><br>
>>>>>> There are a couple good use cases for MPI_ANY_SOURCE (grumble,<br>
>>>>>> grumble)<br>
>>>>>> - Manager/Worker applications, and easy load balancing when<br>
>>>>>> multiple incoming messages are expected. This blocking behavior<br>
>>>>>> makes the use of MPI_ANY_SOURCE dangerous for fault tolerant<br>
>>>>>> applications, and opens up another opportunity for deadlock.<br>
>>>>>><br>
>>>>>> For applications that want to use MPI_ANY_SOURCE and be fault<br>
>>>>>> tolerant they will need to build their own failure detector on<br>
>>>>>> top of MPI using directed point-to-point messages. A basic<br>
>>>>>> implementation might post MPI_Irecv()'s to each worker process<br>
>>>>>> with an unused tag, then poll on Testany(). If any of these<br>
>>>>>> requests complete in error<br>
>>>>>> (MPI_ERR_PROC_FAIL_STOP) then the target has failed and the<br>
>>>>>> application can take action. This user-level failure detector can<br>
>>>>>> (should) be implemented in a third-party library since failure<br>
>>>>>> detectors can be difficult to implement in a scalable manner.<br>
>>>>>><br>
>>>>>> In reality, the MPI library or the runtime system that supports<br>
>>>>>> MPI will already be doing something similar. Even for<br>
>>>>>> MPI_ERRORS_ARE_FATAL on MPI_COMM_WORLD, the underlying system must<br>
>>>>>> detect the process failure, and terminate all other processes in<br>
>>>>>> MPI_COMM_WORLD. So this represents a -detection- of the failure,<br>
>>>>>> and a -notification- of the failure throughout the system (though<br>
>>>>>> the notification is an order to terminate). For<br>
>>>>>> MPI_ERRORS_RETURN, the MPI library will use this detection/notification<br>
>>>>>> functionality to reason about the state of the message traffic in<br>
>>>>>> the system. So it seems silly to force the user to duplicate this<br>
>>>>>> (nontrivial) detection/notification functionality on top of MPI,<br>
>>>>>> just to avoid the progress discussion.<br>
>>>>>><br>
>>>>>><br>
>>>>>> So that is a rough summary of the debate. If we are going to move<br>
>>>>>> forward, we need to make a decision on MPI_ANY_SOURCE. I would<br>
>>>>>> like to make such a decision before/during the next teleconf (Feb. 1).<br>
>>>>>><br>
>>>>>> I'm torn on this one, so I look forward to your comments.<br>
>>>>>><br>
>>>>>> -- Josh<br>
>>>>>><br>
>>>>>> --<br>
>>>>>> Joshua Hursey<br>
>>>>>> Postdoctoral Research Associate<br>
>>>>>> Oak Ridge National Laboratory<br>
>>>>>> <a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><br>
>>>>>><br>
>>>>><br>
>>>>><br>
>>>>> --<br>
>>>>> Howard Pritchard<br>
>>>>> Software Engineering<br>
>>>>> Cray, Inc.<br>
>>>>> _______________________________________________<br>
>>>>> mpi3-ft mailing list<br>
>>>>> <a href="mailto:mpi3-ft@lists.mpi-forum.org">mpi3-ft@lists.mpi-forum.org</a><br>
>>>>> <a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft" target="_blank">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft</a><br>
>>>><br>
>>>><br>
>>>> ________________________________________________________<br>
>>>> Martin Schulz, <a href="mailto:schulzm@llnl.gov">schulzm@llnl.gov</a>, <a href="http://people.llnl.gov/schulzm" target="_blank">http://people.llnl.gov/schulzm</a> CASC<br>
>>>> @ Lawrence Livermore National Laboratory, Livermore, USA<br>
>>>><br>
>>>><br>
>>>><br>
>>>><br>
>>><br>
>>><br>
>>><br>
>>><br>
>>><br>
>>> --<br>
>>> Joshua Hursey<br>
>>> Postdoctoral Research Associate<br>
>>> Oak Ridge National Laboratory<br>
>>> <a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><br>
>><br>
>> --<br>
>> * Dr. Aurélien Bouteiller<br>
>> * Researcher at Innovative Computing Laboratory<br>
>> * University of Tennessee<br>
>> * 1122 Volunteer Boulevard, suite 350<br>
>> * Knoxville, TN 37996<br>
>> * <a href="tel:865%20974%206321" value="+18659746321">865 974 6321</a><br>
>><br>
>><br>
>><br>
>><br>
><br>
><br>
<br>
<br>
<br>
</div></div></blockquote></div><br></div><br clear="all"><div><br></div>-- <br>Joshua Hursey<br>Postdoctoral Research Associate<br>Oak Ridge National Laboratory<br><a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><br>