[Mpi3-ft] MPI_ANY_SOURCE ... again...
herault.thomas at gmail.com
Fri Jan 27 09:10:17 CST 2012
On Jan 27, 2012, at 09:46, Josh Hursey wrote:
> Maybe we should take a different tack on this thread. There are clearly two camps here: (1) MPI_ANY_SOURCE returns on process failure, (2) MPI_ANY_SOURCE blocks. We are all concerned about what can and should go into this round of the proposal so that we have a successful first reading during (hopefully) the next MPI Forum meeting. So our hearts are all in the right place, and we should not lose sight of that.
> Most of the motivation for (2) is the concern that the definition of (1) implies progress. I have yet to hear anyone argue that (2) is better for the user, so the progress issue seems to be the central issue. As such, let us discuss the merits of the progress critique. If we stick with (1), we will need a well-reasoned argument against this critique when it comes up again during the next reading. If we go with (2), we will have to explain the merits of this critique to those who think (1) is the better choice. So either way we need a firm assessment of this issue.
> So would someone like to start the discussion by advocating for the progress critique?
> Maybe it would be useful to try to set up a teleconf to discuss just this issue - maybe trying to include in the discussion those who raised concerns during the meeting.
About "absence of progress": consider the case of a named receive where the sender is in an arbitrarily long non-MPI routine. In this case, do we expect the library to be able to differentiate between a failure and a living process that has not done its send (yet) for internal reasons? I believe we should. The hardware / distributed operating system tells us that the link is not broken, so we can still hope to receive the message, and we wait.
I agree that not expecting asynchronous progress when nobody is in an MPI call is a sensible approach. But our case is very different: one of the peers *is* in a blocking MPI call. This peer can do whatever is needed to make progress, including collaborating with the low-level system (that's one of the roles of the library) to check the validity of the link. So, I definitely advocate an active mechanism, in the library, to detect a potential deadlock in the MPI_ANY_SOURCE case, because we need the same basic building block in the named receive case anyway.
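
To make the argument concrete, here is a minimal sketch of the idea as a toy model in plain Python. It is not MPI code and not any implementation's API: `blocking_recv`, `inbox`, `link_alive` and `max_polls` are hypothetical names, `link_alive` stands in for whatever the low-level system reports about each link, and the any-source branch assumes option (1) from the thread (return as soon as any potential sender is known to have failed).

```python
from enum import Enum

ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE in this toy model


class Outcome(Enum):
    MESSAGE = "message"
    PEER_FAILED = "peer_failed"
    STILL_WAITING = "still_waiting"


def blocking_recv(source, inbox, link_alive, max_polls=1000):
    """Toy model of a receive that polls while the caller is blocked.

    inbox:      dict rank -> list of pending messages (hypothetical)
    link_alive: dict rank -> bool, as reported by a hypothetical
                low-level failure detector
    Returns (outcome, rank, payload).
    """
    # A named receive watches one peer; an any-source receive watches all.
    candidates = list(link_alive) if source == ANY_SOURCE else [source]

    for _ in range(max_polls):
        # 1. Deliver a matching message if one has already arrived.
        for rank in candidates:
            if inbox.get(rank):
                return Outcome.MESSAGE, rank, inbox[rank].pop(0)

        # 2. The receiver itself drives the liveness check: no progress
        #    is required from peers that are outside the library.
        dead = [rank for rank in candidates if not link_alive[rank]]
        if dead:
            return Outcome.PEER_FAILED, dead[0], None

        # 3. The link is up but nothing was sent yet: keep waiting.

    # A real blocking call would keep looping; the bound only keeps
    # this model finite.
    return Outcome.STILL_WAITING, None, None
```

The liveness check in step 2 is the same code for both kinds of receive; only the set of watched peers differs, which is the sense in which the named-receive mechanism is reused for the any-source case.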