[Mpi3-ft] Question about MPI_ANY_SOURCE and process failures

Graham, Richard L. rlgraham at ornl.gov
Thu Aug 19 14:18:14 CDT 2010


I tend to agree: the "ALL processes" option does not make sense.

Rich

On 8/19/10 3:09 PM, "Josh Hursey" <jjhursey at open-mpi.org> wrote:

I think the receive should return an error if ANY process in the communicator fails.

I don't think it should be ALL, per your comments below. Additionally, ALL is not workable even if we do {ALL-self} since the intention of the receive is to receive one message, and the MPI interface cannot assume that the remaining processes will ever provide a message to the application.

NEVER is probably not what we want either for the same reason. If only a subset of the processes in the communicator will ever send a message (due to app. design) the MPI library does not know if the procs that have failed are important or not.

So I think the ANY process fail option is the only one that makes sense here.

The state of the 'status' object should point to one of the process failures. If there are concurrent failures it is hard to tell which was first, so the semantics should probably just say that any one of the failed processes will be identified. Then the user should use a 'validate' command to figure out which ones have failed.
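
To make the ANY vs. ALL distinction concrete, here is a rough sketch in plain C of the decision being discussed. It is only a model of the semantics, not MPI code: the names fail_policy, FAIL_ON_ANY, FAIL_ON_ALL and wildcard_recv_error_rank are made up for illustration, and the array of failure flags stands in for whatever failure knowledge the library actually has. Reporting the lowest failed rank is an arbitrary choice, in line with "any one of the failed processes will be identified".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policies from this thread; these are not MPI constants. */
typedef enum { FAIL_ON_ANY, FAIL_ON_ALL } fail_policy;

/*
 * Model of a pending receive posted with MPI_ANY_SOURCE.
 * failed[r] is true when rank r of the communicator is known to have failed.
 * Returns the rank that would be reported through 'status' if the receive
 * should complete in error, or -1 if it should stay pending.
 */
int wildcard_recv_error_rank(const bool *failed, int comm_size, int self,
                             fail_policy policy)
{
    assert(failed != NULL && comm_size > 0);
    assert(self >= 0 && self < comm_size);
    assert(!failed[self]); /* the calling process is alive by definition */

    int reported = -1; /* lowest failed rank; any failed rank would do */
    int n_failed = 0;
    for (int r = 0; r < comm_size; r++) {
        if (failed[r]) {
            if (reported < 0)
                reported = r;
            n_failed++;
        }
    }

    if (policy == FAIL_ON_ANY)
        return reported; /* -1 when nothing has failed */

    /* FAIL_ON_ALL: self never fails, so n_failed < comm_size always holds
     * and this branch can never report an error (ALL is really NEVER). */
    return (n_failed == comm_size) ? reported : -1;
}
```

With ranks 2 and 3 failed in a 4-process communicator, the ANY policy reports rank 2 while the ALL policy leaves the receive pending forever, which is the problem with ALL described above.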

-- Josh

P.S. I am starting to work on a slightly more formal fail-through proposal for the group. This separates the interface/semantic issues for fail-through from recovery. This will help get us through some of the broader issues of stability before complicating the discussion with (multiple, concurrent) recoveries. More on this in the next couple of weeks.

On Aug 19, 2010, at 2:46 PM, Solt, David George wrote:

> err = MPI_Recv(....., rankX, ..., comm, status);
>
> if communication to rankX fails, this receive will return with err.
>
> err = MPI_Recv(...., MPI_ANY_SOURCE, ..., comm, status);
>
> When does this MPI_Recv return a failure? When ANY rank in comm is unreachable, or when ALL ranks in comm are unreachable? Since self is always reachable, the ALL option is really NEVER.
>
> We had been assuming ALL/NEVER but will likely change to ANY. In that case, status points to the first failed rank that could have matched the request.
>
> Thanks,
> Dave
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>

