[Mpi3-ft] Question about MPI_ANY_SOURCE and process failures

Fab Tillier ftillier at microsoft.com
Thu Aug 19 14:48:01 CDT 2010


Joshua Hursey wrote on Thu, 19 Aug 2010 at 12:09:54

> I think it should be if ANY process fails on the communicator then it
> should return an error.
> 
> I don't think it should be ALL, per your comments below. Additionally,
> ALL is not workable even if we do {ALL-self} since the intention of the
> receive is to receive one message, and the MPI interface cannot assume
> that the remaining processes will ever provide a message to the
> application.
> 
> NEVER is probably not what we want either for the same reason. If only
> a subset of the processes in the communicator will ever send a message
> (due to app. design) the MPI library does not know if the procs that
> have failed are important or not.
> 
> So I think the ANY process fail option is the only one that makes sense
> here.

Erez had documented much of the behavior for this in his error reporting rules available here:

https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/error_report_rules

For a receive with source = MPI_ANY_SOURCE, see example E2: the MPI_Wait fails if any error is detected on the communicator. E3 is similar, but shows the blocking receive case.
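
To make the options being compared concrete, here is a rough sketch of the ANY, ALL, {ALL-self}, and NEVER rules as a plain decision function. This is only a model of the discussion, not MPI code: the enum, the function name, and the `failed[]` array are all made up for illustration and are not part of any MPI interface or of Erez's rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the candidate rules discussed in this thread. */
typedef enum {
    FAIL_ON_ANY,          /* error if any rank in comm has failed      */
    FAIL_ON_ALL,          /* error only if every rank has failed       */
    FAIL_ON_ALL_BUT_SELF, /* error only if every other rank has failed */
    FAIL_NEVER            /* wildcard receive never reports a failure  */
} any_source_policy;

/* Model only: decide whether a receive posted with MPI_ANY_SOURCE would
 * complete with an error. failed[i] is true when rank i is known to have
 * failed; self is the calling rank, which is assumed to be alive. */
bool any_source_recv_fails(any_source_policy policy, const bool *failed,
                           size_t nranks, size_t self)
{
    assert(self < nranks && !failed[self]); /* self is always reachable */

    size_t nfailed = 0;
    for (size_t i = 0; i < nranks; i++) {
        if (failed[i]) {
            nfailed++;
        }
    }

    switch (policy) {
    case FAIL_ON_ANY:
        return nfailed > 0;
    case FAIL_ON_ALL:
        /* Cannot be true while self is alive, so this behaves as NEVER. */
        return nfailed == nranks;
    case FAIL_ON_ALL_BUT_SELF:
        return nranks > 1 && nfailed == nranks - 1;
    case FAIL_NEVER:
        return false;
    }
    return false;
}
```

With a 4-rank communicator where only rank 1 has failed, only FAIL_ON_ANY reports an error. Under {ALL-self} the receive would stay pending until ranks 2 and 3 had failed as well, which is the case Josh objects to: the library cannot know whether those ranks would ever have sent a message.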

> The state of the 'status' object should point to one of the process
> failures. If there are concurrent failures it is hard to tell which was
> first, so the semantics should probably just say that any one of the
> failed processes will be identified. Then the user should use a
> 'validate' command to figure out which ones have failed.
> 
> -- Josh
> 
> P.S. I am starting to work on a slightly more formal fail-through
> proposal for the group. This separates the interface/semantic issues for
> fail-through from recovery. This will help get us through some of the
> broader issues of stability before complicating the discussion with
> (multiple, concurrent) recoveries. More on this in the next couple weeks.

I don't know how much of the rules Erez put together would apply to this work, but I suspect they would serve as a good starting point.

-Fab

> On Aug 19, 2010, at 2:46 PM, Solt, David George wrote:
> 
>> err = MPI_Recv(....., rankX, ..., comm, status);
>> 
>> If communication to rankX fails, this receive will return with err.
>> 
>> err = MPI_Recv(...., MPI_ANY_SOURCE, ..., comm, status);
>> 
>> When does this MPI_Recv return a failure? When ANY rank in comm is
>> unreachable, or when ALL ranks in comm are unreachable? Since self is
>> always reachable, the ALL option is really NEVER.
>> 
>> We had been assuming ALL/NEVER but will likely change to ANY. In
>> such a case, status points to the first failed rank that could have
>> matched the request.
>> 
>> Thanks,
>> Dave
>> 
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>> 



