[Mpi3-ft] Question about MPI_ANY_SOURCE and process failures
Fab Tillier
ftillier at microsoft.com
Thu Aug 19 14:48:01 CDT 2010
Joshua Hursey wrote on Thu, 19 Aug 2010 at 12:09:54
> I think it should return an error if ANY process on the communicator
> fails.
>
> I don't think it should be ALL, per your comments below. Additionally,
> ALL is not workable even if we do {ALL-self} since the intention of the
> receive is to receive one message, and the MPI interface cannot assume
> that the remaining processes will ever provide a message to the
> application.
>
> NEVER is probably not what we want either for the same reason. If only
> a subset of the processes in the communicator will ever send a message
> (due to app. design) the MPI library does not know if the procs that
> have failed are important or not.
>
> So I think the ANY process fail option is the only one that makes sense
> here.
Erez documented much of this behavior in his error reporting rules, available here:
https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/error_report_rules
For a receive with source = MPI_ANY_SOURCE, see example E2: the MPI_Wait fails if any error is detected on the communicator. E3 is similar, but shows the blocking receive case.
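To make the ANY / ALL / NEVER options concrete, here is a minimal sketch in plain C. It is a toy model, not MPI library code: the wildcard_policy enum, the failed[] array and the function name are all invented for this illustration, and it assumes the library already knows which ranks have failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy names for this sketch; they are not MPI constants. */
typedef enum { FAIL_ON_ANY, FAIL_ON_ALL, FAIL_NEVER } wildcard_policy;

/*
 * Decide whether a pending receive on MPI_ANY_SOURCE should complete
 * with an error. failed[i] is true when rank i of the communicator is
 * known to have failed; self is the caller's own rank, which is always
 * treated as reachable.
 */
static bool wildcard_recv_should_fail(wildcard_policy policy,
                                      const bool *failed, size_t nranks,
                                      size_t self)
{
    assert(failed != NULL && self < nranks);

    /* Count failed peers; self is skipped because it is always reachable. */
    size_t nfailed = 0;
    for (size_t i = 0; i < nranks; i++) {
        if (i != self && failed[i]) {
            nfailed++;
        }
    }

    switch (policy) {
    case FAIL_ON_ANY:
        return nfailed > 0;
    case FAIL_ON_ALL:
        /* nfailed is at most nranks - 1, so this is never true. */
        return nfailed == nranks;
    case FAIL_NEVER:
        return false;
    }
    return false;
}
```

With four ranks and only rank 2 failed, FAIL_ON_ANY reports an error while FAIL_ON_ALL does not. FAIL_ON_ALL stays false even when every peer has failed, which is the "ALL is really NEVER" observation from Dave's mail.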
> The state of the 'status' object should point to one of the process
> failures. If there are concurrent failures it is hard to tell which was
> first, so the semantics should probably just say that any one of the
> failed processes will be identified. Then the user should use a
> 'validate' command to figure out which ones have failed.
>
> -- Josh
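The reporting rule Josh describes could be modelled roughly as below. This is again a plain C sketch under assumptions, not an MPI interface: pick_reported_rank and collect_failed_ranks are hypothetical helpers, with the second standing in for whatever the 'validate' operation ends up being.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return one failed rank to report in the status of a failed wildcard
 * receive, or -1 if no peer has failed. This sketch happens to return
 * the lowest failed rank, but callers must not rely on which one they
 * get, since concurrent failures have no well-defined order.
 */
static int pick_reported_rank(const bool *failed, size_t nranks, size_t self)
{
    assert(failed != NULL && self < nranks);
    for (size_t i = 0; i < nranks; i++) {
        if (i != self && failed[i]) {
            return (int)i;
        }
    }
    return -1;
}

/*
 * Stand-in for the 'validate' step: copy every failed peer rank into
 * out[] (which must hold at least nranks entries) and return the count.
 */
static size_t collect_failed_ranks(const bool *failed, size_t nranks,
                                   size_t self, int *out)
{
    assert(failed != NULL && out != NULL && self < nranks);
    size_t n = 0;
    for (size_t i = 0; i < nranks; i++) {
        if (i != self && failed[i]) {
            out[n++] = (int)i;
        }
    }
    return n;
}
```

The idea is that the status only has to name some failed rank, and the application then calls the validate-like step to get the full list before deciding how to proceed.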
>
> P.S. I am starting to work on a slightly more formal fail-through
> proposal for the group. This separates the interface/semantic issues for
> fail-through from recovery. This will help get us through some of the
> broader issues of stability before complicating the discussion with
> (multiple, concurrent) recoveries. More on this in the next couple weeks.
I don't know how much of the rules Erez put together would apply to this work, but I suspect they would serve as a good starting point.
-Fab
> On Aug 19, 2010, at 2:46 PM, Solt, David George wrote:
>
>> err = MPI_Recv(....., rankX, ..., comm, status);
>>
>> if communication to rankX fails, this receive will return with err.
>>
>> err = MPI_Recv(...., MPI_ANY_SOURCE, ..., comm, status);
>>
>> When does this MPI_Recv return a failure? When ANY rank in comm is
>> unreachable, or when ALL ranks in comm are unreachable? Since self is
>> always reachable, the ALL option is really NEVER.
>>
>> We had been assuming ALL/NEVER but will likely change to ANY. In
>> that case, status points to the first failed rank that could have
>> matched the request.
>>
>> Thanks,
>> Dave
>>
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>