[Mpi3-ft] Ticket #324: Clarify MPI_ERRORS_ARE_FATAL scope of abort
bouteill at icl.utk.edu
Mon May 13 22:49:11 CDT 2013
They are, but they are outside the scope of the ticket, which only defines the propagation range of abort (restricting it to the communicator on which it is invoked, rather than all processes). The items discussed in the ticket comments are important, but unrelated to the ticket itself.
The simple answer to the problem of "my freed request brings the entire system down" is "do not free requests that are not completed, it will trigger the wrong error handler". If this really is important functionality that needs to be supported, I advocate dealing with it in a separate ticket. The ticket at hand clarifies -only- the scope of abort propagation (on the comm). Deciding which communicator abort should be called on for orphaned requests is a separate question; these are two different issues.
For buffered sends, no error should be triggered. The buffered send has succeeded (in the sense that the expected specification on the send buffer has been met). It is a case of lazy error detection that is perfectly valid in the context of ULFM (just as, from the perspective of the root, a bcast may appear to have reached a process that was in fact dead). An error will be returned in a subsequent operation. Remember that no correctness is lost here: the scenario where the message is fully received -just before- the receiver dies is isomorphic in terms of "returned errors", so the application must already deal with such a case.
On May 13, 2013, at 23:01, George Bosilca <bosilca at icl.utk.edu> wrote:
> Dave's points are entirely valid and represent a subtle [corner-]case in the standard. Orphaned non-completed requests, in the sense that the request was freed (MPI_Request_free) and the associated communicator was freed as well (MPI_Comm_free), are defined as raising errors on MPI_COMM_WORLD. Thus, the scope of the request becomes global, and a fault on such a request will bring down the entire MPI_COMM_WORLD (which is against the original scope of the ticket, at least the first part).
> On May 13, 2013, at 12:05 , Wesley Bland <wbland at mcs.anl.gov> wrote:
>> After looking at this ticket some more, Aurelien and I were confused about the objections to the ticket from the forum at large. It appeared that some of the objections reported by Dave on the ticket might have come from a misunderstanding in the forum of what the ticket meant. The proposed plan at this point is to discuss the ticket during our plenary in San Jose to try to discern the objection so we can bring a new version of this ticket if necessary or start the process again if the text is good.
>> On May 7, 2013, at 4:53 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>>> author: jjhursey
>>> This ticket essentially links MPI_ERRORS_ARE_FATAL on a communicator to calling MPI_ABORT on the communicator, i.e. only the processes in that communicator are aborted, while other communicators could potentially remain functional.
>>> There was much discussion on the ticket about the scope of this change, and in the end the ticket has remained stagnant for about a year because of it. However, I don't think that the changes here should be too controversial. According to the ticket, the main argument against it at the Japan meeting was that for some types of functions, there is no request which can be used to provide error checking. Therefore, when an error occurs, the entire application would be forced to fall back to MPI_ERRORS_ARE_FATAL despite setting another error handler, making FT difficult. Some alternate text was provided on the ticket.
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375