[Mpi3-ft] Defining the state of MPI after an error

Joshua Hursey jjhursey at open-mpi.org
Mon Sep 20 10:02:36 CDT 2010


Yeah. So we are just defining what MPI should do if the application tries to call back into it after an error. We are not assessing whether the application is erroneous for having received the error in the first place. Certainly for MPI_ERR_UNSUPPORTED_OPERATION it is not the application's fault that the MPI library does not support the operation, and the application may very well be able to work around unsupported operations. For other error classes, it is less clear whether it is an application bug or not.

Since the state of MPI is undefined after an error, it is possible that the user could call an MPI function after receiving an error and MPI returns MPI_SUCCESS even though it may not have done anything that the user expected. Additionally, it is undefined whether the application can call MPI_ERROR_STRING to get a string to display before calling MPI_Abort (and it is likewise undefined whether MPI_Abort itself can be called).

So this proposal just defines an ability for the MPI implementation to lock the user out of the MPI library if it can no longer continue operating normally. This way the application can look for the MPI_ERR_CANNOT_CONTINUE error class to know when the MPI library is no longer usable. Further, if it receives any return code other than MPI_ERR_CANNOT_CONTINUE, it knows that the MPI library is behaving normally for that operation.
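
As a rough sketch of the check described above, the decision an application would make on each return could look something like the following. The SKETCH_* constants and next_action() are hypothetical stand-ins invented for illustration so the snippet compiles without an MPI installation; MPI_ERR_CANNOT_CONTINUE is only proposed here and is not in any released mpi.h. A real program would obtain the class with MPI_Error_class() and compare against the MPI_ERR_* names.

```c
/* Hypothetical stand-ins for MPI error classes (assumption: these are
 * NOT the values from mpi.h, and CANNOT_CONTINUE is only proposed). */
enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_COUNT,
    SKETCH_ERR_TAG,
    SKETCH_ERR_UNSUPPORTED_OPERATION,
    SKETCH_ERR_CANNOT_CONTINUE
};

/* What the application may do after an MPI call returns. */
typedef enum {
    ACTION_PROCEED,        /* call succeeded, carry on */
    ACTION_RECOVER,        /* call failed, but MPI itself is still usable */
    ACTION_STOP_USING_MPI  /* library has locked the application out */
} action_t;

/* Map an error class to the next step under the proposed semantics. */
action_t next_action(int err_class)
{
    if (err_class == SKETCH_SUCCESS)
        return ACTION_PROCEED;
    if (err_class == SKETCH_ERR_CANNOT_CONTINUE)
        return ACTION_STOP_USING_MPI;
    /* Any other class: the implementation is internally consistent,
     * so the application may work around the error or retry. */
    return ACTION_RECOVER;
}
```

For example, next_action(SKETCH_ERR_UNSUPPORTED_OPERATION) yields ACTION_RECOVER, so the application could fall back to a different operation, while ACTION_STOP_USING_MPI means no further MPI calls should be attempted.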

-- Josh

On Sep 20, 2010, at 10:42 AM, Darius Buntinas wrote:

> 
> I don't think Josh meant that the MPI implementation would fix application bugs, but rather that the return of an error class other than CANNOT_CONTINUE means that the implementation is in an internally consistent state and that it can continue performing MPI functions.
> 
> -d
> 
> On Sep 20, 2010, at 9:33 AM, Richard Treumann wrote:
> 
>> 
>> How does an application experience errors in classes (MPI_ERR_COUNT, MPI_ERR_TAG) except by a bug in the application itself? 
>> 
>> How can it be easier for someone to know how to continue from an arbitrary application bug with confidence that the application is still giving good answers, than to just fix the app? 
>> 
>> 
>> Dick Treumann  -  MPI Team           
>> IBM Systems & Technology Group
>> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
>> Tele (845) 433-7846         Fax (845) 433-8363
>> 
>> 
>> 
>> From:	Joshua Hursey <jjhursey at open-mpi.org>
>> To:	"MPI 3.0 Fault Tolerance and Dynamic Process Control working Group" <mpi3-ft at lists.mpi-forum.org>
>> Date:	09/20/2010 10:05 AM
>> Subject:	[Mpi3-ft] Defining the state of MPI after an error
>> Sent by: 	mpi3-ft-bounces at lists.mpi-forum.org
>> 
>> 
>> 
>> 
>> During EuroMPI and the MPI Forum meeting last week, the issue of the MPI state after an error was brought up a few times. Because the state is undefined, no portable program can be written that uses error handlers and then continues to use MPI functionality after the error. This is particularly difficult for applications that wish to catch informational or warning-type errors (e.g., MPI_ERR_COUNT, MPI_ERR_TAG, MPI_ERR_UNSUPPORTED_OPERATION). These errors are often recoverable by the MPI implementation and/or the application.
>> 
>> To address this portability issue, I am bringing out the MPI_ERR_CANNOT_CONTINUE error class from the stabilization proposal. I presented the idea to the MPI Forum during a plenary session last week and received a positive response on building a formal proposal [Straw vote: 22 (yes), 0 (no), 3 (abstain)].
>> 
>> I have created a first draft of the proposal for the working group to review on the wiki at the link below:
>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/err_cannot_continue
>> 
>> I would like to have this proposal ready by the Oct. meeting so we can have a formal plenary session on it. If all goes well, maybe we can get a first reading by Dec.
>> 
>> Let me know what you think about this proposal.
>> 
>> -- Josh
>> 
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://www.cs.indiana.edu/~jjhursey
>> 
>> 
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>> 
> 
> 

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey




