[Mpi3-ft] Defining the state of MPI after an error

Richard Treumann treumann at us.ibm.com
Mon Sep 20 11:00:51 CDT 2010


I am not talking about libmpi fixing an application bug.  I am talking 
about the fact that if an application has a bug, the state of the 
application becomes unknown.  Something that was part of the algorithm 
that the author was trying to apply to get an answer has not happened as 
envisioned.  How can the application state be trusted? 

I see no problem with urging MPI implementations to refrain from shooting 
down future MPI calls when the user has set MPI_ERRORS_RETURN, but I have 
a hard time imagining going much beyond that for application bugs.

For example, a call to MPI_Bcast that has a bad communicator at one task 
will eventually hang, but one that has a bad communicator at all tasks can 
continue (the application state is probably corrupted, but libmpi should 
be OK).
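
The broadcast example above can be restated as a small toy model. The
sketch below is plain C and makes no MPI calls; the enum labels and
predict_bcast() are hypothetical names introduced only for illustration.
It encodes the claim as stated: if every task passes a bad communicator,
each call can fail locally and return, whereas a mix of good and bad
communicators leaves the good tasks waiting on peers that never join.
What a real library does in each case is implementation dependent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome labels for this sketch; these are NOT MPI constants. */
enum bcast_outcome {
    BCAST_COMPLETES,       /* every task passed a valid communicator        */
    BCAST_ALL_TASKS_ERROR, /* every task fails locally, so nobody waits     */
    BCAST_LIKELY_HANG      /* valid tasks wait on peers that never join     */
};

/* Predict the outcome of one collective call from a per-task flag that
 * says whether that task passed a valid communicator. */
enum bcast_outcome predict_bcast(const bool *comm_is_valid, size_t ntasks)
{
    assert(comm_is_valid != NULL && ntasks > 0);

    size_t bad = 0;
    for (size_t i = 0; i < ntasks; i++) {
        if (!comm_is_valid[i]) {
            bad++;
        }
    }

    if (bad == 0) {
        return BCAST_COMPLETES;
    }
    if (bad == ntasks) {
        return BCAST_ALL_TASKS_ERROR;
    }
    return BCAST_LIKELY_HANG;
}
```

With three tasks, the flags {true, false, true} give BCAST_LIKELY_HANG,
while {false, false, false} give BCAST_ALL_TASKS_ERROR. The second case is
the one where libmpi may remain usable even though the application state
is suspect.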




Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363




From:     Darius Buntinas <buntinas at mcs.anl.gov>
To:       "MPI 3.0 Fault Tolerance and Dynamic Process Control working Group"
          <mpi3-ft at lists.mpi-forum.org>
Date:     09/20/2010 10:43 AM
Subject:  Re: [Mpi3-ft] Defining the state of MPI after an error
Sent by:  mpi3-ft-bounces at lists.mpi-forum.org




I don't think Josh meant that the MPI implementation would fix application 
bugs, but rather that the return of an error class other than 
CANNOT_CONTINUE means that the implementation is in an internally 
consistent state and that it can continue performing MPI functions.
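
The reading above can be sketched as a three-way decision on the error
class returned by an MPI call. This is a toy model in plain C, not code
from the proposal: the SKETCH_* constants are hypothetical stand-ins with
arbitrary values (only 0 for success mirrors MPI_SUCCESS), and
MPI_ERR_CANNOT_CONTINUE is the proposed class, not an existing one. In a
real program the class would presumably come from MPI_Error_class() after
setting MPI_ERRORS_RETURN.

```c
#include <assert.h>

/* Hypothetical stand-ins for MPI error classes. The values are arbitrary,
 * except that success is 0 as with MPI_SUCCESS. */
enum {
    SKETCH_SUCCESS             = 0,
    SKETCH_ERR_COUNT           = 2,
    SKETCH_ERR_TAG             = 4,
    SKETCH_ERR_CANNOT_CONTINUE = 1000 /* models the PROPOSED class */
};

/* What the caller may do next; hypothetical labels for this sketch. */
enum next_step {
    STEP_PROCEED,        /* no error: carry on                             */
    STEP_APP_DECIDES,    /* library is consistent; the application must    */
                         /* judge whether its own state can be trusted     */
    STEP_STOP_USING_MPI  /* library state is no longer usable              */
};

enum next_step after_mpi_call(int err_class)
{
    assert(err_class >= 0); /* error classes are non-negative here */

    if (err_class == SKETCH_SUCCESS) {
        return STEP_PROCEED;
    }
    if (err_class == SKETCH_ERR_CANNOT_CONTINUE) {
        return STEP_STOP_USING_MPI;
    }
    /* Any other class: MPI itself can keep going, which says nothing
     * about whether the application's answers are still good. */
    return STEP_APP_DECIDES;
}
```

The middle case is where the two views in this thread meet: the library
promises only that further MPI calls are safe, and the question of whether
the application state is still trustworthy stays with the application.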

-d

On Sep 20, 2010, at 9:33 AM, Richard Treumann wrote:

> 
> How does an application experience errors in classes (MPI_ERR_COUNT,
> MPI_ERR_TAG) except by a bug in the application itself?
>
> How can it be easier for someone to know how to continue from an
> arbitrary application bug with confidence that the application is still
> giving good answers, than to just fix the app?
> 
> 
> Dick Treumann  -  MPI Team 
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846         Fax (845) 433-8363
> 
> 
> 
> From:     Joshua Hursey <jjhursey at open-mpi.org>
> To:       "MPI 3.0 Fault Tolerance and Dynamic Process Control
>           working Group" <mpi3-ft at lists.mpi-forum.org>
> Date:     09/20/2010 10:05 AM
> Subject:  [Mpi3-ft] Defining the state of MPI after an error
> Sent by:  mpi3-ft-bounces at lists.mpi-forum.org
> 
> 
> 
> 
> During EuroMPI and the MPI Forum meeting last week, the issue of the MPI
> state after an error was brought up a few times. The issue is that, since
> the state is undefined, no portable program can be written that uses
> error handlers and then uses MPI functionality following the error. This
> issue is particularly difficult for applications that wish to catch
> informational or warning-type errors (e.g., MPI_ERR_COUNT, MPI_ERR_TAG,
> MPI_ERR_UNSUPPORTED_OPERATION). These errors are often recoverable by
> the MPI implementation and/or the application.
> 
> To address this portability issue, I am bringing out the
> MPI_ERR_CANNOT_CONTINUE error class from the stabilization proposal. I
> presented the idea to the MPI Forum during a plenary session last week
> and received a positive response on building a formal proposal [Straw
> vote: 22 (yes), 0 (no), 3 (abstain)].
> 
> I have created a first draft of the proposal for the working group to
> review on the wiki at the link below:
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/err_cannot_continue
> 
> I would like to have this proposal ready by the Oct. meeting so we can
> have a formal plenary session on it. If all goes well, maybe we can get
> a first reading by Dec.
> 
> Let me know what you think about this proposal.
> 
> -- Josh
> 
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://www.cs.indiana.edu/~jjhursey
> 
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> 
> 

