[Mpi3-ft] Exit Code from 'mpirun' upon failure recovery
jjhursey at open-mpi.org
Fri Feb 4 10:50:40 CST 2011
The standard is pretty cagey about this issue. The only place where I see it referenced is after MPI_Abort() where it says:
Advice to users. Whether the errorcode is returned from the executable or from the MPI process startup mechanism (e.g., mpiexec), is an aspect of quality of the MPI library but not mandatory. (End of advice to users.)
Advice to implementors. Where possible, a high-quality implementation will try to return the errorcode from the MPI process startup mechanism (e.g. mpiexec or singleton init). (End of advice to implementors.)
So it does not say what happens if MPI_Abort is called from multiple processes each with different errorcodes, or if the MPI implementation chooses to continue with the application and it completes normally later.
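To make the gap concrete, here is a minimal sketch of one policy a launcher could adopt for these cases. It is not taken from the standard or from any implementation. The names `Outcome`, `ProcessResult`, and `launcher_exit_status` are hypothetical, and both the tie-break (the lowest-ranked aborting process wins) and the generic failure status of 1 are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Outcome(Enum):
    FINALIZED = "finalized"  # process reached MPI_Finalize
    ABORTED = "aborted"      # process called MPI_Abort
    FAILED = "failed"        # process terminated abnormally


@dataclass
class ProcessResult:
    rank: int
    outcome: Outcome
    errorcode: Optional[int] = None  # only meaningful for ABORTED


def launcher_exit_status(results: Iterable[ProcessResult]) -> int:
    """Pick a single exit status from per-process outcomes (illustrative policy)."""
    results = list(results)
    aborted = sorted(
        (r for r in results if r.outcome is Outcome.ABORTED),
        key=lambda r: r.rank,
    )
    if aborted:
        # Assumption: with several differing errorcodes, report the one
        # from the lowest rank; never report 0 for an aborted job.
        code = aborted[0].errorcode
        return code if code else 1
    if any(r.outcome is Outcome.FINALIZED for r in results):
        # Some processes failed, but every survivor finalized: success.
        return 0
    # No process finalized and none supplied an errorcode.
    return 1
```

Under this policy a job that loses a process but finalizes everywhere else reports 0, while a job in which ranks 0 and 1 abort with codes 7 and 3 reports 7. Other tie-breaks (first abort observed, largest code) would be equally defensible.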
Since the standard is reluctant to give us guidance, I suspect that any decision we make is not appropriate for explicit standardization, though it could perhaps become an additional 'Advice to implementors'. I was mostly trying to see if anyone had strong feelings about the expected behavior in the various fault-tolerant scenarios. In particular, I know that some applications check for a non-zero return code from 'mpirun' as one indicator that the job did not complete successfully, and use it to decide whether they should take recovery steps after job completion.
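As an illustration of the kind of check such applications perform, here is a minimal sketch of a wrapper that runs a launcher command and treats any non-zero exit status as a signal to start recovery. The function name `run_and_check` is hypothetical, and the sketch assumes the launcher reports failure only through its exit status.

```python
import subprocess
from typing import Sequence, Tuple


def run_and_check(cmd: Sequence[str]) -> Tuple[int, bool]:
    """Run cmd to completion; return (exit_status, needs_recovery)."""
    completed = subprocess.run(cmd)
    # Any non-zero status is taken to mean the job did not complete
    # successfully, so post-job recovery steps should run.
    return completed.returncode, completed.returncode != 0
```

A caller might invoke it as `run_and_check(["mpirun", "-np", "4", "./app"])`, where `./app` stands in for the real executable. If the launcher returned 0 after a recovered failure, this wrapper would skip recovery, which is why the choice of exit status matters to these users.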
On Feb 4, 2011, at 11:12 AM, Darius Buntinas wrote:
> What does the standard suggest for the case when different processes return different return codes? Can't we use the same approach?
> On Feb 4, 2011, at 8:19 AM, Joshua Hursey wrote:
>> So this is not really appropriate for the MPI standard language, but more of a user experience question. In fact this is a much larger question that implementations have to struggle with already.
>> If a process fails in the application (either by external causes or by calling MPI_Abort), what should 'mpirun' return as its exit status?
>> If the application intends to handle the failure and continue running after recovery, its users may expect 'mpirun' to return '0' (success) as long as MPI_Finalize is called in all remaining processes. But if no process calls MPI_Finalize (because each either called MPI_Abort or terminated abnormally), they may expect it to return a non-zero value, probably one of the values passed to MPI_Abort, if possible. Of course, there is the case where the failure occurs during MPI_Finalize, in which the MPI implementation may or may not be able to act consistently, depending on the timing of the failure notification.
>> I was mucking around in this code in the Open MPI prototype, and thought I would get the opinions of the group as I move forward.
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org