[Mpi3-ft] New version of the RTS proposal

Josh Hursey jjhursey at open-mpi.org
Wed Nov 9 08:22:04 CST 2011

I suspect that maybe I've confused myself thinking about this too
much, and maybe it is best to talk at the teleconf today about this.

I guess what was troubling me was in the scenario where MPI_INIT
cannot complete successfully due to process failure (e.g., it messed
up the connection establishment in a way that cannot be recovered). So
MPI_Init bails out with an error. What can the MPI application do? It
is not allowed to call MPI_Init again, and the state of the internal
library is non-functional in this case.

So in the example above, I would think that the MPI implementation
would terminate the calling process because the user is left with no
recourse in managing that scenario. The user may not even be able to
set an error handler or call MPI_Finalize.

However, if MPI_Init found a process failure during initialization, and
was able to successfully set up the library in spite of the failure,
then I think we are in the safe zone for the user. We need MPI_Init to
return success and raise an error at the next MPI operation.

So we previously said that:
 "if a process failure occurs before or during MPI_INIT then MPI_INIT
should try to raise an error and not abort by default."
If MPI_Init raises an error, then what does that mean to the
application? If it is PROC_FAIL_STOP, does that mean that the MPI
library is initialized, but detected a process failure? Or does it
mean that the MPI library is not set up because it returned an error
(the operation failed)?

It might be better to say that MPI_INIT will -not- return an error of
the class PROC_FAIL_STOP if a process failure is detected during
MPI_INIT but the library initialized properly. This error is delayed
until the next MPI call.
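
To make the delayed-error idea concrete, here is a toy sketch in C. It is
not the MPI API and not proposal text: toy_lib, toy_init, toy_operation and
the TOY_* codes are hypothetical names made up for illustration, and
TOY_ERR_PROC_FAIL_STOP only stands in for the PROC_FAIL_STOP class. Two
further assumptions: the pending failure is reported once and then cleared,
and the unrecoverable case returns an error code where a real implementation
would more likely terminate the process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes for this toy model (not real MPI constants). */
enum {
    TOY_SUCCESS = 0,
    TOY_ERR_PROC_FAIL_STOP = 1, /* stand-in for the PROC_FAIL_STOP class */
    TOY_ERR_INIT_FAILED = 2     /* library could not be set up at all */
};

/* Hypothetical library state. */
typedef struct {
    bool initialized;     /* internal state is usable */
    bool failure_pending; /* a failure was detected but not yet reported */
} toy_lib;

/* Models the init call. If setup can still complete despite a peer
 * failure, return success and remember the failure for later.
 * If setup cannot complete, report that the library is unusable. */
int toy_init(toy_lib *lib, bool peer_failed, bool setup_possible)
{
    assert(lib != NULL);
    lib->initialized = false;
    lib->failure_pending = false;
    if (!setup_possible) {
        return TOY_ERR_INIT_FAILED;
    }
    lib->initialized = true;
    lib->failure_pending = peer_failed;
    return TOY_SUCCESS;
}

/* Models any later operation. A failure seen during init is raised here,
 * on the first call after init (assumption: reported once, then cleared). */
int toy_operation(toy_lib *lib)
{
    assert(lib != NULL);
    if (!lib->initialized) {
        return TOY_ERR_INIT_FAILED;
    }
    if (lib->failure_pending) {
        lib->failure_pending = false;
        return TOY_ERR_PROC_FAIL_STOP;
    }
    return TOY_SUCCESS;
}
```

With this model, a successful return from toy_init always means the library
is usable, and the failure class only ever comes from a later call, so the
two meanings of an error from init do not overlap.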

... But then what is the difference between detecting the process
failure during MPI_INIT versus just afterward? Maybe I'm just troubled
by the idea that MPI_INIT can return an error, but still have
completed successfully.

What do you all think?

-- Josh

On Tue, Nov 8, 2011 at 4:00 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> I think what we have doesn't require it to complete successfully.  It says MPI_INIT should try to raise an error and not abort by default.  Then the advice to implementors says a critical error can make it abort.
> However, I think maybe we should say that MPI_Init is not required to raise an error simply because it detected a process failure.  It should raise an error (and give an error on the next non-errhandler function blah blah...) if it is unable to initialize the library due to a process failure.
> Or is this what you were suggesting?
> -d
> On Nov 7, 2011, at 12:58 PM, Josh Hursey wrote:
>> * We need to review 17.5.6: What if MPI_INIT does not internally
>> complete successfully due to process failure? The text seems to assume
>> that MPI_INIT will always be able to complete successfully in the
>> presence of failure. Maybe we should state that 'if it is able to
>> complete successfully, then it should even in the presence of
>> failure'?

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
