[Mpi3-ft] MPI_Init / MPI_Finalize

Wed Aug 25 22:52:32 CDT 2010

What defines "connected"?  MPI_FINALIZE isn't collective across MPI_COMM_WORLD, as processes might never communicate with one another.  Even if they do, communication may not require a connection, so they may never be connected.

It seems to me there might be enough wiggle room in the standard to allow MPI_Finalize to not be collective at all?

-Fab

Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 15:06:38

> 
> Josh:
> 
> On p293 of the 2.2 standard, it says "MPI_FINALIZE is collective
> over all connected processes." I don't know that the call being
> collective changes your analysis but your statement that the
> call is not collective was incorrect...
> 
> Bronis
> 
> 
> On Wed, 25 Aug 2010, Joshua Hursey wrote:
> 
>> During the discussion of the run-through stabilization proposal today
>> on the teleconf, we spent a while discussing the expected behavior of
>> MPI_Init and MPI_Finalize in the presence of process failures. I would
>> like to broaden the discussion a bit to help pin down the expected
>> behavior.
>> 
>> MPI_Init():
>> -----------
>> Problem: If a process fails before or during MPI_Init, what should the
>> MPI implementation do?
>> 
>> The current standard says nothing about the return value of
>> MPI_Init() (Ch. 8.7). To the greatest possible extent the application
>> should not be put in danger if it wishes to ignore errors (assumes
>> MPI_ERRORS_ARE_FATAL), so returning an error from this function (in
>> contrast to aborting the job) might be dangerous. However, if the
>> application is prepared to handle process failures, it is unable to
>> communicate that information to the MPI implementation until after the
>> completion of MPI_Init().
>> 
>> So a few solutions were presented, each with pros and cons (please
>> fill in if I missed any):
>> 
>> 1) If a process fails in MPI_Init() (default error handler is
>> MPI_ERRORS_ARE_FATAL) then the entire job is aborted (similar to
>> calling MPI_Abort on MPI_COMM_WORLD).
>> 
>> 2) If a process fails in MPI_Init() the MPI implementation will
>> return an appropriate error code/class (e.g., MPI_ERR_RANK_FAIL_STOP),
>> and all subsequent calls into the MPI implementation will return the
>> error class MPI_ERR_OTHER (should we create an MPI_ERR_NOT_ACTIVE?).
>> Applications should eventually notice the error and terminate.
>> 
>> 3) Allow the application to register only the MPI_ERRORS_RETURN
>> error handler on MPI_COMM_WORLD before MPI_Init() using the
>> MPI_Errhandler_set() function. Errors that occur before the
>> MPI_Errhandler_set() call are fatal. Errors afterward, including
>> those during MPI_Init(), are not fatal.
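
The three options above differ mainly in what a surviving process observes when a peer fails before or during MPI_Init. A minimal sketch of that decision table follows, in plain Python with no MPI calls. The names InitPolicy, Outcome, and init_failure_outcome are hypothetical and exist only for this illustration; the reading of option 3 (a failure before the handler is registered aborts the job) is taken from the text above.

```python
from enum import Enum


class InitPolicy(Enum):
    # Hypothetical labels for the three options listed above; not MPI symbols.
    ABORT_JOB = 1            # option 1: the whole job is aborted
    RETURN_ERROR = 2         # option 2: MPI_Init returns an error code/class
    PRE_INIT_ERRHANDLER = 3  # option 3: MPI_ERRORS_RETURN set before MPI_Init


class Outcome(Enum):
    JOB_ABORTED = "job aborted"
    ERROR_RETURNED = "MPI_Init returns an error"


def init_failure_outcome(policy, handler_registered=False):
    """Outcome seen by a surviving process when a peer fails around MPI_Init.

    handler_registered only matters for option 3: it says whether
    MPI_ERRORS_RETURN was registered before the failure happened.
    """
    if policy is InitPolicy.ABORT_JOB:
        return Outcome.JOB_ABORTED
    if policy is InitPolicy.RETURN_ERROR:
        return Outcome.ERROR_RETURNED
    if policy is InitPolicy.PRE_INIT_ERRHANDLER:
        if handler_registered:
            return Outcome.ERROR_RETURNED
        return Outcome.JOB_ABORTED
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    # Print the full decision table for all policies and both handler states.
    for policy in InitPolicy:
        for registered in (False, True):
            outcome = init_failure_outcome(policy, registered)
            print(f"{policy.name:20} handler_registered={registered!s:5} -> {outcome.value}")
```

The model makes one point visible: only option 3 lets the application choose between the two outcomes, and only for failures that happen after the handler has been registered.
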
>> 
>> In the cases where MPI_Init() returns MPI_ERR_RANK_FAIL_STOP to
>> indicate a process failure, is the library usable or not? If the
>> application can continue running through the failure, then the MPI
>> library should still be usable, thus MPI_Init() must be fault tolerant
>> in its initialization to be able to handle process failures. If the MPI
>> implementation finds itself in trouble and cannot continue it should
>> return MPI_ERR_CANNOT_CONTINUE from all subsequent calls including
>> MPI_Init, if possible.
>> 
>> 
>> MPI_Finalize():
>> ---------------
>> Problem: If a process fails before or during MPI_Finalize (and the
>> error handler is not MPI_ERRORS_ARE_FATAL), what should this function
>> return? Should that return value be consistent to all processes?
>> 
>> To preserve locality of fault handling, a local process should not be
>> explicitly forced to recognize the failure of a peer process that it
>> never interacts with, either directly (e.g., point-to-point) or
>> indirectly (e.g., collective). So MPI_Finalize should be fault tolerant
>> and keep trying to complete even in the presence of failures.
>> 
>> MPI_Finalize is not required to be a collective operation, though it
>> is often implemented that way. An implementation may need to delay the
>> return from MPI_Finalize until its role in the failure information
>> distribution channel is complete. But we should not require a
>> multi-phase commit protocol to ensure that everyone either succeeds or
>> returns some error. Implementations may do so internally in order to
>> ensure that MPI_Finalize does not hang.
>> 
>> If MPI_Finalize returns an error (say MPI_ERR_RANK_FAIL_STOP
>> indicating a 'new to this rank' failure), what good is this information
>> to the application? It cannot query for which rank(s) failed since MPI
>> has been finalized. Nor can it initiate recovery. The best it could do
>> is assume that all other processes failed and take local action.
>> 
>> 
>> MPI_Finalize: MPI_COMM_WORLD process rank 0:
>> --------------------------------------------
>> In chapter 8, Example 8.7 illustrates that "Although it is not
>> required that all processes return from MPI_Finalize, it is required
>> that at least process 0 in MPI_COMM_WORLD return, so that users can
>> know that the MPI portion of the computation is over."
>> 
>> We deduced that the reasoning for this explanation was to allow for
>> MPI implementations that create and destroy MPI processes during
>> init/finalize from rank 0. Or worded differently, rank 0 is the only
>> rank that can be assumed to exist before MPI_Init and after
>> MPI_Finalize.
>> 
>> Problem: So what if rank 0 fails at some point during the computation
>> (or just some point during MPI_Finalize)?
>> 
>> In the proposal, I added an "advice to users" telling them not to
>> depend on any specific rank existing before MPI_Init or after
>> MPI_Finalize. So, in a faulty environment, the example will produce
>> incorrect results under certain failure scenarios (e.g., failure of
>> rank 0).
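
The failure scenario can be made concrete with a small model of the Example 8.7 pattern, in which only rank 0 does work after MPI_Finalize. This is plain Python with no MPI calls; ranks_doing_post_finalize_work is a hypothetical helper, and returned_ranks stands for the set of MPI_COMM_WORLD ranks that actually return from MPI_Finalize.

```python
def ranks_doing_post_finalize_work(returned_ranks):
    """Model of the Example 8.7 pattern: only rank 0 acts after MPI_Finalize.

    returned_ranks: iterable of MPI_COMM_WORLD ranks that returned from
    MPI_Finalize (failed ranks are simply absent).
    """
    return [rank for rank in sorted(set(returned_ranks)) if rank == 0]


if __name__ == "__main__":
    # Failure-free run: rank 0 returns and performs the post-finalize step.
    print(ranks_doing_post_finalize_work([0, 1, 2, 3]))
    # Rank 0 failed: every survivor skips the post-finalize step.
    print(ranks_doing_post_finalize_work([1, 2, 3]))
```

When rank 0 is missing, nobody performs the post-finalize step, which is the incorrect result the advice to users is meant to warn about.
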
>> 
>> In an MPI environment that depends on rank 0 for process creation and
>> destruction, the failure of rank 0 is (should be?) critical and the MPI
>> implementation will either abort the job or return
>> MPI_ERR_CANNOT_CONTINUE from all calls to the MPI implementation. So we
>> believe that the advice to users was a sufficient addition to this
>> section. What do others think?
>> 
>> 
>> So MPI_Init seems to be a more complex issue than MPI_Finalize. What
>> do folks think about the presented problems and possible solutions? Are
>> there other issues not mentioned here that we should be addressing?
>> 
>> -- Josh
>> 
>> Run-Through Stabilization Proposal:
>>  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
>> 
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://www.cs.indiana.edu/~jjhursey
>> 
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>> 
>> 