[Mpi3-ft] MPI_Init / MPI_Finalize
Fab Tillier
ftillier at microsoft.com
Wed Aug 25 23:15:45 CDT 2010
Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 21:08:36
>
> Fab:
>
> There is no wiggle room. MPI_FINALIZE is collective across
> MPI_COMM_WORLD. I do not understand why you would say otherwise.
> Here is more of the passage I was quoting:
>
> -----------------
>
> MPI_FINALIZE is collective over all connected processes. If no processes
> were spawned, accepted or connected then this means over MPI_COMM_WORLD;
Ahh, I missed this part, sorry.
-Fab
> otherwise it is collective over the union of all processes that have
> been and continue to be connected, as explained in Section Releasing
> Connections on page Releasing Connections.
>
> -----------------
>
> The "connected" terminology is used to handle dynamic process
> management issues, for which the set of all processes cannot
> easily be defined in terms of a single communicator.
>
> Bronis
>
>
>
>
> On Wed, 25 Aug 2010, Fab Tillier wrote:
>
>> What defines "connected"? MPI_FINALIZE isn't collective across
>> MPI_COMM_WORLD, as processes might never communicate with one another.
>> Even if they do, communication may not require a connection, so they
>> may never be connected.
>>
>> It seems to me there might be enough wiggle room in the standard to
>> allow MPI_Finalize to not be collective at all?
>>
>> -Fab
>>
>> Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 15:06:38
>>
>>>
>>> Josh:
>>>
>>> On p293 of the 2.2 standard, it says "MPI_FINALIZE is collective
>>> over all connected processes." I don't know that the call being
>>> collective changes your analysis but your statement that the
>>> call is not collective was incorrect...
>>>
>>> Bronis
>>>
>>>
>>> On Wed, 25 Aug 2010, Joshua Hursey wrote:
>>>
>>>> During the discussion of the run-though stabilization proposal today
>>>> on the teleconf, we spent a while discussing the expected behavior of
>>>> MPI_Init and MPI_Finalize in the presence of process failures. I
>>>> would like to broaden the discussion a bit to help pin down the
>>>> expected behavior.
>>>>
>>>> MPI_Init():
>>>> -----------
>>>> Problem: If a process fails before or during MPI_Init, what should
>>>> the MPI implementation do?
>>>>
>>>> The current standard says nothing about the return value of
>>>> MPI_Init() (Ch. 8.7). To the greatest possible extent, an
>>>> application that wishes to ignore errors (i.e., assumes
>>>> MPI_ERRORS_ARE_FATAL) should not be put in danger, so returning an
>>>> error from this function (in contrast to aborting the job) might
>>>> be dangerous. However, if the application is prepared to handle
>>>> process failures, it has no way to communicate that to the MPI
>>>> implementation until after the completion of MPI_Init().
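To make the chicken-and-egg concrete, here is a minimal sketch using
only standard MPI-2.2 calls; the handler that would make errors
non-fatal cannot be installed until MPI_Init() has already returned:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        /* If a process fails here, the default MPI_ERRORS_ARE_FATAL
         * handler is still in effect, so the job aborts before the
         * application can declare that it tolerates failures. */
        MPI_Init(&argc, &argv);

        /* Only now can the application ask for errors to be returned
         * instead of being fatal; by then it is too late for failures
         * that occurred during initialization itself. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* ... application ... */

        MPI_Finalize();
        return 0;
    }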
>>>>
>>>> So a couple of solutions were presented, each with pros and cons
>>>> (please fill in if I missed any):
>>>>
>>>> 1) If a process fails in MPI_Init() (default error handler is
>>>> MPI_ERRORS_ARE_FATAL), then the entire job is aborted (similar to
>>>> calling MPI_Abort on MPI_COMM_WORLD).
>>>>
>>>> 2) If a process fails in MPI_Init(), the MPI implementation will
>>>> return an appropriate error code/class (e.g.,
>>>> MPI_ERR_RANK_FAIL_STOP), and all subsequent calls into the MPI
>>>> implementation will return the error class MPI_ERR_OTHER (or
>>>> should we create an MPI_ERR_NOT_ACTIVE?). Applications should
>>>> eventually notice the error and terminate.
>>>>
>>>> 3) Allow the application to register only the MPI_ERRORS_RETURN
>>>> handler on MPI_COMM_WORLD before MPI_Init() using the
>>>> MPI_Errhandler_set() function. Errors that occur before the
>>>> MPI_Errhandler_set() call are fatal; errors afterward, including
>>>> during MPI_Init(), are not.
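A sketch of what option 3 would look like from the application's side.
Note that MPI-2.2 forbids calling MPI_Errhandler_set() before
MPI_Init(), so the pre-init call below is hypothetical and only makes
sense under the proposed semantics:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        /* Hypothetical under option 3: MPI-2.2 does not permit any
         * MPI call before MPI_Init, so this ordering is proposal-only. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* A process failure during startup is now reported as a
         * return code instead of aborting the whole job. */
        if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
            /* record the startup failure and decide locally whether
             * to continue */
        }

        MPI_Finalize();
        return 0;
    }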
>>>>
>>>> In the cases where MPI_Init() returns MPI_ERR_RANK_FAIL_STOP to
>>>> indicate a process failure, is the library usable or not? If the
>>>> application can continue running through the failure, then the MPI
>>>> library should still be usable; thus MPI_Init() must be fault
>>>> tolerant in its initialization in order to handle process
>>>> failures. If the MPI implementation finds itself in trouble and
>>>> cannot continue, it should return MPI_ERR_CANNOT_CONTINUE from all
>>>> subsequent calls, including MPI_Init if possible.
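A sketch of how an application might act on those two outcomes.
MPI_ERR_RANK_FAIL_STOP and MPI_ERR_CANNOT_CONTINUE are error classes
from the run-through stabilization proposal, not MPI-2.2:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rc = MPI_Init(&argc, &argv);

        if (rc == MPI_ERR_RANK_FAIL_STOP) {
            /* Proposed class: a peer failed during startup, but the
             * library initialized fault-tolerantly and is usable. */
        } else if (rc == MPI_ERR_CANNOT_CONTINUE) {
            /* Proposed class: the implementation cannot continue;
             * every subsequent MPI call returns this class as well. */
            return 1;
        }

        /* ... run through the failure ... */

        MPI_Finalize();
        return 0;
    }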
>>>>
>>>>
>>>> MPI_Finalize():
>>>> ---------------
>>>> Problem: If a process fails before or during MPI_Finalize (and the
>>>> error handler is not MPI_ERRORS_ARE_FATAL), what should this
>>>> function return? Should that return value be consistent across all
>>>> processes?
>>>>
>>>> To preserve locality of fault handling, a local process should not
>>>> be explicitly forced to recognize the failure of a peer process
>>>> that it never interacts with, either directly (e.g.,
>>>> point-to-point) or indirectly (e.g., collective). So MPI_Finalize
>>>> should be fault tolerant and keep trying to complete even in the
>>>> presence of failures.
>>>>
>>>> MPI_Finalize is not required to be a collective operation, though
>>>> it is often implemented that way. An implementation may need to
>>>> delay the return from MPI_Finalize until its role in the failure
>>>> information distribution channel is complete. But we should not
>>>> require a multi-phase commit protocol to ensure that everyone
>>>> either succeeds or returns some error. Implementations may do so
>>>> internally in order to ensure that MPI_Finalize does not hang.
>>>>
>>>> If MPI_Finalize returns an error (say MPI_ERR_RANK_FAIL_STOP
>>>> indicating a 'new to this rank' failure), what good is this
>>>> information to the application? It cannot query for which rank(s)
>>>> failed since MPI has been finalized. Nor can it initiate recovery.
>>>> The best it could do is assume that all other processes failed and
>>>> take local action.
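In that case the usable pattern reduces to something like this sketch,
where a failure reported by MPI_Finalize drives purely local cleanup:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        /* ... application ... */

        if (MPI_Finalize() != MPI_SUCCESS) {
            /* MPI is finalized: we can no longer ask which rank(s)
             * failed, nor initiate recovery. Only local action is
             * left (flush output, write a local log entry, etc.). */
            fprintf(stderr, "MPI_Finalize reported a failure\n");
        }
        return 0;
    }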
>>>>
>>>>
>>>> MPI_Finalize: MPI_COMM_WORLD process rank 0:
>>>> --------------------------------------------
>>>> In chapter 8, Example 8.7 illustrates that "Although it is not
>>>> required that all processes return from MPI_Finalize, it is
>>>> required that at least process 0 in MPI_COMM_WORLD return, so that
>>>> users can know that the MPI portion of the computation is over."
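The pattern the example is getting at looks roughly like the sketch
below (a paraphrase, not the standard's exact code):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ... computation ... */

        MPI_Finalize();

        /* Only rank 0 is required to return from MPI_Finalize, so
         * only rank 0 can safely announce completion. */
        if (rank == 0)
            printf("MPI portion of the computation is over\n");
        return 0;
    }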
>>>>
>>>> We deduced that the reasoning behind this requirement was to allow
>>>> for MPI implementations that create and destroy MPI processes
>>>> during init/finalize from rank 0. Or, worded differently, rank 0
>>>> is the only rank that can be assumed to exist before MPI_Init and
>>>> after MPI_Finalize.
>>>>
>>>> Problem: So what if rank 0 fails at some point during the computation
>>>> (or just some point during MPI_Finalize)?
>>>>
>>>> In the proposal, I added an advice to users telling them not to
>>>> depend on any specific ranks existing before MPI_Init or after
>>>> MPI_Finalize. So, in a faulty environment, the example will
>>>> produce incorrect results under certain failure scenarios (e.g.,
>>>> failure of rank 0).
>>>>
>>>> In an MPI environment that depends on rank 0 for process creation and
>>>> destruction, the failure of rank 0 is (should be?) critical and the
>>>> MPI implementation will either abort the job or return
>>>> MPI_ERR_CANNOT_CONTINUE from all calls to the MPI implementation. So
>>>> we believe that the advice to users was a sufficient addition to this
>>>> section. What do others think?
>>>>
>>>>
>>>> So MPI_Init seems to be a more complex issue than MPI_Finalize. What
>>>> do folks think about the presented problems and possible solutions?
>>>> Are there other issues not mentioned here that we should be
>>>> addressing?
>>>>
>>>> -- Josh
>>>>
>>>> Run-Through Stabilization Proposal:
>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
>>>>
>>>> ------------------------------------
>>>> Joshua Hursey
>>>> Postdoctoral Research Associate
>>>> Oak Ridge National Laboratory
>>>> http://www.cs.indiana.edu/~jjhursey