[Mpi3-ft] MPI_Init / MPI_Finalize
Fab Tillier
ftillier at microsoft.com
Wed Aug 25 23:15:45 CDT 2010
Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 21:08:36
>
> Fab:
>
> There is no wiggle room. MPI_FINALIZE is collective across
> MPI_COMM_WORLD. I do not understand why you would say otherwise.
> Here is more of the passage I was quoting:
>
> -----------------
>
> MPI_FINALIZE is collective over all connected processes. If no processes
> were spawned, accepted or connected then this means over MPI_COMM_WORLD;
Ahh, I missed this part, sorry.
-Fab
> otherwise it is collective over the union of all processes that have
> been and continue to be connected, as explained in Section Releasing
> Connections on page Releasing Connections.
>
> -----------------
>
> The "connected" terminology is used to handle dynamic process
> management issues, for which the set of all processes cannot
> easily be defined in terms of a single communicator.
>
> Bronis
>
>
>
>
> On Wed, 25 Aug 2010, Fab Tillier wrote:
>
>> What defines "connected"? MPI_FINALIZE isn't collective across
>> MPI_COMM_WORLD, as processes might never communicate with one another.
>> Even if they do, communication may not require a connection, so they
>> may never be connected.
>>
>> It seems to me there might be enough wiggle room in the standard to
>> allow MPI_Finalize to not be collective at all?
>>
>> -Fab
>>
>> Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 15:06:38
>>
>>>
>>> Josh:
>>>
>>> On p293 of the 2.2 standard, it says "MPI_FINALIZE is collective
>>> over all connected processes." I don't know that the call being
>>> collective changes your analysis but your statement that the
>>> call is not collective was incorrect...
>>>
>>> Bronis
>>>
>>>
>>> On Wed, 25 Aug 2010, Joshua Hursey wrote:
>>>
>>>> During the discussion of the run-though stabilization proposal today
>>>> on the teleconf, we spent a while discussing the expected behavior of
>>>> MPI_Init and MPI_Finalize in the presence of process failures. I
>>>> would like to broaden the discussion a bit to help pin down the
>>>> expected behavior.
>>>>
>>>> MPI_Init():
>>>> -----------
>>>> Problem: If a process fails before or during MPI_Init, what should
>>>> the MPI implementation do?
>>>>
>>>> The current standard says nothing about the return value of
>>>> MPI_Init() (Ch. 8.7). To the greatest possible extent, an
>>>> application that wishes to ignore errors (i.e., assumes
>>>> MPI_ERRORS_ARE_FATAL) should not be put in danger, so returning an
>>>> error from this function (in contrast to aborting the job) might
>>>> be dangerous. However, if the application is prepared to handle
>>>> process failures, it has no way to communicate that to the MPI
>>>> implementation until after the completion of MPI_Init().
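To make the chicken-and-egg concrete, here is a minimal sketch using
only standard MPI-2.2 calls; the handler that would make errors
non-fatal cannot be installed until MPI_Init() has already returned:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        /* If a process fails here, the default MPI_ERRORS_ARE_FATAL
         * handler is still in effect, so the job aborts before the
         * application can declare that it tolerates failures. */
        MPI_Init(&argc, &argv);

        /* Only now can the application ask for errors to be returned
         * instead of being fatal; by then it is too late for failures
         * that occurred during initialization itself. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* ... application ... */

        MPI_Finalize();
        return 0;
    }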
>>>>
>>>> So a couple of solutions were presented, each with pros and cons
>>>> (please fill in if I missed any):
>>>>
>>>> 1) If a process fails in MPI_Init() (default error handler is
>>>> MPI_ERRORS_ARE_FATAL), then the entire job is aborted (similar to
>>>> calling MPI_Abort on MPI_COMM_WORLD).
>>>>
>>>> 2) If a process fails in MPI_Init(), the MPI implementation will
>>>> return an appropriate error code/class (e.g.,
>>>> MPI_ERR_RANK_FAIL_STOP), and all subsequent calls into the MPI
>>>> implementation will return the error class MPI_ERR_OTHER (or
>>>> should we create an MPI_ERR_NOT_ACTIVE?). Applications should
>>>> eventually notice the error and terminate.
>>>>
>>>> 3) Allow the application to register only the MPI_ERRORS_RETURN
>>>> handler on MPI_COMM_WORLD before MPI_Init() using the
>>>> MPI_Errhandler_set() function. Errors that occur before the
>>>> MPI_Errhandler_set() call are fatal; errors afterward, including
>>>> during MPI_Init(), are not.
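A sketch of what option 3 would look like from the application's side.
Note that MPI-2.2 forbids calling MPI_Errhandler_set() before
MPI_Init(), so the pre-init call below is hypothetical and only makes
sense under the proposed semantics:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        /* Hypothetical under option 3: MPI-2.2 does not permit any
         * MPI call before MPI_Init, so this ordering is proposal-only. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* A process failure during startup is now reported as a
         * return code instead of aborting the whole job. */
        if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
            /* record the startup failure and decide locally whether
             * to continue */
        }

        MPI_Finalize();
        return 0;
    }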
>>>>
>>>> In the cases where MPI_Init() returns MPI_ERR_RANK_FAIL_STOP to
>>>> indicate a process failure, is the library usable or not? If the
>>>> application can continue running through the failure, then the MPI
>>>> library should still be usable; thus MPI_Init() must be fault
>>>> tolerant in its initialization in order to handle process
>>>> failures. If the MPI implementation finds itself in trouble and
>>>> cannot continue, it should return MPI_ERR_CANNOT_CONTINUE from all
>>>> subsequent calls, including MPI_Init if possible.
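A sketch of how an application might act on those two outcomes.
MPI_ERR_RANK_FAIL_STOP and MPI_ERR_CANNOT_CONTINUE are error classes
from the run-through stabilization proposal, not MPI-2.2:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rc = MPI_Init(&argc, &argv);

        if (rc == MPI_ERR_RANK_FAIL_STOP) {
            /* Proposed class: a peer failed during startup, but the
             * library initialized fault-tolerantly and is usable. */
        } else if (rc == MPI_ERR_CANNOT_CONTINUE) {
            /* Proposed class: the implementation cannot continue;
             * every subsequent MPI call returns this class as well. */
            return 1;
        }

        /* ... run through the failure ... */

        MPI_Finalize();
        return 0;
    }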
>>>>
>>>>
>>>> MPI_Finalize():
>>>> ---------------
>>>> Problem: If a process fails before or during MPI_Finalize (and the
>>>> error handler is not MPI_ERRORS_ARE_FATAL), what should this
>>>> function return? Should that return value be consistent across all
>>>> processes?
>>>>
>>>> To preserve locality of fault handling, a local process should not
>>>> be explicitly forced to recognize the failure of a peer process
>>>> that it never interacts with, either directly (e.g.,
>>>> point-to-point) or indirectly (e.g., collective). So MPI_Finalize
>>>> should be fault tolerant and keep trying to complete even in the
>>>> presence of failures.
>>>>
>>>> MPI_Finalize is not required to be a collective operation, though
>>>> it is often implemented that way. An implementation may need to
>>>> delay the return from MPI_Finalize until its role in the failure
>>>> information distribution channel is complete. But we should not
>>>> require a multi-phase commit protocol to ensure that everyone
>>>> either succeeds or returns some error. Implementations may do so
>>>> internally in order to ensure that MPI_Finalize does not hang.
>>>>
>>>> If MPI_Finalize returns an error (say MPI_ERR_RANK_FAIL_STOP
>>>> indicating a 'new to this rank' failure), what good is this
>>>> information to the application? It cannot query for which rank(s)
>>>> failed since MPI has been finalized. Nor can it initiate recovery.
>>>> The best it could do is assume that all other processes failed and
>>>> take local action.
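In that case the usable pattern reduces to something like this sketch,
where a failure reported by MPI_Finalize drives purely local cleanup:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        /* ... application ... */

        if (MPI_Finalize() != MPI_SUCCESS) {
            /* MPI is finalized: we can no longer ask which rank(s)
             * failed, nor initiate recovery. Only local action is
             * left (flush output, write a local log entry, etc.). */
            fprintf(stderr, "MPI_Finalize reported a failure\n");
        }
        return 0;
    }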
>>>>
>>>>
>>>> MPI_Finalize: MPI_COMM_WORLD process rank 0:
>>>> --------------------------------------------
>>>> In chapter 8, Example 8.7 illustrates that "Although it is not
>>>> required that all processes return from MPI_Finalize, it is
>>>> required that at least process 0 in MPI_COMM_WORLD return, so that
>>>> users can know that the MPI portion of the computation is over."
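The pattern the example is getting at looks roughly like the sketch
below (a paraphrase, not the standard's exact code):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* ... computation ... */

        MPI_Finalize();

        /* Only rank 0 is required to return from MPI_Finalize, so
         * only rank 0 can safely announce completion. */
        if (rank == 0)
            printf("MPI portion of the computation is over\n");
        return 0;
    }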
>>>>
>>>> We deduced that the reasoning behind this requirement was to allow
>>>> for MPI implementations that create and destroy MPI processes
>>>> during init/finalize from rank 0. Or, worded differently, rank 0
>>>> is the only rank that can be assumed to exist before MPI_Init and
>>>> after MPI_Finalize.
>>>>
>>>> Problem: So what if rank 0 fails at some point during the computation
>>>> (or just some point during MPI_Finalize)?
>>>>
>>>> In the proposal, I added an advice to users telling them not to
>>>> depend on any specific ranks existing before MPI_Init or after
>>>> MPI_Finalize. So, in a faulty environment, the example will
>>>> produce incorrect results under certain failure scenarios (e.g.,
>>>> failure of rank 0).
>>>>
>>>> In an MPI environment that depends on rank 0 for process creation and
>>>> destruction, the failure of rank 0 is (should be?) critical and the
>>>> MPI implementation will either abort the job or return
>>>> MPI_ERR_CANNOT_CONTINUE from all calls to the MPI implementation. So
>>>> we believe that the advice to users was a sufficient addition to this
>>>> section. What do others think?
>>>>
>>>>
>>>> So MPI_Init seems to be a more complex issue than MPI_Finalize. What
>>>> do folks think about the presented problems and possible solutions?
>>>> Are there other issues not mentioned here that we should be
>>>> addressing?
>>>>
>>>> -- Josh
>>>>
>>>> Run-Through Stabilization Proposal:
>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
>>>>
>>>> ------------------------------------
>>>> Joshua Hursey
>>>> Postdoctoral Research Associate
>>>> Oak Ridge National Laboratory
>>>> http://www.cs.indiana.edu/~jjhursey