[Mpi3-ft] MPI_Init / MPI_Finalize
Joshua Hursey
jjhursey at open-mpi.org
Thu Aug 26 10:41:58 CDT 2010
On Aug 26, 2010, at 11:03 AM, Bronis R. de Supinski wrote:
>
> Josh:
>
> Re:
>> Bronis, thanks for the clarification on MPI_Finalize. I guess the
>> detail that I was trying to get at is that it is not specified whether
>> MPI_Finalize is a leave early/enter late kind of collective.
>>
>> So, at least the way I read it, it would be valid if one process enters
>> and exits MPI_Finalize before a different process enters MPI_Finalize
>> (similar to MPI_Bcast).
>
> Yes, that is correct. Since nothing is explicitly stated, it falls into
> the general category of collectives that users must treat as synchronizing
> (in terms of deadlock), although implementations need not synchronize.
>
>> This points to the reasoning behind the advice to implementors just
>> above the cited paragraph that suggests a barrier operation during
>> MPI_Finalize as one option. Am I interpreting this correctly?
>
> Yes.
>
>> One misleading sentence to me is the following on p291 just after the
>> definition of MPI_Finalize: "Each process must call MPI_FINALIZE before
>> it exits." The 'it' is slightly unclear. I think this is referring to
>> the process exiting, not the function. If 'it' referred to the function
>> then this would forbid some ranks from leaving the collective before
>> all have joined, thus requiring barrier semantics.
>
> I can see the ambiguity. I would be in favor of rewriting the
> sentence to eliminate the pronoun or at least the ambiguity.
> How about: "Before each process exits, it must call MPI_FINALIZE."
> I think that is clearly what was intended (not from the immediate
> context but from the general treatment of collectives).
I agree. I filed a ticket on Trac (not sure if I did it 100% correctly) so we can try to get this ambiguity fixed.
https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/227
>
>> For the fault tolerance discussion, I think the question is on the
>> consistency of the return code. Should the return code have commit/abort
>> properties? So if there is a success then all processes return success.
>> If some process fails during MPI_Finalize, should all processes return
>> some error code?
>
> Not having followed everything in the FT working group closely,
> it is hard for me to answer that in terms of what the group's
> thinking is. However, it is clear to me that you cannot require
> a collective to be synchronizing just to ensure return code
> agreement. I would feel that was antithetical to the primary
> goal of MPI. Perhaps a user could poll to find out if an error
> occurs subsequently. I suppose that would result in a call that
> could be made after MPI_FINALIZE. However, I think I would
> argue that once you call MPI_FINALIZE, you don't care...
>
>> It is unclear to me if this property (commit/abort return codes) is
>> really useful to the application. If all processes return success then a
>> process fails directly afterwards, the other processes have no way of
>> being notified. So what action could the remaining processes
>> realistically take in either the success or failure case?
>
> Further, those processes may not even exist. It is not really
> clear what happens to most processes after MPI_FINALIZE, which
> I felt was your primary point initially and with which I agree.
> Ultimately, why would we want to create any possible performance
> penalty to disseminate errors during/after MPI_FINALIZE?
>
>> So my suggestion is that we allow MPI_Finalize to preserve its loosely
>> synchronous, leave-early collective property as long as the rank is no
>> longer needed to continue interacting with any of the connected processes
>> (say for relaying error information). This means that some ranks may
>> return success while others return an error if a process fails during
>> finalize. MPI implementations may choose to provide applications with
>> commit/abort semantics, but are not required to do so.
>>
>> Does that sound reasonable for MPI_Finalize?
>
> Yes, I agree.
I also agree with your statements above. If the process calls MPI_Finalize then it is done with MPI and it shouldn't care if there were errors.
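To make the two return-code policies discussed above concrete, here is a toy model in plain C. It makes no MPI calls, and every name in it (FIN_SUCCESS, FIN_ERR_PEER_FAILED, rank_state, leave_time, fail_time) is made up for illustration; none of them come from the standard or the proposal. It also assumes a single failure at a known logical time, which a real implementation would not have.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes for this sketch; these are NOT MPI constants. */
enum { FIN_SUCCESS = 0, FIN_ERR_PEER_FAILED = 1 };

/* Logical time at which a rank would leave finalize (illustrative only). */
typedef struct {
    int leave_time;
} rank_state;

/*
 * Loose synchrony (leave early): a rank reports an error only if the
 * failure happened before that rank left finalize.
 * A negative fail_time means that no failure occurred.
 */
int loose_return_code(const rank_state *r, int fail_time)
{
    assert(r != NULL);
    if (fail_time >= 0 && fail_time < r->leave_time)
        return FIN_ERR_PEER_FAILED;
    return FIN_SUCCESS;
}

/*
 * Commit/abort: every rank reports the same code. In this model the
 * agreed code is an error as soon as any rank would have seen the
 * failure, so no rank can return before the slowest one decides.
 */
int agreed_return_code(const rank_state *ranks, size_t n, int fail_time)
{
    assert(ranks != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        if (loose_return_code(&ranks[i], fail_time) != FIN_SUCCESS)
            return FIN_ERR_PEER_FAILED;
    }
    return FIN_SUCCESS;
}
```

For three ranks leaving at times 1, 5, and 9 and a failure at time 4, the loose rule gives success on the first rank and an error on the other two, while the agreed rule gives an error everywhere. The price of the agreed rule is that the first rank cannot really leave at time 1, since it has to wait for the outcome.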
Thanks,
Josh
>
> Bronis
>
>> Thanks,
>> Josh
>>
>> On Aug 26, 2010, at 12:15 AM, Fab Tillier wrote:
>>
>>> Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 21:08:36
>>>
>>>>
>>>> Fab:
>>>>
>>>> There is no wiggle room. MPI_FINALIZE is collective across
>>>> MPI_COMM_WORLD. I do not understand why you would say otherwise.
>>>> Here is more of the passage I was quoting:
>>>>
>>>> -----------------
>>>>
>>>> MPI_FINALIZE is collective over all connected processes. If no processes
>>>> were spawned, accepted or connected then this means over MPI_COMM_WORLD;
>>>
>>> Ahh, I missed this part, sorry.
>>>
>>> -Fab
>>>
>>>> otherwise it is collective over the union of all processes that have
>>>> been and continue to be connected, as explained in Section Releasing
>>>> Connections.
>>>>
>>>> -----------------
>>>>
>>>> The "connected" terminology is used to handle dynamic process
>>>> management issues, for which the set of all processes cannot
>>>> easily be defined in terms of a single communicator.
>>>>
>>>> Bronis
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 25 Aug 2010, Fab Tillier wrote:
>>>>
>>>>> What defines "connected"? MPI_FINALIZE isn't collective across
>>>>> MPI_COMM_WORLD, as processes might never communicate with one another.
>>>>> Even if they do, communication may not require a connection, so they
>>>>> may never be connected.
>>>>>
>>>>> It seems to me there might be enough wiggle room in the standard to
>>>>> allow MPI_Finalize to not be collective at all?
>>>>>
>>>>> -Fab
>>>>>
>>>>> Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 15:06:38
>>>>>
>>>>>>
>>>>>> Josh:
>>>>>>
>>>>>> On p293 of the 2.2 standard, it says "MPI_FINALIZE is collective
>>>>>> over all connected processes." I don't know that the call being
>>>>>> collective changes your analysis but your statement that the
>>>>>> call is not collective was incorrect...
>>>>>>
>>>>>> Bronis
>>>>>>
>>>>>>
>>>>>> On Wed, 25 Aug 2010, Joshua Hursey wrote:
>>>>>>
>>>>>>> During the discussion of the run-through stabilization proposal today
>>>>>>> on the teleconf, we spent a while discussing the expected behavior of
>>>>>>> MPI_Init and MPI_Finalize in the presence of process failures. I
>>>>>>> would like to broaden the discussion a bit to help pin down the
>>>>>>> expected behavior.
>>>>>>>
>>>>>>> MPI_Init():
>>>>>>> -----------
>>>>>>> Problem: If a process fails before or during MPI_Init, what should
>>>>>>> the MPI implementation do?
>>>>>>>
>>>>>>> The current standard says nothing about the return value of
>>>>>>> MPI_Init() (Ch. 8.7). To the greatest possible extent the application
>>>>>>> should not be put in danger if it wishes to ignore errors (assumes
>>>>>>> MPI_ERRORS_ARE_FATAL), so returning an error from this function (in
>>>>>>> contrast to aborting the job) might be dangerous. However, if the
>>>>>>> application is prepared to handle process failures, it is unable to
>>>>>>> communicate that information to the MPI implementation until after
>>>>>>> the completion of MPI_Init().
>>>>>>>
>>>>>>> So a couple of solutions were presented each with pros and cons
>>>>>>> (please fill in if I missed any):
>>>>>>>
>>>>>>> 1) If a process fails in MPI_Init() (default error handler is
>>>>>>> MPI_ERRORS_ARE_FATAL) then the entire job is aborted (similar to
>>>>>>> calling MPI_Abort on MPI_COMM_WORLD).
>>>>>>>
>>>>>>> 2) If a process fails in MPI_Init() the MPI implementation will
>>>>>>> return an appropriate error code/class (e.g.,
>>>>>>> MPI_ERR_RANK_FAIL_STOP), and all subsequent calls into the MPI
>>>>>>> implementation will return the error class MPI_ERR_OTHER (should we
>>>>>>> create an MPI_ERR_NOT_ACTIVE?). Applications should eventually notice
>>>>>>> the error and terminate.
>>>>>>>
>>>>>>> 3) Allow the application to register only the MPI_ERRORS_RETURN
>>>>>>> handler on MPI_COMM_WORLD before MPI_Init() using the
>>>>>>> MPI_Errhandler_set() function. Errors that occur before the
>>>>>>> MPI_Errhandler_set() call are fatal. Errors afterward, including
>>>>>>> those during MPI_Init(), are not fatal.
>>>>>>>
>>>>>>> In the cases where MPI_Init() returns MPI_ERR_RANK_FAIL_STOP to
>>>>>>> indicate a process failure, is the library usable or not? If the
>>>>>>> application can continue running through the failure, then the MPI
>>>>>>> library should still be usable, thus MPI_Init() must be fault
>>>>>>> tolerant in its initialization to be able to handle process failures.
>>>>>>> If the MPI implementation finds itself in trouble and cannot continue
>>>>>>> it should return MPI_ERR_CANNOT_CONTINUE from all subsequent calls
>>>>>>> including MPI_Init, if possible.
>>>>>>>
>>>>>>>
>>>>>>> MPI_Finalize():
>>>>>>> ---------------
>>>>>>> Problem: If a process fails before or during MPI_Finalize (and the
>>>>>>> error handler is not MPI_ERRORS_ARE_FATAL), what should this function
>>>>>>> return? Should that return value be consistent across all processes?
>>>>>>>
>>>>>>> To preserve locality of fault handling, a local process should not be
>>>>>>> explicitly forced to recognize the failure of a peer process that
>>>>>>> it never interacts with, either directly (e.g., point-to-point) or
>>>>>>> indirectly (e.g., collective). So MPI_Finalize should be fault
>>>>>>> tolerant and keep trying to complete even in the presence of failures.
>>>>>>>
>>>>>>> MPI_Finalize is not required to be a collective operation, though it
>>>>>>> is often implemented that way. An implementation may need to delay
>>>>>>> the return from MPI_Finalize until its role in the failure
>>>>>>> information distribution channel is complete. But we should not
>>>>>>> require a multi-phase commit protocol to ensure that everyone either
>>>>>>> succeeds or returns some error. Implementations may do so internally
>>>>>>> in order to ensure that MPI_Finalize does not hang.
>>>>>>>
>>>>>>> If MPI_Finalize returns an error (say MPI_ERR_RANK_FAIL_STOP
>>>>>>> indicating a 'new to this rank' failure), what good is this
>>>>>>> information to the application? It cannot query for which rank(s)
>>>>>>> failed since MPI has been finalized. Nor can it initiate recovery.
>>>>>>> The best it could do is assume that all other processes failed and
>>>>>>> take local action.
>>>>>>>
>>>>>>>
>>>>>>> MPI_Finalize: MPI_COMM_WORLD process rank 0:
>>>>>>> --------------------------------------------
>>>>>>> In chapter 8, Example 8.7 illustrates that "Although it is not
>>>>>>> required that all processes return from MPI_Finalize, it is required
>>>>>>> that at least process 0 in MPI_COMM_WORLD return, so that users can
>>>>>>> know that the MPI portion of the computation is over."
>>>>>>>
>>>>>>> We deduced that the reasoning for this explanation was to allow for
>>>>>>> MPI implementations that create and destroy MPI processes during
>>>>>>> init/finalize from rank 0. Or worded differently, rank 0 is the only
>>>>>>> rank that can be assumed to exist before MPI_Init and after
>>>>>>> MPI_Finalize.
>>>>>>>
>>>>>>> Problem: So what if rank 0 fails at some point during the computation
>>>>>>> (or just some point during MPI_Finalize)?
>>>>>>>
>>>>>>> In the proposal, I added an advice to users to tell them to not
>>>>>>> depend on any specific ranks to exist before MPI_Init or after
>>>>>>> MPI_Finalize. So, in a faulty environment, the example will produce
>>>>>>> incorrect results under certain failure scenarios (e.g., failure of
>>>>>>> rank 0).
>>>>>>>
>>>>>>> In an MPI environment that depends on rank 0 for process creation and
>>>>>>> destruction, the failure of rank 0 is (should be?) critical and the
>>>>>>> MPI implementation will either abort the job or return
>>>>>>> MPI_ERR_CANNOT_CONTINUE from all calls to the MPI implementation. So
>>>>>>> we believe that the advice to users was a sufficient addition to this
>>>>>>> section. What do others think?
>>>>>>>
>>>>>>>
>>>>>>> So MPI_Init seems to be a more complex issue than MPI_Finalize. What
>>>>>>> do folks think about the presented problems and possible solutions?
>>>>>>> Are there other issues not mentioned here that we should be
>>>>>>> addressing?
>>>>>>>
>>>>>>> -- Josh
>>>>>>>
>>>>>>> Run-Through Stabilization Proposal:
>>>>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
>>>>>>>
>>>>>>> ------------------------------------
>>>>>>> Joshua Hursey
>>>>>>> Postdoctoral Research Associate
>>>>>>> Oak Ridge National Laboratory
>>>>>>> http://www.cs.indiana.edu/~jjhursey
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mpi3-ft mailing list
>>>>>>> mpi3-ft at lists.mpi-forum.org
>>>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>>
>>
>>
>>
>
------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey