[Mpi3-ft] MPI_Init / MPI_Finalize

Thu Aug 26 10:03:16 CDT 2010

Josh:

Re:
> Bronis, thanks for the clarification on MPI_Finalize. I guess the 
> detail that I was trying to get at is that it is not specified whether 
> MPI_Finalize is a leave early/enter late kind of collective.
>
> So, at least the way I read it, it would be valid if one process enters 
> and exits MPI_Finalize  before a different process enters MPI_Finalize 
> (similar to MPI_Bcast).

Yes, that is correct. Since nothing is explicitly stated, it falls into
the general category of collectives that users must treat as synchronizing
(in terms of deadlock), although implementations need not make them so.

> This points to the reasoning behind the advice to implementors just 
> above the cited paragraph that suggests a barrier operation during 
> MPI_Finalize as one option. Am I interpreting this correctly?

Yes.

> One misleading sentence to me is the following on p291 just after the 
> definition of MPI_Finalize: "Each process must call MPI_FINALIZE before 
> it exits." The 'it' is slightly unclear. I think this is referring to 
> the process exiting, not the function. If 'it' referred to the function 
> then this would disallow the collective to have some ranks leave the 
> collective before all have joined, so requiring a barrier semantic.

I can see the ambiguity. I would be in favor of rewriting the
sentence to eliminate the pronoun or at least the ambiguity.
How about: "Before each process exits, it must call MPI_FINALIZE."
I think that is clearly what was intended (not from the immediate
context but from the general treatment of collectives).

> For the fault tolerance discussion, I think the question is on the
> consistency of the return code. Should the return code have commit/abort
> properties? So if there is a success then all processes return success.
> If some process fails during MPI_Finalize, should all processes return
> some error code?

Not having followed everything in the FT working group closely,
it is hard for me to answer that in terms of what the group's
thinking is. However, it is clear to me that you cannot require
a collective to be synchronizing just to ensure return code
agreement. I would feel that was antithetical to the primary
goal of MPI. Perhaps a user could poll to find out if an error
occurs subsequently. I suppose that would result in a call that
could be made after MPI_FINALIZE. However, I think I would
argue that once you call MPI_FINALIZE, you don't care...

> It is unclear to me if this property (commit/abort return codes) is
> really useful to the application. If all processes return success and
> then a process fails directly afterwards, the other processes have no
> way of being notified. So what action could the remaining processes
> realistically take in either the success or failure case?

Further, those processes may not even exist. It is not really
clear what happens to most processes after MPI_FINALIZE, which
I felt was your primary point initially and with which I agree.
Ultimately, why would we want to create any possible performance
penalty to disseminate errors during/after MPI_FINALIZE?

> So my suggestion is that we allow MPI_Finalize to preserve its loosely
> synchronous, leave-early collective property, as long as the rank is no
> longer needed to continue interacting with any of the connected processes
> (say, for relaying error information). This means that some ranks may
> return success while others return an error if a process fails during
> finalize. MPI implementations may choose to provide applications with
> commit/abort semantics, but are not required to do so.
>
> Does that sound reasonable for MPI_Finalize?

Yes, I agree.
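
To make the difference between the two return-code behaviours concrete,
here is a toy model in Python. It is only a sketch and calls no MPI
routines. The function names, the leave-time dictionary, and the rule
that a rank notices a failure only if it happened before that rank left
finalize are all assumptions made for illustration; the error class name
is the one proposed in this thread.

```python
from typing import Dict, Optional

# Labels only; this toy model does not call MPI.
SUCCESS = "MPI_SUCCESS"
FAILURE = "MPI_ERR_RANK_FAIL_STOP"  # error class proposed in the thread


def loose_finalize(leave_times: Dict[int, float],
                   failure_time: Optional[float]) -> Dict[int, str]:
    """Leave-early model: each rank reports only what it knew on exit.

    Assumption: a surviving rank notices the failure only if the failure
    happened at or before the moment that rank left finalize.
    """
    codes = {}
    for rank, left_at in leave_times.items():
        noticed = failure_time is not None and failure_time <= left_at
        codes[rank] = FAILURE if noticed else SUCCESS
    return codes


def commit_abort_finalize(leave_times: Dict[int, float],
                          failure_time: Optional[float]) -> Dict[int, str]:
    """Agreement model: nobody returns until the slowest rank is done,
    so every rank reports the same outcome."""
    slowest = max(leave_times.values())
    failed = failure_time is not None and failure_time <= slowest
    outcome = FAILURE if failed else SUCCESS
    return {rank: outcome for rank in leave_times}


if __name__ == "__main__":
    # Three surviving ranks leave at t=1, 2, 5; another rank fails at t=3.
    leave_times = {0: 1.0, 1: 2.0, 2: 5.0}
    print(loose_finalize(leave_times, 3.0))
    print(commit_abort_finalize(leave_times, 3.0))
```

With these inputs the leave-early model lets ranks 0 and 1 return success
while rank 2 returns the error. The agreement model returns the error on
every rank, but only by holding ranks 0 and 1 until t=5, which is the
performance penalty questioned above.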

Bronis

> Thanks,
> Josh
>
> On Aug 26, 2010, at 12:15 AM, Fab Tillier wrote:
>
>> Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 21:08:36
>>
>>>
>>> Fab:
>>>
>>> There is no wiggle room. MPI_FINALIZE is collective across
>>> MPI_COMM_WORLD. I do not understand why you would say otherwise.
>>> Here is more of the passage I was quoting:
>>>
>>> -----------------
>>>
>>> MPI_FINALIZE is collective over all connected processes. If no processes
>>> were spawned, accepted or connected then this means over MPI_COMM_WORLD;
>>
>> Ahh, I missed this part, sorry.
>>
>> -Fab
>>
>>> otherwise it is collective over the union of all processes that have
>>> been and continue to be connected, as explained in Section
>>> "Releasing Connections".
>>>
>>> -----------------
>>>
>>> The "connected" terminology is used to handle dynamic process
>>> management issues, for which the set of all processes cannot
>>> easily be defined in terms of a single communicator.
>>>
>>> Bronis
>>>
>>>
>>>
>>>
>>> On Wed, 25 Aug 2010, Fab Tillier wrote:
>>>
>>>> What defines "connected"?  MPI_FINALIZE isn't collective across
>>>> MPI_COMM_WORLD, as processes might never communicate with one another.
>>>> Even if they do, communication may not require a connection, so they
>>>> may never be connected.
>>>>
>>>> It seems to me there might be enough wiggle room in the standard to
>>>> allow MPI_Finalize to not be collective at all?
>>>>
>>>> -Fab
>>>>
>>>> Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 15:06:38
>>>>
>>>>>
>>>>> Josh:
>>>>>
>>>>> On p293 of the 2.2 standard, it says "MPI_FINALIZE is collective
>>>>> over all connected processes." I don't know that the call being
>>>>> collective changes your analysis but your statement that the
>>>>> call is not collective was incorrect...
>>>>>
>>>>> Bronis
>>>>>
>>>>>
>>>>> On Wed, 25 Aug 2010, Joshua Hursey wrote:
>>>>>
>>>>>> During the discussion of the run-through stabilization proposal today
>>>>>> on the teleconf, we spent a while discussing the expected behavior of
>>>>>> MPI_Init and MPI_Finalize in the presence of process failures. I
>>>>>> would like to broaden the discussion a bit to help pin down the
>>>>>> expected behavior.
>>>>>>
>>>>>> MPI_Init():
>>>>>> -----------
>>>>>> Problem: If a process fails before or during MPI_Init, what should
>>>>>> the MPI implementation do?
>>>>>>
>>>>>> The current standard says nothing about the return value of
>>>>>> MPI_Init() (Ch. 8.7). To the greatest possible extent the application
>>>>>> should not be put in danger if it wishes to ignore errors (assumes
>>>>>> MPI_ERRORS_ARE_FATAL), so returning an error from this function (in
>>>>>> contrast to aborting the job) might be dangerous. However, if the
>>>>>> application is prepared to handle process failures, it is unable to
>>>>>> communicate that information to the MPI implementation until after
>>>>>> the completion of MPI_Init().
>>>>>>
>>>>>> So a couple of solutions were presented, each with pros and cons
>>>>>> (please fill in if I missed any):
>>>>>>
>>>>>> 1) If a process fails in MPI_Init() (default error handler is
>>>>>> MPI_ERRORS_ARE_FATAL) then the entire job is aborted (similar to
>>>>>> calling MPI_Abort on MPI_COMM_WORLD).
>>>>>>
>>>>>> 2) If a process fails in MPI_Init() the MPI implementation will
>>>>>> return an appropriate error code/class (e.g.,
>>>>>> MPI_ERR_RANK_FAIL_STOP), and all subsequent calls into the MPI
>>>>>> implementation will return the error class MPI_ERR_OTHER (should we
>>>>>> create an MPI_ERR_NOT_ACTIVE?). Applications should eventually notice
>>>>>> the error and terminate.
>>>>>>
>>>>>> 3) Allow the application to register only the MPI_ERRORS_RETURN
>>>>>> handle on MPI_COMM_WORLD before MPI_Init() using the
>>>>>> MPI_Errhandler_set() function. Errors that occur before the
>>>>>> MPI_Errhandler_set() call are fatal. Errors afterward, including
>>>>>> during MPI_Init(), are not fatal.
>>>>>>
>>>>>> In the cases where MPI_Init() returns MPI_ERR_RANK_FAIL_STOP to
>>>>>> indicate a process failure, is the library usable or not? If the
>>>>>> application can continue running through the failure, then the MPI
>>>>>> library should still be usable, thus MPI_Init() must be fault
>>>>>> tolerant in its initialization to be able to handle process failures.
>>>>>> If the MPI implementation finds itself in trouble and cannot continue
>>>>>> it should return MPI_ERR_CANNOT_CONTINUE from all subsequent calls
>>>>>> including MPI_Init, if possible.
>>>>>>
>>>>>>
>>>>>> MPI_Finalize():
>>>>>> ---------------
>>>>>> Problem: If a process fails before or during MPI_Finalize (and the
>>>>>> error handler is not MPI_ERRORS_ARE_FATAL), what should this function
>>>>>> return? Should that return value be consistent across all processes?
>>>>>>
>>>>>> To preserve locality of fault handling, a local process should not be
>>>>>> explicitly forced to recognize the failure of a peer process that
>>>>>> it never interacts with, either directly (e.g., point-to-point) or
>>>>>> indirectly (e.g., collective). So MPI_Finalize should be fault
>>>>>> tolerant and keep trying to complete even in the presence of failures.
>>>>>>
>>>>>> MPI_Finalize is not required to be a collective operation, though it
>>>>>> is often implemented that way. An implementation may need to delay
>>>>>> the return from MPI_Finalize until its role in the failure
>>>>>> information distribution channel is complete. But we should not
>>>>>> require a multi-phase commit protocol to ensure that everyone either
>>>>>> succeeds or returns some error. Implementations may do so internally
>>>>>> in order to ensure that MPI_Finalize does not hang.
>>>>>>
>>>>>> If MPI_Finalize returns an error (say MPI_ERR_RANK_FAIL_STOP
>>>>>> indicating a 'new to this rank' failure), what good is this
>>>>>> information to the application? It cannot query for which rank(s)
>>>>>> failed since MPI has been finalized. Nor can it initiate recovery.
>>>>>> The best it could do is assume that all other processes failed and
>>>>>> take local action.
>>>>>>
>>>>>>
>>>>>> MPI_Finalize: MPI_COMM_WORLD process rank 0:
>>>>>> --------------------------------------------
>>>>>> In chapter 8, Example 8.7 illustrates that "Although it is not
>>>>>> required that all processes return from MPI_Finalize, it is required
>>>>>> that at least process 0 in MPI_COMM_WORLD return, so that users can
>>>>>> know that the MPI portion of the computation is over."
>>>>>>
>>>>>> We deduced that the reasoning for this explanation was to allow for
>>>>>> MPI implementations that create and destroy MPI processes during
>>>>>> init/finalize from rank 0. Or worded differently, rank 0 is the only
>>>>>> rank that can be assumed to exist before MPI_Init and after
>>>>>> MPI_Finalize.
>>>>>>
>>>>>> Problem: So what if rank 0 fails at some point during the computation
>>>>>> (or just some point during MPI_Finalize)?
>>>>>>
>>>>>> In the proposal, I added advice to users telling them not to
>>>>>> depend on any specific ranks to exist before MPI_Init or after
>>>>>> MPI_Finalize. So, in a faulty environment, the example will produce
>>>>>> incorrect results under certain failure scenarios (e.g., failure of
>>>>>> rank 0).
>>>>>>
>>>>>> In an MPI environment that depends on rank 0 for process creation and
>>>>>> destruction, the failure of rank 0 is (should be?) critical and the
>>>>>> MPI implementation will either abort the job or return
>>>>>> MPI_ERR_CANNOT_CONTINUE from all calls to the MPI implementation. So
>>>>>> we believe that the advice to users was a sufficient addition to this
>>>>>> section. What do others think?
>>>>>>
>>>>>>
>>>>>> So MPI_Init seems to be a more complex issue than MPI_Finalize. What
>>>>>> do folks think about the presented problems and possible solutions?
>>>>>> Are there other issues not mentioned here that we should be
>>>>>> addressing?
>>>>>>
>>>>>> -- Josh
>>>>>>
>>>>>> Run-Through Stabilization Proposal:
>>>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
>>>>>>
>>>>>> ------------------------------------
>>>>>> Joshua Hursey
>>>>>> Postdoctoral Research Associate
>>>>>> Oak Ridge National Laboratory
>>>>>> http://www.cs.indiana.edu/~jjhursey
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> mpi3-ft mailing list
>>>>>> mpi3-ft at lists.mpi-forum.org
>>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft