[Mpi3-ft] MPI_Init / MPI_Finalize

Thu Aug 26 11:03:15 CDT 2010

Josh:

Your ticket looks fine to me. I suggest we just move it to
waiting for reviews and gather the four reviews (mine will
count as the first one).

Bronis

On Thu, 26 Aug 2010, Joshua Hursey wrote:

>
> On Aug 26, 2010, at 11:03 AM, Bronis R. de Supinski wrote:
>
>>
>> Josh:
>>
>> Re:
>>> Bronis, thanks for the clarification on MPI_Finalize. I guess the
>>> detail that I was trying to get at is that it is not specified whether
>>> MPI_Finalize is a leave early/enter late kind of collective.
>>>
>>> So, at least the way I read it, it would be valid if one process enters
>>> and exits MPI_Finalize  before a different process enters MPI_Finalize
>>> (similar to MPI_Bcast).
>>
>> Yes, that is correct. Since nothing is explicitly stated, it falls into
>> the general category of collectives that users must treat as synchronizing
>> (in terms of deadlock), although implementations need not synchronize.
>>
>>> This points to the reasoning behind the advice to implementors just
>>> above the cited paragraph that suggests a barrier operation during
>>> MPI_Finalize as one option. Am I interpreting this correctly?
>>
>> Yes.
>>
>>> One misleading sentence to me is the following on p291 just after the
>>> definition of MPI_Finalize: "Each process must call MPI_FINALIZE before
>>> it exits." The 'it' is slightly unclear. I think this is referring to
>>> the process exiting, not the function. If 'it' referred to the function
>>> then this would disallow the collective to have some ranks leave the
>>> collective before all have joined, so requiring a barrier semantic.
>>
>> I can see the ambiguity. I would be in favor of rewriting the
>> sentence to eliminate the pronoun or at least the ambiguity.
>> How about: "Before each process exits, it must call MPI_FINALIZE."
>> I think that is clearly what was intended (not from the immediate
>> context but from the general treatment of collectives).
>
> I agree. I filed a ticket on Trac (not sure if I did it 100% correctly) so we can try to get this ambiguity fixed.
>  https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/227
>
>>
>>> For the fault tolerance discussion, I think the question is about the
>>> consistency of the return code. Should the return code have commit/abort
>>> properties? That is, if there is a success, do all processes return
>>> success, and if some process fails during MPI_Finalize, do all
>>> processes return some error code?
>>
>> Not having followed everything in the FT working group closely,
>> it is hard for me to answer that in terms of what the group
>> thinking is. However, it is clear to me that you cannot require
>> a collective to be synchronizing just to ensure return code
>> agreement. I would feel that was antithetical to the primary
>> goal of MPI. Perhaps a user could poll to find out if an error
>> occurs subsequently. I suppose that would result in a call that
>> could be made after MPI_FINALIZE. However, I think I would
>> argue that once you call MPI_FINALIZE, you don't care...
>>
>>> It is unclear to me if this property (commit/abort return codes) is
>>> really useful to the application. If all processes return success and
>>> then a process fails directly afterwards, the other processes have no
>>> way of being notified. So what action could the remaining processes
>>> realistically take in either the success or the failure case?
>>
>> Further, those processes may not even exist. It is not really
>> clear what happens to most processes after MPI_FINALIZE, which
>> I felt was your primary point initially and with which I agree.
>> Ultimately, why would we want to create any possible performance
>> penalty to disseminate errors during/after MPI_FINALIZE?
>>
>>> So my suggestion is that we allow MPI_Finalize to keep its loosely
>>> synchronous, leave-early collective property: a rank may leave as soon
>>> as it is no longer needed to interact with any of the connected
>>> processes (say, for relaying error information). This means that some
>>> ranks may return success while others return an error if a process
>>> fails during finalize. MPI implementations may choose to provide
>>> applications with commit/abort semantics, but are not required to do so.
>>>
>>> Does that sound reasonable for MPI_Finalize?
>>
>> Yes, I agree.
>
> I also agree with your statements above. If the process calls MPI_Finalize then it is done with MPI and it shouldn't care if there were errors.
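
The difference between the two return-code policies discussed above can be sketched with a toy model. This is plain Python, not MPI code, and it is only an illustration: the function names, the numeric "leave times", and the rule that a rank reports an error only if the failure happened before it left are all assumptions made here. MPI_ERR_RANK_FAIL_STOP is the error class name used in the proposal.

```python
SUCCESS = "MPI_SUCCESS"
FAIL_STOP = "MPI_ERR_RANK_FAIL_STOP"  # error class name taken from the proposal


def loose_finalize(leave_times, fail_time=None):
    """Leave-early policy: a rank reports only failures it saw before leaving.

    leave_times maps rank -> time at which that rank returns from finalize.
    These times are hypothetical inputs, not something MPI exposes.
    """
    return {
        rank: FAIL_STOP if fail_time is not None and fail_time <= t else SUCCESS
        for rank, t in leave_times.items()
    }


def commit_abort_finalize(leave_times, fail_time=None):
    """Agreement policy: every rank returns the same code.

    No rank can return before the slowest one is ready, so the shared
    decision time is the latest leave time.
    """
    decision_time = max(leave_times.values())
    failed = fail_time is not None and fail_time <= decision_time
    code = FAIL_STOP if failed else SUCCESS
    return {rank: code for rank in leave_times}, decision_time


if __name__ == "__main__":
    times = {0: 1, 1: 5, 2: 9}
    print(loose_finalize(times, fail_time=4))
    print(commit_abort_finalize(times, fail_time=4))
```

With a failure at time 4, rank 0 has already left and reports success, while ranks 1 and 2 report the failure. The agreement variant gives a single answer, but only by holding rank 0 until time 9, which is the performance cost raised earlier in the thread.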
>
> Thanks,
> Josh
>
>>
>> Bronis
>>
>>> Thanks,
>>> Josh
>>>
>>> On Aug 26, 2010, at 12:15 AM, Fab Tillier wrote:
>>>
>>>> Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 21:08:36
>>>>
>>>>>
>>>>> Fab:
>>>>>
>>>>> There is no wiggle room. MPI_FINALIZE is collective across
>>>>> MPI_COMM_WORLD. I do not understand why you would say otherwise.
>>>>> Here is more of the passage I was quoting:
>>>>>
>>>>> -----------------
>>>>>
>>>>> MPI_FINALIZE is collective over all connected processes. If no processes
>>>>> were spawned, accepted or connected then this means over MPI_COMM_WORLD;
>>>>
>>>> Ahh, I missed this part, sorry.
>>>>
>>>> -Fab
>>>>
>>>>> otherwise it is collective over the union of all processes that have
>>>>> been and continue to be connected, as explained in Section "Releasing
>>>>> Connections".
>>>>>
>>>>> -----------------
>>>>>
>>>>> The "connected" terminology is used to handle dynamic process
>>>>> management issues, for which the set of all processes cannot
>>>>> easily be defined in terms of a single communicator.
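
A rough way to picture the "connected" set is as everything reachable from a process through a chain of shared communicators. The sketch below is plain Python, not MPI code; the function name, the representation of a communicator as a set of process labels, and the process labels themselves are assumptions made for illustration.

```python
from collections import deque


def finalize_group(start, communicators):
    """Return the set of processes connected to `start`.

    communicators is an iterable of sets of process labels. Two processes
    are treated as directly connected when they share a communicator, and
    as connected when a chain of such links joins them.
    """
    comms = [set(c) for c in communicators]
    seen = {start}
    queue = deque([start])
    while queue:
        proc = queue.popleft()
        for comm in comms:
            if proc in comm:
                # Visit every member of this communicator not reached yet.
                for peer in comm - seen:
                    seen.add(peer)
                    queue.append(peer)
    return seen


if __name__ == "__main__":
    world_a = {"a0", "a1"}   # one job's MPI_COMM_WORLD
    world_b = {"b0", "b1"}   # a second job's MPI_COMM_WORLD
    link = {"a0", "b1"}      # e.g. the result of a connect/accept pair
    print(sorted(finalize_group("a1", [world_a, world_b])))
    print(sorted(finalize_group("a1", [world_a, world_b, link])))
```

Without the linking communicator the group for a1 is just its own world; once a0 and b1 share a communicator, all four processes fall into one group, which no single MPI_COMM_WORLD describes.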
>>>>>
>>>>> Bronis
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 25 Aug 2010, Fab Tillier wrote:
>>>>>
>>>>>> What defines "connected"?  MPI_FINALIZE isn't collective across
>>>>>> MPI_COMM_WORLD, as processes might never communicate with one another.
>>>>>> Even if they do, communication may not require a connection, so they
>>>>>> may never be connected.
>>>>>>
>>>>>> It seems to me there might be enough wiggle room in the standard to
>>>>>> allow MPI_Finalize to not be collective at all?
>>>>>>
>>>>>> -Fab
>>>>>>
>>>>>> Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 15:06:38
>>>>>>
>>>>>>>
>>>>>>> Josh:
>>>>>>>
>>>>>>> On p293 of the 2.2 standard, it says "MPI_FINALIZE is collective
>>>>>>> over all connected processes." I don't know that the call being
>>>>>>> collective changes your analysis but your statement that the
>>>>>>> call is not collective was incorrect...
>>>>>>>
>>>>>>> Bronis
>>>>>>>
>>>>>>>
>>>>>>> On Wed, 25 Aug 2010, Joshua Hursey wrote:
>>>>>>>
>>>>>>>> During the discussion of the run-through stabilization proposal today
>>>>>>>> on the teleconf, we spent a while discussing the expected behavior of
>>>>>>>> MPI_Init and MPI_Finalize in the presence of process failures. I
>>>>>>>> would like to broaden the discussion a bit to help pin down the
>>>>>>>> expected behavior.
>>>>>>>>
>>>>>>>> MPI_Init():
>>>>>>>> -----------
>>>>>>>> Problem: If a process fails before or during MPI_Init, what should
>>>>>>>> the MPI implementation do?
>>>>>>>>
>>>>>>>> The current standard says nothing about the return value of
>>>>>>>> MPI_Init() (Ch. 8.7). To the greatest possible extent the application
>>>>>>>> should not be put in danger if it wishes to ignore errors (assumes
>>>>>>>> MPI_ERRORS_ARE_FATAL), so returning an error from this function (in
>>>>>>>> contrast to aborting the job) might be dangerous. However, if the
>>>>>>>> application is prepared to handle process failures, it is unable to
>>>>>>>> communicate that information to the MPI implementation until after
>>>>>>>> the completion of MPI_Init().
>>>>>>>>
>>>>>>>> So a couple of solutions were presented, each with pros and cons
>>>>>>>> (please fill in if I missed any):
>>>>>>>>
>>>>>>>> 1) If a process fails in MPI_Init() (default error handler is
>>>>>>>> MPI_ERRORS_ARE_FATAL) then the entire job is aborted (similar to
>>>>>>>> calling MPI_Abort on MPI_COMM_WORLD).
>>>>>>>>
>>>>>>>> 2) If a process fails in MPI_Init() the MPI implementation will
>>>>>>>> return an appropriate error code/class (e.g.,
>>>>>>>> MPI_ERR_RANK_FAIL_STOP), and all subsequent calls into the MPI
>>>>>>>> implementation will return the error class MPI_ERR_OTHER (should we
>>>>>>>> create an MPI_ERR_NOT_ACTIVE?). Applications should eventually notice
>>>>>>>> the error and terminate.
>>>>>>>>
>>>>>>>> 3) Allow the application to register only the MPI_ERRORS_RETURN
>>>>>>>> handler on MPI_COMM_WORLD before MPI_Init() using the
>>>>>>>> MPI_Errhandler_set() function. Errors that occur before the
>>>>>>>> MPI_Errhandler_set() call are fatal. Errors afterward, including
>>>>>>>> during MPI_Init(), are not fatal.
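
The three options differ mainly in whether a peer failure around MPI_Init takes the whole job down. A small decision function makes that comparison explicit. It is plain Python, not MPI code, and only a sketch: the function name and the boolean flag are assumptions made here, and "fatal" means the job is aborted rather than an error code being returned to the caller.

```python
def init_failure_is_fatal(option, handler_set_before_failure=False):
    """Return True if a peer failure before or during MPI_Init aborts the job.

    option is 1, 2 or 3, matching the numbered options in the message.
    handler_set_before_failure only matters for option 3: it says whether
    MPI_ERRORS_RETURN had already been registered when the failure occurred.
    """
    if option == 1:
        # Default handler is MPI_ERRORS_ARE_FATAL: the entire job is aborted.
        return True
    if option == 2:
        # MPI_Init returns an error code; later calls return MPI_ERR_OTHER.
        return False
    if option == 3:
        # Fatal only if the failure came before the handler was registered.
        return not handler_set_before_failure
    raise ValueError("option must be 1, 2 or 3")


if __name__ == "__main__":
    for opt in (1, 2, 3):
        for early in (False, True):
            print(opt, early, init_failure_is_fatal(opt, early))
```

Only option 3 depends on timing, which is the window the message describes: failures before the handler is registered stay fatal.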
>>>>>>>>
>>>>>>>> In the cases where MPI_Init() returns MPI_ERR_RANK_FAIL_STOP to
>>>>>>>> indicate a process failure, is the library usable or not? If the
>>>>>>>> application can continue running through the failure, then the MPI
>>>>>>>> library should still be usable, thus MPI_Init() must be fault
>>>>>>>> tolerant in its initialization to be able to handle process failures.
>>>>>>>> If the MPI implementation finds itself in trouble and cannot continue
>>>>>>>> it should return MPI_ERR_CANNOT_CONTINUE from all subsequent calls
>>>>>>>> including MPI_Init, if possible.
>>>>>>>>
>>>>>>>>
>>>>>>>> MPI_Finalize():
>>>>>>>> ---------------
>>>>>>>> Problem: If a process fails before or during MPI_Finalize (and the
>>>>>>>> error handler is not MPI_ERRORS_ARE_FATAL), what should this function
>>>>>>>> return? Should that return value be consistent across all processes?
>>>>>>>>
>>>>>>>> To preserve locality of fault handling, a local process should not be
>>>>>>>> explicitly forced to recognize the failure of a peer process that
>>>>>>>> it never interacts with, either directly (e.g., point-to-point) or
>>>>>>>> indirectly (e.g., collective). So MPI_Finalize should be fault
>>>>>>>> tolerant and keep trying to complete even in the presence of failures.
>>>>>>>>
>>>>>>>> MPI_Finalize is not required to be a collective operation, though it
>>>>>>>> is often implemented that way. An implementation may need to delay
>>>>>>>> the return from MPI_Finalize until its role in the failure
>>>>>>>> information distribution channel is complete. But we should not
>>>>>>>> require a multi-phase commit protocol to ensure that everyone either
>>>>>>>> succeeds or returns some error. Implementations may do so internally
>>>>>>>> in order to ensure that MPI_Finalize does not hang.
>>>>>>>>
>>>>>>>> If MPI_Finalize returns an error (say MPI_ERR_RANK_FAIL_STOP
>>>>>>>> indicating a 'new to this rank' failure), what good is this
>>>>>>>> information to the application? It cannot query for which rank(s)
>>>>>>>> failed since MPI has been finalized. Nor can it initiate recovery.
>>>>>>>> The best it could do is assume that all other processes failed and
>>>>>>>> take local action.
>>>>>>>>
>>>>>>>>
>>>>>>>> MPI_Finalize: MPI_COMM_WORLD process rank 0:
>>>>>>>> --------------------------------------------
>>>>>>>> In chapter 8, Example 8.7 illustrates that "Although it is not
>>>>>>>> required that all processes return from MPI_Finalize, it is required
>>>>>>>> that at least process 0 in MPI_COMM_WORLD return, so that users can
>>>>>>>> know that the MPI portion of the computation is over."
>>>>>>>>
>>>>>>>> We deduced that the reasoning for this explanation was to allow for
>>>>>>>> MPI implementations that create and destroy MPI processes during
>>>>>>>> init/finalize from rank 0. Or, worded differently, rank 0 is the only
>>>>>>>> rank that can be assumed to exist before MPI_Init and after
>>>>>>>> MPI_Finalize.
>>>>>>>>
>>>>>>>> Problem: So what if rank 0 fails at some point during the computation
>>>>>>>> (or just some point during MPI_Finalize)?
>>>>>>>>
>>>>>>>> In the proposal, I added advice to users telling them not to
>>>>>>>> depend on any specific rank existing before MPI_Init or after
>>>>>>>> MPI_Finalize. So, in a faulty environment, the example will produce
>>>>>>>> incorrect results under certain failure scenarios (e.g., failure of
>>>>>>>> rank 0).
>>>>>>>>
>>>>>>>> In an MPI environment that depends on rank 0 for process creation and
>>>>>>>> destruction, the failure of rank 0 is (should be?) critical and the
>>>>>>>> MPI implementation will either abort the job or return
>>>>>>>> MPI_ERR_CANNOT_CONTINUE from all calls to the MPI implementation. So
>>>>>>>> we believe that the advice to users was a sufficient addition to this
>>>>>>>> section. What do others think?
>>>>>>>>
>>>>>>>>
>>>>>>>> So MPI_Init seems to be a more complex issue than MPI_Finalize. What
>>>>>>>> do folks think about the presented problems and possible solutions?
>>>>>>>> Are there other issues not mentioned here that we should be
>>>>>>>> addressing?
>>>>>>>>
>>>>>>>> -- Josh
>>>>>>>>
>>>>>>>> Run-Through Stabilization Proposal:
>>>>>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
>>>>>>>>
>>>>>>>> ------------------------------------
>>>>>>>> Joshua Hursey
>>>>>>>> Postdoctoral Research Associate
>>>>>>>> Oak Ridge National Laboratory
>>>>>>>> http://www.cs.indiana.edu/~jjhursey
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> mpi3-ft mailing list
>>>>>>>> mpi3-ft at lists.mpi-forum.org
>>>>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>