[Mpi3-ft] MPI_Init / MPI_Finalize
Joshua Hursey
jjhursey at open-mpi.org
Thu Aug 26 13:44:30 CDT 2010
I just updated the ticket.
https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/227
Anyone want to help review this? (I'll go hunting for folks a bit later if I don't hear anything.)
Thanks,
Josh
On Aug 26, 2010, at 12:03 PM, Bronis R. de Supinski wrote:
>
> Josh:
>
> Your ticket looks fine to me. I suggest we just move it to
> waiting for reviews and gather the four reviews (mine will
> count as the first one).
>
> Bronis
>
>
> On Thu, 26 Aug 2010, Joshua Hursey wrote:
>
>>
>> On Aug 26, 2010, at 11:03 AM, Bronis R. de Supinski wrote:
>>
>>>
>>> Josh:
>>>
>>> Re:
>>>> Bronis, thanks for the clarification on MPI_Finalize. I guess the
>>>> detail that I was trying to get at is that it is not specified whether
>>>> MPI_Finalize is a leave early/enter late kind of collective.
>>>>
>>>> So, at least the way I read it, it would be valid if one process enters
>>>> and exits MPI_Finalize before a different process enters MPI_Finalize
>>>> (similar to MPI_Bcast).
>>>
>>> Yes, that is correct. Since nothing is explicitly stated, it falls into
>>> the general category of collectives that users must treat as synchronizing
>>> (in terms of deadlock), even though implementations need not be.
>>>
>>>> This points to the reasoning behind the advice to implementors just
>>>> above the cited paragraph that suggests a barrier operation during
>>>> MPI_Finalize as one option. Am I interpreting this correctly?
>>>
>>> Yes.
>>>
>>>> One misleading sentence to me is the following on p291 just after the
>>>> definition of MPI_Finalize: "Each process must call MPI_FINALIZE before
>>>> it exits." The 'it' is slightly unclear. I think this is referring to
>>>> the process exiting, not the function. If 'it' referred to the function
>>>> then this would disallow the collective to have some ranks leave the
>>>> collective before all have joined, so requiring a barrier semantic.
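The leave-early reading above can be illustrated with a minimal sketch (a hypothetical program, not from the thread; it assumes a working MPI installation and a launcher such as mpirun):

```c
/* Sketch: MPI_Finalize need not synchronize, so a rank that enters it
 * early may return (and exit) before a slower rank has entered it.
 * Each process still calls MPI_Finalize exactly once before exiting. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        sleep(2);   /* rank 0 enters MPI_Finalize late */

    /* Ranks other than 0 may return from MPI_Finalize while rank 0
     * is still sleeping; the standard permits either behavior. */
    MPI_Finalize();
    printf("rank %d is past MPI_Finalize\n", rank);
    return 0;
}
```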
>>>
>>> I can see the ambiguity. I would be in favor of rewriting the
>>> sentence to eliminate the pronoun or at least the ambiguity.
>>> How about: "Before each process exits, it must call MPI_FINALIZE."
>>> I think that is clearly what was intended (not from the immediate
>>> context but from the general treatment of collectives).
>>
>> I agree. I filed a ticket on Trac (not sure if I did it 100% correctly) so we can try to get this ambiguity fixed.
>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/227
>>
>>>
>>>> For the fault tolerance discussion, I think the question is about the
>>>> consistency of the return code. Should the return code have commit/abort
>>>> properties? That is, if the operation succeeds, then all processes return
>>>> success; if some process fails during MPI_Finalize, should all processes
>>>> return some error code?
>>>
>>> Not having followed everything in the FT working group closely,
>>> it is hard for me to answer that in terms of what the group
>>> thinking is. However, it is clear to me that you cannot require
>>> a collective to be synchronizing just to ensure return code
>>> agreement. I would feel that was antithetical to the primary
>>> goal of MPI. Perhaps a user could poll to find out if an error
>>> occurs subsequently. I suppose that would result in a call that
>>> could be made after MPI_FINALIZE. However, I think I would
>>> argue that once you call MPI_FINALIZE, you don't care...
>>>
>>>> It is unclear to me if this property (commit/abort return codes) is
>>>> really useful to the application. If all processes return success and
>>>> then a process fails directly afterwards, the other processes have no
>>>> way of being notified. So what action could the remaining processes
>>>> realistically take in either the success or failure case?
>>>
>>> Further, those processes may not even exist. It is not really
>>> clear what happens to most processes after MPI_FINALIZE, which
>>> I felt was your primary point initially and with which I agree.
>>> Ultimately, why would we want to create any possible performance
>>> penalty to disseminate errors during/after MPI_FINALIZE?
>>>
>>>> So my suggestion is that we allow MPI_Finalize to preserve its loose
>>>> synchrony, leave-early collective property as long as the rank is no
>>>> longer needed to continue interacting with any of the connected
>>>> processes (say, for relaying error information). This means that some
>>>> ranks may return success while others return an error if a process
>>>> fails during finalize. MPI implementations may choose to provide
>>>> applications with commit/abort semantics, but are not required to do so.
>>>>
>>>> Does that sound reasonable for MPI_Finalize?
>>>
>>> Yes, I agree.
>>
>> I also agree with your statements above. If the process calls MPI_Finalize then it is done with MPI and it shouldn't care if there were errors.
>>
>> Thanks,
>> Josh
>>
>>>
>>> Bronis
>>>
>>>> Thanks,
>>>> Josh
>>>>
>>>> On Aug 26, 2010, at 12:15 AM, Fab Tillier wrote:
>>>>
>>>>> Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 21:08:36
>>>>>
>>>>>>
>>>>>> Fab:
>>>>>>
>>>>>> There is no wiggle room. MPI_FINALIZE is collective across
>>>>>> MPI_COMM_WORLD. I do not understand why you would say otherwise.
>>>>>> Here is more of the passage I was quoting:
>>>>>>
>>>>>> -----------------
>>>>>>
>>>>>> MPI_FINALIZE is collective over all connected processes. If no processes
>>>>>> were spawned, accepted or connected then this means over MPI_COMM_WORLD;
>>>>>
>>>>> Ahh, I missed this part, sorry.
>>>>>
>>>>> -Fab
>>>>>
>>>>>> otherwise it is collective over the union of all processes that have
>>>>>> been and continue to be connected, as explained in Section Releasing
>>>>>> Connections on page Releasing Connections.
>>>>>>
>>>>>> -----------------
>>>>>>
>>>>>> The "connected" terminology is used to handle dynamic process
>>>>>> management issues, for which the set of all processes cannot
>>>>>> easily be defined in terms of a single communicator.
>>>>>>
>>>>>> Bronis
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 25 Aug 2010, Fab Tillier wrote:
>>>>>>
>>>>>>> What defines "connected"? MPI_FINALIZE isn't collective across
>>>>>>> MPI_COMM_WORLD, as processes might never communicate with one another.
>>>>>>> Even if they do, communication may not require a connection, so they
>>>>>>> may never be connected.
>>>>>>>
>>>>>>> It seems to me there might be enough wiggle room in the standard to
>>>>>>> allow MPI_Finalize to not be collective at all?
>>>>>>>
>>>>>>> -Fab
>>>>>>>
>>>>>>> Bronis R. de Supinski wrote on Wed, 25 Aug 2010 at 15:06:38
>>>>>>>
>>>>>>>>
>>>>>>>> Josh:
>>>>>>>>
>>>>>>>> On p293 of the 2.2 standard, it says "MPI_FINALIZE is collective
>>>>>>>> over all connected processes." I don't know that the call being
>>>>>>>> collective changes your analysis but your statement that the
>>>>>>>> call is not collective was incorrect...
>>>>>>>>
>>>>>>>> Bronis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 25 Aug 2010, Joshua Hursey wrote:
>>>>>>>>
>>>>>>>>> During the discussion of the run-though stabilization proposal today
>>>>>>>>> on the teleconf, we spent a while discussing the expected behavior of
>>>>>>>>> MPI_Init and MPI_Finalize in the presence of process failures. I
>>>>>>>>> would like to broaden the discussion a bit to help pin down the
>>>>>>>>> expected behavior.
>>>>>>>>>
>>>>>>>>> MPI_Init():
>>>>>>>>> -----------
>>>>>>>>> Problem: If a process fails before or during MPI_Init, what should
>>>>>>>>> the MPI implementation do?
>>>>>>>>>
>>>>>>>>> The current standard says nothing about the return value of
>>>>>>>>> MPI_Init() (Ch. 8.7). To the greatest possible extent the application
>>>>>>>>> should not be put in danger if it wishes to ignore errors (assumes
>>>>>>>>> MPI_ERRORS_ARE_FATAL), so returning an error from this function (in
>>>>>>>>> contrast to aborting the job) might be dangerous. However, if the
>>>>>>>>> application is prepared to handle process failures, it is unable to
>>>>>>>>> communicate that information to the MPI implementation until after
>>>>>>>>> the completion of MPI_Init().
>>>>>>>>>
>>>>>>>>> So a couple of solutions were presented, each with pros and cons
>>>>>>>>> (please fill in if I missed any):
>>>>>>>>>
>>>>>>>>> 1) If a process fails in MPI_Init() (default error handler is
>>>>>>>>> MPI_ERRORS_ARE_FATAL) then the entire job is aborted (similar to
>>>>>>>>> calling MPI_Abort on MPI_COMM_WORLD).
>>>>>>>>>
>>>>>>>>> 2) If a process fails in MPI_Init() the MPI implementation will
>>>>>>>>> return an appropriate error code/class (e.g.,
>>>>>>>>> MPI_ERR_RANK_FAIL_STOP), and all subsequent calls into the MPI
>>>>>>>>> implementation will return the error class MPI_ERR_OTHER (or should
>>>>>>>>> we create an MPI_ERR_NOT_ACTIVE?). Applications should eventually
>>>>>>>>> notice the error and terminate.
>>>>>>>>>
>>>>>>>>> 3) Allow the application to register only the MPI_ERRORS_RETURN
>>>>>>>>> handler on MPI_COMM_WORLD before MPI_Init() using the
>>>>>>>>> MPI_Errhandler_set() function. Errors that occur before the
>>>>>>>>> MPI_Errhandler_set() call are fatal; errors afterward, including
>>>>>>>>> during MPI_Init(), are not.
>>>>>>>>>
>>>>>>>>> In the cases where MPI_Init() returns MPI_ERR_RANK_FAIL_STOP to
>>>>>>>>> indicate a process failure, is the library usable or not? If the
>>>>>>>>> application can continue running through the failure, then the MPI
>>>>>>>>> library should still be usable, thus MPI_Init() must be fault
>>>>>>>>> tolerant in its initialization to be able to handle process failures.
>>>>>>>>> If the MPI implementation finds itself in trouble and cannot continue
>>>>>>>>> it should return MPI_ERR_CANNOT_CONTINUE from all subsequent calls
>>>>>>>>> including MPI_Init, if possible.
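A sketch of how an application might consume options (2)/(3) above. Note that MPI_ERR_RANK_FAIL_STOP and MPI_ERR_CANNOT_CONTINUE are proposal placeholders, not standard MPI constants, so this hypothetical program uses only standard calls:

```c
/* Sketch: check MPI_Init's return code rather than assuming success,
 * then opt in to non-fatal errors for subsequent calls. Caveat: with
 * the default MPI_ERRORS_ARE_FATAL handler, a failing MPI_Init may
 * abort the job before this check is ever reached. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        /* Under option (2) this might be a proposal-specific class such
         * as MPI_ERR_RANK_FAIL_STOP; portably we can only report locally. */
        fprintf(stderr, "MPI_Init returned error code %d\n", rc);
        return EXIT_FAILURE;
    }

    /* Tell the implementation we are prepared to handle errors from
     * subsequent calls ourselves. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* ... application ... */

    MPI_Finalize();
    return 0;
}
```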
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> MPI_Finalize():
>>>>>>>>> ---------------
>>>>>>>>> Problem: If a process fails before or during MPI_Finalize (and the
>>>>>>>>> error handler is not MPI_ERRORS_ARE_FATAL), what should this
>>>>>>>>> function return? Should that return value be consistent across all
>>>>>>>>> processes?
>>>>>>>>>
>>>>>>>>> To preserve locality of fault handling, a local process should not
>>>>>>>>> be explicitly forced to recognize the failure of a peer process
>>>>>>>>> that it never interacts with, either directly (e.g., point-to-point)
>>>>>>>>> or indirectly (e.g., collective). So MPI_Finalize should be fault
>>>>>>>>> tolerant and keep trying to complete even in the presence of failures.
>>>>>>>>>
>>>>>>>>> MPI_Finalize is not required to be a collective operation, though it
>>>>>>>>> is often implemented that way. An implementation may need to delay
>>>>>>>>> the return from MPI_Finalize until its role in the failure
>>>>>>>>> information distribution channel is complete. But we should not
>>>>>>>>> require a multi-phase commit protocol to ensure that everyone either
>>>>>>>>> succeeds or returns some error. Implementations may do so internally
>>>>>>>>> in order to ensure that MPI_Finalize does not hang.
>>>>>>>>>
>>>>>>>>> If MPI_Finalize returns an error (say MPI_ERR_RANK_FAIL_STOP
>>>>>>>>> indicating a 'new to this rank' failure), what good is this
>>>>>>>>> information to the application? It cannot query for which rank(s)
>>>>>>>>> failed since MPI has been finalized. Nor can it initiate recovery.
>>>>>>>>> The best it could do is assume that all other processes failed and
>>>>>>>>> take local action.
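In code, the most an application can do with such an error is take local action (a hypothetical sketch; the error value returned is left implementation-defined since MPI_ERR_RANK_FAIL_STOP is only a proposal placeholder):

```c
/* Sketch: a non-success return from MPI_Finalize can only be handled
 * locally -- MPI is finalized, so the process cannot query which rank
 * failed or initiate any recovery. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* ... application ... */

    int rc = MPI_Finalize();
    if (rc != MPI_SUCCESS) {
        /* No MPI communication or query calls are available here;
         * just log and clean up local state. */
        fprintf(stderr, "MPI_Finalize reported an error (%d); "
                        "assuming peers may have failed\n", rc);
    }
    return 0;
}
```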
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> MPI_Finalize: MPI_COMM_WORLD process rank 0:
>>>>>>>>> --------------------------------------------
>>>>>>>>> In chapter 8, Example 8.7 illustrates that "Although it is not
>>>>>>>>> required that all processes return from MPI_Finalize, it is
>>>>>>>>> required that at least process 0 in MPI_COMM_WORLD return, so that
>>>>>>>>> users can know that the MPI portion of the computation is over."
>>>>>>>>>
>>>>>>>>> We deduced that the reasoning behind this statement was to allow
>>>>>>>>> for MPI implementations that create and destroy MPI processes
>>>>>>>>> during init/finalize from rank 0. Worded differently, rank 0 is
>>>>>>>>> the only rank that can be assumed to exist before MPI_Init and
>>>>>>>>> after MPI_Finalize.
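The intent of Example 8.7 can be sketched as follows (a hypothetical program in the spirit of the cited example, not the standard's text):

```c
/* Sketch: only process 0 of MPI_COMM_WORLD is guaranteed to return
 * from MPI_Finalize, so only it reports that the MPI portion of the
 * computation is over. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ... computation ... */

    MPI_Finalize();
    if (rank == 0)   /* other ranks may never reach this point */
        printf("MPI computation complete\n");
    return 0;
}
```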
>>>>>>>>>
>>>>>>>>> Problem: So what if rank 0 fails at some point during the computation
>>>>>>>>> (or just some point during MPI_Finalize)?
>>>>>>>>>
>>>>>>>>> In the proposal, I added an advice to users to tell them to not
>>>>>>>>> depend on any specific ranks to exist before MPI_Init or after
>>>>>>>>> MPI_Finalize. So, in a faulty environment, the example will produce
>>>>>>>>> incorrect results under certain failure scenarios (e.g., failure of
>>>>>>>>> rank 0).
>>>>>>>>>
>>>>>>>>> In an MPI environment that depends on rank 0 for process creation and
>>>>>>>>> destruction, the failure of rank 0 is (should be?) critical and the
>>>>>>>>> MPI implementation will either abort the job or return
>>>>>>>>> MPI_ERR_CANNOT_CONTINUE from all calls to the MPI implementation. So
>>>>>>>>> we believe that the advice to users was a sufficient addition to this
>>>>>>>>> section. What do others think?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So MPI_Init seems to be a more complex issue than MPI_Finalize. What
>>>>>>>>> do folks think about the presented problems and possible solutions?
>>>>>>>>> Are there other issues not mentioned here that we should be
>>>>>>>>> addressing?
>>>>>>>>>
>>>>>>>>> -- Josh
>>>>>>>>>
>>>>>>>>> Run-Through Stabilization Proposal:
>>>>>>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
>>>>>>>>>
>>>>>>>>> ------------------------------------
>>>>>>>>> Joshua Hursey
>>>>>>>>> Postdoctoral Research Associate
>>>>>>>>> Oak Ridge National Laboratory
>>>>>>>>> http://www.cs.indiana.edu/~jjhursey
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> mpi3-ft mailing list
>>>>>>>>> mpi3-ft at lists.mpi-forum.org
>>>>>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>
------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey