[Mpi3-ft] A few notes from UTK about the RTS proposal

Graham, Richard L. rlgraham at ornl.gov
Wed Dec 7 11:30:42 CST 2011


On Dec 7, 2011, at 11:55 AM, Josh Hursey wrote:

> On Mon, Dec 5, 2011 at 2:51 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>>
>> On Dec 2, 2011, at 2:00 PM, Josh Hursey wrote:
>>
>>> As I mentioned on Wednesday's call, Rich and I met with the folks at
>>> UTK on Thursday to discuss the run-through stabilization proposal. It
>>> was quite a productive meeting. Since we did not get through the whole
>>> proposal, we are having another meeting on Dec. 12 from which I will
>>> surely have more notes.
>>>
>>> Below are a few notes that I gathered from the meeting that we should
>>> discuss on the call, if not on the list beforehand.
>>>
>>> So this is a bit long, and I am sure that I am simplifying/missing
>>> details that were discussed in the meeting. UTK folks, if there are
>>> any points that are not appropriately clear, pipe-up and let us know.
>>>
>>> Thanks,
>>> Josh
>>>
>>>
>>>
>>> 17.5.6: Startup: MPI_Init
>>> -------------------------
>>> Problem: This section is still unclear, and needs further clarification.
>>>
>>> If MPI_Init fails to initialize the library, it should return an error.
>>> The user cannot call MPI_Init again, so what do we expect the user to
>>> do in a fault-tolerant application?
>>
>> At the very least, it would allow the application to print an error message.  There are libraries that use MPI internally, where the user writes a non-MPI program and MPI_Init is called from within a library call.  I can imagine an application that tries to init a library and, if that fails, exits gracefully by closing files, etc.  Or it may even try a different library.
>>
>> I realize that the MPI standard does not guarantee anything about the state of the application/environment/etc before MPI_Init is called, so we might not be able to talk precisely about what happens if MPI_Init fails anyway.
>
>
> Yeah. I think that the rationale should be clarified a bit in this
> case. One rule of thumb for a library is "thou shalt not call exit
> unless there are extraordinary circumstances." So a library that calls
> MPI_Init and tries to live by this rule would prefer to get an error
> code that it can return up the stack. Even though MPI (and maybe the
> calling library) cannot be started, this might not be a critical event
> for an application that can fall back to a different technique.
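>
> To make that concrete, a library init routine might look roughly like
> this (the LIBFOO_* names are made up; the point is only that the
> MPI_Init error is propagated up the stack instead of calling exit):
>
>   #include <mpi.h>
>
>   enum { LIBFOO_SUCCESS = 0, LIBFOO_ERR_NO_MPI = -1 };  /* hypothetical */
>
>   int libfoo_init(int *argc, char ***argv)
>   {
>       int err = MPI_Init(argc, argv);
>       if (err != MPI_SUCCESS) {
>           /* Do not exit; report the failure to the caller so the
>            * application can fall back to a different technique. */
>           return LIBFOO_ERR_NO_MPI;
>       }
>       return LIBFOO_SUCCESS;
>   }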
>
>
>>
>>> If MPI_Init encounters a process failure but can initialize properly,
>>> then it should return success. The process failure notification should
>>> be delayed until sometime after MPI_Init returns.
>>>
>>> For process recovery, we may want to clarify "the number of processes
>>> that the user asked for." This way, if some number of processes F
>>> failed during startup, the application has the ability to restart
>>> F processes within MPI_COMM_WORLD. We may not want to address this in
>>> the stabilization proposal, but it is something we should discuss for
>>> the recovery proposal.
>>>
>>>
>>> 10.5.4: Connectivity
>>> -------------------------
>>> Problem: The sentiment was that by using MPI_Abort to abort only the
>>> processes in the specified 'comm', we are doing a dangerous thing
>>> with the notion of 'connected' as defined in this section.
>>>
>>> Given the bulleted statement:
>>> "If a process terminates without calling MPI_FINALIZE, independent
>>> processes are not affected but the effect on connected processes is
>>> not defined."
>>>
>>> We might want to clarify the 'is not defined' portion of this
>>> statement. This section also has an additional note about MPI_Abort;
>>> in the proposal we are saying that MPI_Abort only terminates the
>>> processes in the specified communicator (Section 8.3).
>>>
>>> So we might need to look at this section and clarify as necessary.
>>>
>>
>> We haven't changed what the standard says about MPI_Abort; we're just restating it.  But it looks like we need to address the "not defined" part.  That also contradicts the stabilization proposal: if a process fails, we want to define what happens to the connected processes.
>>
>>>
>>> 17.6: MPI_ANY_SOURCE
>>> -------------------------
>>> Problem: The model is too complex and can be simplified.
>>>
>>> A blocking MPI_Recv(ANY) will return an error when a new failure
>>> occurs. A nonblocking MPI_Irecv(ANY) request, during the
>>> MPI_Wait/Test, will return a 'warning' (MPI_ERR_ANY_SOURCE_DISABLED)
>>> and not complete when a new failure occurs.
>>>
>>> There was a debate about whether the user needs to call another
>>> function (like MPI_Comm_reenable_any_source) to reactivate the
>>> ANY_SOURCE receives, or whether the fact that the error was returned
>>> to the user is sufficient to reactivate ANY_SOURCE. The requests that
>>> returned a 'warning' can still be matched/completed if a matching send
>>> arrives while the user is handling the error. So this kind of works
>>> like an interrupt.
>>>
>>> Additionally, once an error is returned, should we remove the
>>> protection that "Any new MPI_ANY_SOURCE receive operations using a
>>> communicator with MPI_ANY_SOURCE receive operations disabled will not
>>> be posted, and marked with an error code of the class
>>> MPI_ERR_PROC_FAIL_STOP." So new MPI_Recv(ANY) would post successfully.
>>>
>>> The sentiment is that the user is notified of the process failure via
>>> the return code from the MPI_Wait/Test, and if they do another
>>> MPI_Wait/Test or post a new ANY_SOURCE receive they implicitly
>>> acknowledge the failed process. To avoid a race condition where
>>> another process fails between the error return and the next MPI_Wait/Test,
>>> the library needs to associate an 'epoch' value with the communicator
>>> to make sure that the next MPI_Wait/Test returns another error.
>>>
>>> I like the idea of the user doing an explicit second operation, like
>>> MPI_Comm_reenable_any_source, since then we know that the user is
>>> aware of the failed group (and it seems more MPI-like to me). However,
>>> I also like the more flexible semantics. So I'm a bit split on this
>>> one.
>>>
>>
>> There's still a race with threads:  A thread posts a blocking AS receive.  The receive completes with an error because some process in the comm failed.  Another thread posts a blocking AS receive immediately after the receive on the first thread returns.
>>
>> Even though, from the MPI library's point of view, the application has been "informed" of the failure when the first receive returned with an error, there wasn't enough time for the application to react to the error and prevent the second thread from posting the receive.
>>
>> If we feel the anysource stuff is too complicated, we can just punt on it:  Process failures don't affect AS receives.  The app needs to handle potential hangs itself by using nonblocking AS receives, and failure callbacks where AS receives can be cancelled (or drained).
>
>
> That's a good point (per my earlier email in this thread). I think we
> should try to capture that scenario somewhere for future reference.
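>
> For reference, a rough sketch of the nonblocking flow as I understand
> it (MPI_ERR_ANY_SOURCE_DISABLED and MPI_Comm_reenable_any_source are
> the proposal's names; the exact argument list of the reenable call is
> assumed, and 'comm' is some communicator already in scope):
>
>   int         buf, tag = 0, err;
>   MPI_Request req;
>   MPI_Status  status;
>
>   MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm, &req);
>   do {
>       err = MPI_Wait(&req, &status);
>       if (err == MPI_ERR_ANY_SOURCE_DISABLED) {
>           /* The request did NOT complete; a new failure was reported.
>            * Handle it (check the failed group, etc.), then re-enable
>            * ANY_SOURCE on the communicator and wait again. */
>           MPI_Comm_reenable_any_source(comm);   /* signature assumed */
>       }
>   } while (err == MPI_ERR_ANY_SOURCE_DISABLED);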
>
>>
>>>
>>> 17.6: MPI_COMM_REENABLE_ANY_SOURCE
>>> ----------------------------------
>>> Problem: The name of the function "MPI_COMM_REENABLE_ANY_SOURCE" is
>>> misleading since posted MPI_Irecv requests are left in the matching
>>> queue.
>>>
>>> Posted MPI_Irecv(ANY) requests are not disabled when a new failure
>>> occurs, nor reactivated when this function is called. The 'disable'
>>> and '(re)enable' parlance seems to indicate that all of the operations
>>> completed, when they may not have.
>>>
>>> So the suggestion was that we talk in terms of
>>> deactivating/reactivating ANY_SOURCE operations. Replacing this word
>>> would put us close to the 30 character limit. I don't know if there is
>>> better language we can use to communicate this idea or not.
>>>
>>
>> I never really liked the name either; I just didn't have a better alternative.  People had trouble with this at the plenary too.  Maybe we just need to make clear that the only thing that's being disabled/re-enabled is the ability to POST new anysource receives.
>>
>> I'm all for a new name, though.
>
> Some options: (maybe?)
>
> 20: MPI_Comm__any_source (base)
> + 8: reenable
> +10: reactivate
> + 5: allow
> + 7: approve
> + 5: bless

[rich] If we take the approach suggested at the meeting at UTK, we don't need this at all.  Checking the failed group has the same outcome.

Rich

>
>
>>
>>>
>>> 17.6: MPI_COMM_DRAIN
>>> ----------------------------------
>>> Problem: If MPI_Comm_drain is collective and meant to be called in the
>>> FailHandler, then we need some ordering of when the FailHandlers are
>>> triggered in order to avoid deadlocks.
>>>
>>> This seemed to be less of a problem with MPI_Comm_drain and more of a
>>> problem with FailHandler triggering (see below).
>>>
>>> It was mentioned that this operation might be implementable as a local
>>> operation by associating another matching bit (an 'epoch') with the
>>> communication channel. Incrementing the epoch would then exclude all
>>> previous messages from matching. We don't say that the collective
>>> MPI_Comm_drain is synchronizing, so I think this would be a valid
>>> implementation of the operation. Something to keep in mind; see the
>>> sketch below.
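>>>
>>> Purely to illustrate the epoch idea (the internal types below are
>>> made up): the implementation's matching predicate would also compare
>>> an epoch, so a local bump of the communicator's epoch inside
>>> MPI_Comm_drain retires all earlier traffic without talking to peers.
>>>
>>>   struct envelope    { int context_id, epoch, source, tag; };
>>>   struct posted_recv { int context_id, epoch, source, tag; };
>>>
>>>   static int msg_matches(const struct envelope *m,
>>>                          const struct posted_recv *r)
>>>   {
>>>       return m->context_id == r->context_id
>>>           && m->epoch      == r->epoch   /* old epochs never match */
>>>           && (r->source == MPI_ANY_SOURCE || m->source == r->source)
>>>           && (r->tag    == MPI_ANY_TAG    || m->tag    == r->tag);
>>>   }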
>>>
>>> We also need to clarify that MPI_Comm_drain is a fault-tolerant
>>> collective, so there is no need to call MPI_Comm_validate beforehand.
>>
>> OK
>>
>>>
>>> 17.7.1: Validating Communicators
>>> ----------------------------------
>>> Problem: We would like to be able to fix all of the communication
>>> problems resulting from a newly failed process outside of the critical
>>> path.
>>>
>>> The sentiment was that with the FailHandler and
>>> MPI_Comm_dup/MPI_Comm_free, one should be able to eliminate the need
>>> to ever call MPI_Comm_validate. So why have it in the standard?
>>
>> To allow other models of FT applications.  I think we'd like to avoid forcing all apps to tear all communicators down and restart for every failure.  Validate allows one to continue using existing communicators with failed processes.
>
>
> I agree. If a programmer does not want to use it, they don't have to.
> Having it in there helps with what I would say is a common style of MPI
> programming.
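>
> As a strawman, the "keep the communicator" style would look roughly
> like the following (I am guessing at MPI_Comm_validate's argument
> list here; treat it as pseudocode):
>
>   MPI_Group failed;
>   int       local = 1, global;
>
>   /* Collectively agree on the failed set and re-enable collective
>    * operations on 'comm'. */
>   MPI_Comm_validate(comm, &failed);
>
>   /* Keep using the same communicator afterwards. */
>   MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, comm);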
>
>
>>
>>> So in the FailHandler the user would call MPI_Comm_dup/MPI_Comm_split
>>> on the communicator provided, and get a replacement that is
>>> collectively active. Then they would MPI_Comm_free the old
>>> communicator (maybe doing an MPI_Comm_drain to flush messages before
>>> freeing it), and finally replace the comm handle with the newly
>>> dup'ed communicator.
>>>
>>> MPI_Comm_dup does not guarantee that the resulting communicator is
>>> collectively active. So this might not work in the current proposal.
>>> But if it did, then you might be able to get away with this trick.
>>
>> If you're already duping/splitting and draining, I think doing a validate shouldn't be much more overhead, especially if this is done only on an error.
>
>
> Maybe we should remove the restrictions on the FailHandler to allow
> the user to use any MPI function in the handler.
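>
> For the record, the dup/drain/free pattern described above would look
> something like this sketch (the fail-handler signature and the
> MPI_Comm_drain argument list are only guesses at the proposal's
> interface):
>
>   MPI_Comm active_comm;   /* the communicator the application uses */
>
>   void fail_handler(MPI_Comm comm)
>   {
>       MPI_Comm newcomm;
>
>       MPI_Comm_dup(comm, &newcomm);   /* replacement communicator */
>       MPI_Comm_drain(comm);           /* flush outstanding messages */
>       MPI_Comm_free(&comm);           /* done with the old one */
>       active_comm = newcomm;          /* swap in the replacement */
>   }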
>
>
>>
>>>
>>> 17.8.2: Communicator Management
>>> ----------------------------------
>>> Problem: We make no statement on whether the newly created
>>> communicator is collectively active or not.
>>>
>>> If you pass in a collectively inactive communicator, it is likely
>>> that you get out a collectively inactive new communicator, since we do
>>> not know whether all alive processes in the communicator know of the
>>> same set of failed processes.
>>>
>>> If we required that the new communicator be collectively active, then
>>> we would preclude locally functioning optimizations, like caching cids
>>> for MPI_Comm_dup.
>>>
>>> If you pass in a collectively active communicator, then the user may
>>> expect that the new communicator is also collectively active. If a new
>>> process failure emerges during the creation operation, the reporting
>>> of it may be delayed until after the creation operation has completed,
>>> effectively pretending that the failure happened just after the
>>> creation operation.
>>>
>>> So do we need to explicitly clarify this point? Would making an advice
>>> to implementors regarding 'the newly created communicator should be
>>> collectively active' help address the problem?
>>
>> As you pointed out, it's not useful to say that a new communicator is collectively active, since a new failure may change that.  What we can say is that a communicator created from an inactive communicator is NOT collectively active.  In the case of a comm_split, it's possible that a new communicator has no failed processes.  In that case, I'd imagine we'd want it to be collectively active.
>
>
> So I'm going back and forth on this one. I guess if we say nothing
> (like we do now), what does that mean for the program developer? Do
> they have to take any precautions that they would not take otherwise? I
> don't think so, but I need to ponder it a bit more.
>
>
>>
>>>
>>> 17.5.1: Process Failure Handlers
>>> ----------------------------------
>>> Problem: FailHandlers are only called if a process fails on the
>>> associated communication object. If the application is using many
>>> different communicators, each with a FailHandler, then it is likely
>>> that different FailHandlers will be triggered off of different
>>> communicators. If the application uses MPI_Comm_drain in the handler,
>>> this can lead to deadlock and other badness.
>>
>> I think we should reconsider the requirement that fail handlers are only called when an MPI call is performed on the associated object.  I think it complicates things and doesn't buy us much.
>>
>> That said, we can still have an ordering problem.
>>
>>>
>>> So this is a tough problem. What we want is to trigger all
>>> FailHandlers that include the failed process (or maybe all
>>> FailHandlers, regardless of whether they include the failed process)?
>>> Additionally, we need to trigger them in a consistent order across all
>>> processes so that we avoid deadlock scenarios.
>>>
>>> It was proposed that we create something like a global partial
>>> ordering of FailHandler calls. The FailHandler
>>> registration/deregistration calls would be collective (so we can
>>> coordinate the call ordering). The order of the calls is determined by
>>> the MPI library, but the user is guaranteed that FailHandlers will be
>>> called in the same order at all processes.
>>>
>>> There is some pretty complex hand waving associated with doing this,
>>> and I would do it a disservice trying to capture it here since it was
>>> not my idea. But that is the general problem with FailHandlers that we
>>> need to find a way to work around.
>>>
>>> Additionally, they wanted clarification that the FailHandler is
>>> triggered with a globally consistent group of failed processes. This
>>> implies that the FailHandler mechanism is calling an agreement
>>> protocol (like MPI_Comm_validate) behind the scenes to generate this
>>> level of consistency. I would think that the MPI library would have to
>>> delay the reporting of additional process failures that occur during
>>> the FailHandler operation until after the handlers complete.
>>>
>>
>> I think we should keep the failure handler stuff local, without coordination.  If the application needs things like a partial order of failure notification, the stabilization proposal should have the features that would allow the application (or a third-party library) to implement that.
>>
>> We talked about doing the drain and dup stuff in the failure handler in order to provide FT-MPI-like behavior.  In this case, when a process fails, all communicators need to be torn down, not just the ones with failed processes.  The application would need to keep a list of communicators that need to be freed, and then drain/free them in the same order at every process.
>>
>> We could provide a drain_all convenience function that would drain/free all communicators and just leave comm_world.  This could then be used for FT-MPI-style FT.
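>>
>> In user code that would amount to something like the following
>> (purely illustrative: comm_list/ncomms are the application's own
>> bookkeeping, and MPI_Comm_drain's argument list is assumed):
>>
>>   MPI_Comm comm_list[64];   /* kept in the same order at every rank */
>>   int      ncomms = 0;
>>
>>   void drain_all(void)
>>   {
>>       for (int i = 0; i < ncomms; i++) {
>>           MPI_Comm_drain(comm_list[i]);    /* flush pending traffic */
>>           MPI_Comm_free(&comm_list[i]);    /* tear the comm down */
>>       }
>>       ncomms = 0;   /* only MPI_COMM_WORLD is left */
>>   }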
>
>
> Let's discuss this one on the call.
>
> Thanks,
> Josh
>
>
>>
>> -d
>>
>>
>>>
>>> --
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>>> _______________________________________________
>>> mpi3-ft mailing list
>>> mpi3-ft at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>
>>
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>




