[Mpi3-ft] General Network Channel Failure

Bronevetsky, Greg bronevetsky1 at llnl.gov
Mon Jun 13 19:06:22 CDT 2011


I like the idea of a query interface. This is primarily because I believe that fault tolerance should be the responsibility of middleware rather than applications. Writing scalable codes is already hard; writing your own reliability solutions that are also scalable and robust is almost certainly too much to ask. However, if we leave the job to middleware, we'll need to provide an interface that is both portable and flexible.

If we go for an MPI-level interface we get portability, but we also get complexity, because MPI provides developers with a large number of capabilities (send to/receive from a rank, collectively communicate on a communicator, read/write a file, etc.). A full API will need to allow the application to perform queries to identify which capabilities remain available after a given fault. A decent API might let users query for individual capabilities and also provide short-cut queries. For example, the run-through API focuses on fail-stop failures of one or more ranks; if MPI reports that a rank has failed, this implies that a fixed set of capabilities is no longer available. We can extend this idea to other failure models that virtualize sets of capabilities under a single name, meaning that a problem with a single virtual capability (slow network or corrupted file) implies a specific change to a large number of capabilities (all sends/receives are slow, or reads/writes to just this file are invalid).
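
To make that concrete, here is a minimal sketch of what a per-capability query plus a short-cut might look like. Every MPIX_* name and constant below is hypothetical and purely illustrative; nothing here is part of an existing proposal:

    #include <mpi.h>

    /* Hypothetical capability identifiers -- illustration only. */
    #define MPIX_CAP_SEND        1   /* can we still send to a given rank?      */
    #define MPIX_CAP_RECV        2   /* can we still receive from a given rank? */
    #define MPIX_CAP_COLLECTIVE  3   /* can we still run collectives on comm?   */

    /* Hypothetical per-capability query: after an error, ask whether a
     * specific capability involving 'rank' on 'comm' is still available. */
    int MPIX_Comm_query_capability(MPI_Comm comm, int rank,
                                   int capability, int *available);

    static void react_to_failure(MPI_Comm comm, int peer)
    {
        int can_send = 0;
        MPIX_Comm_query_capability(comm, peer, MPIX_CAP_SEND, &can_send);
        if (!can_send) {
            /* Short-cut reasoning: a fail-stop of 'peer' implies a fixed set
             * of lost capabilities (sends, receives, collectives involving
             * it), so middleware can react once instead of querying each
             * capability individually. */
        }
    }

The attraction is that a middleware layer can do this bookkeeping once and present applications with something much simpler.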

Further, the failure models themselves do not need to be specified in the spec. The reason this might make sense is that we may want to model different levels of the system using different abstractions. A simple way to represent network failures is to treat the network as a monolithic entity; in this model MPI_NETWORK_ERROR means that no communication can take place. A more precise abstraction might treat the network as a set of communication islands, where one island can fail without causing errors for communication on other islands. An even more precise model might track individual rank-to-rank connections and allow the user to identify which connections are valid/invalid after a failure. All of these models are just different portable approximations of the real system state and are useful for different applications. Right now I'm nervous about choosing one to be in the spec while leaving the others out. However, it could be quite useful to require MPI to implement some fairly detailed model, leave the higher-level ones to be implemented outside the spec, and provide a standard interface to present them to users.
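
As a rough sketch of how those three granularities would feel to the caller (again, every MPIX_* name is hypothetical, and MPI_NETWORK_ERROR is the hypothetical class from the previous paragraph; only MPI_ERR_OTHER, MPI_Comm, and MPI_Group are real MPI today):

    /* (1) Monolithic network: one coarse class says "no communication is
     *     possible"; today an implementation might only report MPI_ERR_OTHER. */
    if (error_class == MPI_ERR_OTHER) {
        /* checkpoint, fall back to local work, or abort */
    }

    /* (2) Islands: a hypothetical query returns the group of ranks that are
     *     still reachable; communication within that island remains valid. */
    MPI_Group reachable;
    MPIX_Comm_reachable_group(comm, &reachable);               /* hypothetical */

    /* (3) Per-connection: a hypothetical query lists peers whose rank-to-rank
     *     connections are currently invalid, so the caller (or middleware)
     *     can route around them. */
    int num_bad;
    int *bad_peers;
    MPIX_Comm_failed_connections(comm, &num_bad, &bad_peers);  /* hypothetical */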

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com


> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
> Sent: Monday, June 13, 2011 2:06 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] General Network Channel Failure
>
> Yeah the 'system model' is probably not going to fly with the MPI Forum.
> Mostly because we would be in the business then of defining what a 'node',
> 'NIC', 'RAM', etc. are - something the standard strives not to define, for
> future proofing.
>
> So I like the MPI model a bit more. For resource exhaustion we do get a bit of
> wiggle room with the 'reliable communication' requirement in
> 3.7 "If the call causes some system resource to be exhausted, then it will fail
> and return an error code." And for memory in 8.2 "The function
> MPI_ALLOC_MEM may return an error code of class MPI_ERR_NO_MEM to
> indicate it failed because memory is exhausted."
>
> So even though MPI shies away from defining network resources and
> memory, we might be able to find some acceptable language within the
> confines of these few error statements. So I think we can talk about
> connections between pairs (P2P) and groups (collectives) of processes, and
> maybe even connections to files (I/O). For each error class that we want to
> define we need to go through the process of asking "so if this error code was
> returned from this operation, what would that mean to the programmer?
> How do we want them to react to it?" Taking resource exhaustion as an
> example, for point-to-point messages if the application receives such an
> error then they may want to repost the message later. But if the same error
> is returned locally from a collective operation we need to determine if the
> user should repost the same collective or if the collective is guaranteed to
> return at all other processes with either success or some other error.
>
> I believe that the primitives that we specified for fail-stop process failure can
> likely be extended to support other classes of errors without too much
> modification. But we will have to see as we go along how true that becomes.
>
>
> For Greg, this is the thread of conversation in which we will likely want to
> explore transient failures and possibly even revive the discussion about
> performance degradation notification (probably performance notification is
> best saved for another thread). But thinking about modern failure detectors
> that provide an assurance range (accrual failure detectors are what I am
> thinking of specifically), then MPI may want to return an error indicating
> something symptomatic of the eventual diagnosis of a process as either
> fail-stop or transient.
>
>
> To Kathryn's point about the MPI_T proposal, so you are suggesting that we
> provide a query interface that the application can use to determine what
> types of errors can be handled and how the MPI implementation allows the
> application to continue after them? I think this is a good/necessary idea for
> tools support - so maybe as an extension to the MPI_T proposal. But I am
> hesitant to do so as the main mechanism for processing emerging errors. I
> think that applications would struggle with something that flexible, since it
> seems to imply that they would have to adapt not just to the error code, but to
> the error code plus the reported capabilities of the MPI implementation. So for
> the programmer they would need to have a check for the error code, a query
> for capabilities, and then a switch statement for all the possible actions that
> they can take. I like the flexibility, but I think it becomes too much of a
> burden on the programmer.
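>
> In code, that check / query / switch would look roughly like the following
> (a sketch only; the MPIX_* names are hypothetical, and just MPI_Send and
> MPI_SUCCESS are real MPI):
>
>   int err = MPI_Send(buf, count, MPI_DOUBLE, dest, tag, comm);
>   if (err != MPI_SUCCESS) {
>       int action;
>       /* hypothetical query for what the implementation says the
>        * application is still allowed to do after this error */
>       MPIX_Error_query_capabilities(comm, err, &action);
>       switch (action) {
>       case MPIX_ACTION_RETRY_LATER:   /* e.g., resource exhaustion */
>           break;
>       case MPIX_ACTION_PEER_LOST:     /* e.g., fail-stop of dest   */
>           break;
>       default:                        /* capability set we do not understand */
>           break;
>       }
>   }
>
> Every error path in the application would need some version of this.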
>
> I guess the counter argument would be that the MPI implementations
> should be, today, documenting all expected errors that they could return to
> the user. The MPI_T-like interface just provides a programmatic way to access
> this information and react to it dynamically, instead of relying on
> documentation updates and a round of software updates. So this would fit
> nicely into the existing requirements for MPI implementations, though at
> some additional complexity to the end user.
>
> Interesting idea, if I understand it correctly. What do others think?
>
>
> -- Josh
>
>
> On Sun, Jun 12, 2011 at 12:09 PM, Kathryn Mohror <kathryn at llnl.gov> wrote:
> > Hi all,
> >
> > (I sent this earlier, but I don't think it went through because I sent
> > it from the wrong email address. I apologize if you get an extra
> > copy.)
> >
> > I know I haven't participated in this working group yet, so I may be
> > missing some context, but I couldn't resist putting my two cents in!
> >
> > I think that an MPI-centric approach is best. Otherwise, you run the
> > risk of defining a model that doesn't fit with a particular
> > implementation or machine and get shot down when it's brought to the
> > forum. For example, you may remember the PERUSE performance interface
> > that assumed a model of MPI that implementers didn't approve of, because
> > it didn't fit their implementation or was difficult/expensive to
> > support. Now, to replace PERUSE, we've got the MPI_T interface which
> > doesn't specify
> > *anything* but appears to be supported by the forum.
> >
> > I agree though that having more specific error information when it's
> > available would be very useful. You might consider taking an approach
> > similar to MPI_T -- allow MPI implementers to define any specific
> > error codes they can/want and then provide an interface for decoding
> > and interpreting the errors.
> >
> > Of course, this approach may not be useful for most applications
> > directly, but I imagine that a fault-tolerant MPI application or a
> > checkpoint/restart library could make use of the information, assuming
> > they could get at it.
> >
> > Kathryn
> >
> >
> >
> >
> > On 6/9/2011 8:20 AM, Howard Pritchard wrote:
> >>
> >> Hi Greg,
> >>
> >> I vote for an MPI-centric model.
> >>
> >> I also think that part of the job of MPI is to hide as much as
> >> possible things like 'exhaustion of network resources'
> >> and 'intermittent network failures'.  Indeed, the very first sentence
> >> in section 2.8 says "MPI provides the user with reliable message
> >> transmission".
> >>
> >> The only reason the topic came up yesterday was in the context of the
> >> fail-stop model and what types of error codes might be returned by
> >> MPI before the official verdict was in that a fail-stop had occurred.
> >> Several of us checked what our implementations might do prior to
> >> that, and it could include returning MPI_ERR_OTHER.  I could see how
> >> for someone writing a fault tolerant MPI application, something more
> >> useful than this rather ambiguous error code might be worth defining.
> >>
> >> Howard
> >>
> >>
> >> Bronevetsky, Greg wrote:
> >>>
> >>> I like the idea of having an abstract model of failures that can
> >>> approximate changes in system functionality due to failures.
> >>> However, I think before we go too far with this we should consider
> >>> the type of model we want to make. One option is to make a system
> >>> model that has as its basic elements nodes, network links and other
> >>> hardware components and identifies points in time when they stop
> >>> functioning. The other option is to make it MPI-centric by talking
> >>> about the status of ranks and point-to-point communication between
> >>> them as well as communicators and collective communication over
> >>> them. So in the first type of model we can talk about network
> >>> resource exhaustion and in the latter we can talk about an intermittent
> >>> inability to send messages over some or all communicators.
> >>>
> >>> I think that the MPI-centric model is a better option since it talks
> >>> exclusively about entities that exist in MPI and ignores the
> >>> physical phenomena that cause a given type of degradation in
> >>> functionality.
> >>>
> >>> The other question we need to discuss is the types of problems we
> >>> want to represent. We obviously care about fail-stop failures, but
> >>> we are not yet talking about resource exhaustion. Do we want to add error
> >>> classes for transient errors and if so, what about performance
> >>> slowdowns?
> >>>
> >>> Greg Bronevetsky
> >>> Lawrence Livermore National Lab
> >>> (925) 424-5756
> >>> bronevetsky at llnl.gov
> >>> http://greg.bronevetsky.com
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> >>>> bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
> >>>> Sent: Wednesday, June 08, 2011 11:36 AM
> >>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working
> >>>> Group
> >>>> Subject: [Mpi3-ft] General Network Channel Failure
> >>>>
> >>>> It was mentioned in the conversation today that
> >>>> MPI_ERR_RANK_FAIL_STOP may not be the first error returned by an
> >>>> MPI call. In particular the MPI call may return an error
> >>>> symptomatic of a fail-stop process failure (e.g., network link
> >>>> failed - currently MPI_ERR_OTHER), before eventually diagnosing the
> >>>> event as a process failure. This 'space between' MPI_SUCCESS
> >>>> behavior and MPI_ERR_RANK_FAIL_STOP behavior is not currently
> >>>> defined, and probably should be, so that the application can cleanly move
> >>>> from the set of semantics for one error class to another.
> >>>>
> >>>> The suggestion was to create a new general network error class
> >>>> (e.g., MPI_ERR_COMMUNICATION or MPI_ERR_NETWORK - MPI_ERR_COMM is
> >>>> taken) that can be returned when the operation cannot complete due
> >>>> to network issues (which might be later diagnosed as process
> >>>> failure and escalated to the MPI_ERR_RANK_FAIL_STOP semantics). You
> >>>> could also think about this error being used for network resource
> >>>> exhaustion as well (something that Tony mentioned during the last
> >>>> MPI Forum meeting). In which case retrying at a later time or
> >>>> taking some other action before trying again would be
> >>>> useful/expected.
> >>>>
> >>>> There are some issues with matching, and implications for
> >>>> collective operations. If the network error is sticky/permanent
> >>>> then once the error is returned it will always be returned or
> >>>> escalated to fail-stop process failure (or more generally to a
> >>>> 'higher/more severe/more detailed' error class). A recovery
> >>>> proposal (similar to what we are developing for process
> >>>> failure)
> >>>> would allow the application to 'recover' the channel and continue
> >>>> communicating on it.
> >>>>
> >>>>
> >>>> The feeling was that this should be expanded into a full proposal,
> >>>> separate from the Run-Through Stabilization proposal. So we can
> >>>> continue with the RTS proposal, and bring this forward when it is
> >>>> ready.
> >>>>
> >>>>
> >>>> What do folks think about this idea?
> >>>>
> >>>> -- Josh
> >>>>
> >>>> --
> >>>> Joshua Hursey
> >>>> Postdoctoral Research Associate
> >>>> Oak Ridge National Laboratory
> >>>> http://users.nccs.gov/~jjhursey
> >>>> _______________________________________________
> >>>> mpi3-ft mailing list
> >>>> mpi3-ft at lists.mpi-forum.org
> >>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >>>
> >>> _______________________________________________
> >>> mpi3-ft mailing list
> >>> mpi3-ft at lists.mpi-forum.org
> >>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >>
> >>
> > _______________________________________________
> > mpi3-ft mailing list
> > mpi3-ft at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >
> >
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft



