[Mpi3-ft] General Network Channel Failure

Kathryn Mohror kathryn at llnl.gov
Mon Jun 27 11:51:09 CDT 2011


Hi Josh,

> I guess I need some more specifics about what a query interface would
> look like for a certain class of errors to better understand the use
> cases. Do you want to start a wiki page with some details and
> examples?

Yes, generally speaking, I am interested in working on this, but I don't 
have time to devote to it in the near future. Possibly I'll have time to 
think on it after MPI_T is completed.

Kathryn

>
> -- Josh
>
>
> On Tue, Jun 14, 2011 at 1:10 PM, Kathryn Mohror<kathryn at llnl.gov>  wrote:
>> Hi Josh, all,
>>
>> On 6/13/2011 5:06 PM, Bronevetsky, Greg wrote:
>>>
>>> I like the idea of a query interface. This is primarily because I
>>> believe that fault tolerance should be the responsibility of
>>> middleware rather than applications. Writing scalable codes is
>>> already hard and writing your own reliability solutions that are also
>>> scalable and robust is almost certainly too much to ask for. However,
>>> if we leave the job to middleware, we'll need to provide an interface
>>> that is both portable and flexible.
>>
>> Yes, this is the idea I was going for -- that fault tolerance would be best
>> handled by libraries/middleware rather than by applications directly. Then,
>> applications don't need to focus on the complexities of dealing with a query
>> interface for MPI-implementation-specific error handling. That said, I am
>> not sure whether this interface would belong in the MPI_T interface, since
>> that was written with gathering and controlling performance information in
>> mind. However, if the intent is to have the interface be used primarily by
>> tools (and rogue power-user application writers), then maybe the tools
>> chapter is the best place for a query/control interface for fault tolerance,
>> possibly under a different namespace, MPI_FT?
>>
>> Kathryn
>>
>>>
>>> If we go for an MPI-level interface we get portability but we also
>>> get complexity because MPI provides developers with a large number of
>>> capabilities (send to/receive from a rank, collectively communicate
>>> on communicator, read/write a file, etc.). A full API will need to
>>> allow the application to perform queries to identify which
>>> capabilities are available after a given fault. A decent API for this
>>> might be to allow users to query for individual capabilities and also
>>> allow for short-cut APIs. For example, the run-through API focuses on
>>> fail-stop failures of one or more ranks. As such, if MPI reports that
>>> a rank has failed, this implies that a fixed set of capabilities is
>>> no longer available. We can extend this idea to other failure models
>>> that virtualize sets of capabilities under a single name, meaning
>>> that a problem with a single virtual capability (slow network or
>>> corrupted file) implies a specific change to a large number of
>>> capabilities (all sends/receives are slow, or reads/writes to just
>>> this file are invalid).
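
As a concrete illustration, a capability query along these lines might look
roughly like the sketch below. The MPIX_FT_* names and the stub are purely
hypothetical placeholders; nothing like them exists in the standard or in the
run-through proposal.

    #include <mpi.h>

    /* Hypothetical capability-query sketch; the MPIX_FT_* names are
     * placeholders only and do not exist in any MPI implementation. */
    typedef enum {
        MPIX_FT_CAP_PT2PT,        /* point-to-point sends/receives      */
        MPIX_FT_CAP_COLLECTIVE,   /* collectives on the communicator    */
        MPIX_FT_CAP_FILE_IO       /* reads/writes on an associated file */
    } MPIX_FT_capability;

    /* Stub standing in for the answer a real MPI library would give
     * after a fault has been reported on 'comm'. */
    static int MPIX_FT_Capability_get(MPI_Comm comm, MPIX_FT_capability cap,
                                      int *available)
    {
        (void)comm; (void)cap;
        *available = 0;  /* a real implementation would consult its own state */
        return MPI_SUCCESS;
    }

    /* Middleware-level reaction: choose a degraded mode from what survives. */
    static void react_to_fault(MPI_Comm comm)
    {
        int p2p_ok = 0, coll_ok = 0;
        MPIX_FT_Capability_get(comm, MPIX_FT_CAP_PT2PT, &p2p_ok);
        MPIX_FT_Capability_get(comm, MPIX_FT_CAP_COLLECTIVE, &coll_ok);

        if (p2p_ok && !coll_ok) {
            /* e.g., re-implement the needed collectives over point-to-point */
        } else if (!p2p_ok) {
            /* no usable communication on this communicator: checkpoint
             * locally and wait for a separate recovery/rebuild step */
        }
    }

The short-cut idea would then amount to mapping a single reported failure
model onto a whole set of these capability answers at once.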
>>>
>>> Further, failure models themselves do not need to be specified in the
>>> spec. The reason this might make sense is if we want to model various
>>> levels using different abstractions. A simple way to represent
>>> network failures is to treat the network as a monolithic entity. In
>>> this model MPI_NETWORK_ERROR means that no communication can take
>>> place. A more precise abstraction might treat the network as a set of
>>> communication islands and one island can fail without causing errors
>>> for communication on other islands. An even more precise model may
>>> track individual rank-to-rank connections and allow the user to
>>> identify which connections are valid/invalid after a failure. All
>>> these models are just different portable approximations of the real
>>> system state and are useful for different applications. Right now I'm
>>> nervous about choosing one to be in the spec while leaving the others
>>> out. However, it should be pretty useful to force MPI to implement
>>> some fairly detailed choice and leave the! higher-level ones to be
>>> implemented outside the spec but provide a standard interface to
>>> present them to users.
>>>
>>> Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756
>>> bronevetsky at llnl.gov http://greg.bronevetsky.com
>>>
>>>
>>>> -----Original Message----- From:
>>>> mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
>>>> bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey Sent: Monday,
>>>> June 13, 2011 2:06 PM To: MPI 3.0 Fault Tolerance and Dynamic
>>>> Process Control working Group Subject: Re: [Mpi3-ft] General
>>>> Network Channel Failure
>>>>
>>>> Yeah, the 'system model' is probably not going to fly with the MPI
>>>> Forum, mostly because we would then be in the business of defining
>>>> what a 'node', 'NIC', 'RAM', etc. are - something the standard
>>>> strives not to define, for future proofing.
>>>>
>>>> So I like the MPI model a bit more. For resource exhaustion we do
>>>> get a bit of wiggle room with the 'reliable communication'
>>>> requirement in 3.7 "If the call causes some system resource to be
>>>> exhausted, then it will fail and return an error code." And for
>>>> memory in 8.2 "The function MPI_ALLOC_MEM may return an error code
>>>> of class MPI_ERR_NO_MEM to indicate it failed because memory is
>>>> exhausted."
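
For reference, the MPI_ALLOC_MEM case is already expressible with standard
calls; a minimal sketch, with the error handler on MPI_COMM_WORLD set to
MPI_ERRORS_RETURN so the error code comes back instead of aborting the job:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        void *buf = NULL;
        int err, eclass;

        MPI_Init(&argc, &argv);
        /* Return error codes instead of aborting the job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        err = MPI_Alloc_mem((MPI_Aint)(64 * 1024 * 1024), MPI_INFO_NULL, &buf);
        if (err != MPI_SUCCESS) {
            MPI_Error_class(err, &eclass);
            if (eclass == MPI_ERR_NO_MEM) {
                /* Memory is exhausted: fall back to a smaller buffer or
                 * free something and try again later. */
                fprintf(stderr, "MPI_Alloc_mem: memory exhausted\n");
            }
        } else {
            MPI_Free_mem(buf);
        }

        MPI_Finalize();
        return 0;
    }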
>>>>
>>>> So even though MPI shies away from defining network resources and
>>>> memory, we might be able to find some acceptable language within
>>>> the confines of these few error statements. So I think we can talk
>>>> about connections between pairs (P2P) and groups (collectives) of
>>>> processes, and maybe even connections to files (I/O). For each
>>>> error class that we want to define we need to go through the
>>>> process of asking "so if this error code was returned from this
>>>> operation, what would that mean to the programmer? How do we want
>>>> them to react to it?" Taking resource exhaustion as an example, for
>>>> point-to-point messages if the application receives such an error
>>>> then they may want to repost the message later. But if the same
>>>> error is returned locally from a collective operation we need to
>>>> determine if the user should repost the same collective or if the
>>>> collective is guaranteed to return at all other processes with
>>>> either success or some other error.
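
To make the point-to-point case concrete, the 'repost later' reaction could
look something like the sketch below. There is no standard error class for
transient send-side resource exhaustion today, so the check against
MPI_ERR_OTHER is only a stand-in for whatever class a proposal would define,
and the communicator is assumed to use MPI_ERRORS_RETURN.

    #include <mpi.h>
    #include <unistd.h>   /* sleep() for a crude back-off */

    /* Repost a send if the implementation reports a (hypothetically)
     * retryable resource-exhaustion error.  MPI_ERR_OTHER is a stand-in
     * for whatever error class such a proposal would actually define. */
    int send_with_retry(const void *buf, int count, MPI_Datatype type,
                        int dest, int tag, MPI_Comm comm, int max_tries)
    {
        int err = MPI_SUCCESS, eclass, attempt;

        for (attempt = 0; attempt < max_tries; attempt++) {
            err = MPI_Send(buf, count, type, dest, tag, comm);
            if (err == MPI_SUCCESS)
                return MPI_SUCCESS;

            MPI_Error_class(err, &eclass);
            if (eclass != MPI_ERR_OTHER)   /* not the retryable case: give up */
                return err;

            sleep(1);   /* back off, then repost the message */
        }
        return err;
    }

The open question for collectives is exactly the one raised above: whether
reposting the same collective is legal, which this sketch deliberately avoids.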
>>>>
>>>> I believe that the primitives that we specified for fail-stop
>>>> process failure can likely be extended to support other classes of
>>>> errors without too much modification. But we will have to see as we
>>>> go along how true that becomes.
>>>>
>>>>
>>>> For Greg, this is the thread of conversation in which we will
>>>> likely want to explore transient failures and possibly even revive
>>>> the discussion about performance degradation notification (probably
>>>> performance notification is best saved for another thread). But
>>>> thinking about modern failure detectors that provide an assurance
>>>> range (accrual failure detectors are what I am thinking of
>>>> specifically), then the MPI may want to return an error indicating
>>>> something symptomatic of the eventual diagnosis of a process as
>>>> either fail-stop or transient.
>>>>
>>>>
>>>> To Kathryn's point about the MPI_T proposal, so you are suggesting
>>>> that we provide a query interface that the application can use to
>>>> determine what types of errors can be handled and how the MPI
>>>> implementation allows the application to continue after them? I
>>>> think this is a good/necessary idea for tools support - so maybe as
>>>> an extension to the MPI_T proposal. But I am hesitant to do so as
>>>> the main mechanism for processing emerging errors. I think that
>>>> applications would struggle with something that flexible since it
>>>> seems to imply that they would have to adapt not just to the error
>>>> code, but to the error code plus the reported capabilities of the MPI
>>>> implementation. So the programmer would need to have a
>>>> check for the error code, a query for capabilities, and then a
>>>> switch statement for all the possible actions that they can take. I
>>>> like the flexibility, but I think it becomes too much of a burden
>>>> on the programmer.
>>>>
>>>> I guess the counter argument would be that the MPI implementations
>>>> should be, today, documenting all expected errors that they could
>>>> return to the user. The MPI_T-like interface just provides a
>>>> programmatic way to access this information and react to it
>>>> dynamically, instead of relying on documentation updates and a
>>>> round of software updates. So this would fit nicely into the
>>>> existing requirements for MPI implementations, though at some
>>>> additional complexity to the end user.
>>>>
>>>> Interesting idea, if I understand it correctly. What do others
>>>> think?
>>>>
>>>>
>>>> -- Josh
>>>>
>>>>
>>>> On Sun, Jun 12, 2011 at 12:09 PM, Kathryn Mohror<kathryn at llnl.gov>
>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> (I sent this earlier, but I don't think it went through because I
>>>>> sent it from the wrong email address. I apologize if you get an
>>>>> extra copy.)
>>>>>
>>>>> I know I haven't participated in this working group yet, so I may
>>>>> be missing some context, but I couldn't resist putting my two
>>>>> cents in!
>>>>>
>>>>> I think that an MPI-centric approach is best. Otherwise, you run
>>>>> the risk of defining a model that doesn't fit with a particular
>>>>> implementation or machine and get shot down when it's brought to
>>>>> the forum. For example, you may remember the PERUSE performance
>>>>> interface that assumed a model of MPI that implementers didn't approve of,
>>>>> because it didn't fit their implementation or was
>>>>> difficult/expensive to support. Now, to replace PERUSE, we've got
>>>>> the MPI_T interface which doesn't specify *anything* but appears
>>>>> to be supported by the forum.
>>>>>
>>>>> I agree though that having more specific error information when
>>>>> it's available would be very useful. You might consider taking an
>>>>> approach similar to MPI_T -- allow MPI implementers to define any
>>>>> specific error codes they can/want and then provide an interface
>>>>> for decoding and interpreting the errors.
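
The decoding half of that already exists in the standard: an
implementation-specific error code can always be mapped to a standard class
and a human-readable string, which an FT library could build on. A minimal
sketch:

    #include <mpi.h>
    #include <stdio.h>

    /* Map an implementation-specific error code to its portable class and
     * its implementation-provided description. */
    void report_error(int err)
    {
        int eclass, len;
        char msg[MPI_MAX_ERROR_STRING];

        MPI_Error_class(err, &eclass);      /* standard, portable class     */
        MPI_Error_string(err, msg, &len);   /* implementation-specific text */
        fprintf(stderr, "error class %d: %s\n", eclass, msg);
    }

What an MPI_T-style extension would add is the other direction: enumerating,
before a failure ever happens, which codes and recovery actions an
implementation supports.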
>>>>>
>>>>> Of course, this approach may not be useful for most applications
>>>>> directly, but I imagine that a fault-tolerant MPI application or
>>>>> a checkpoint/restart library could make use of the information,
>>>>> assuming they could get at it.
>>>>>
>>>>> Kathryn
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 6/9/2011 8:20 AM, Howard Pritchard wrote:
>>>>>>
>>>>>> Hi Greg,
>>>>>>
>>>>>> I vote for an MPI-centric model.
>>>>>>
>>>>>> I also think that part of the job of MPI is to hide as much as
>>>>>> possible things like 'exhaustion of network resources' and
>>>>>> 'intermittent network failures'.  Indeed, the very first
>>>>>> sentence in section 2.8 says "MPI provides the user with
>>>>>> reliable message transmission".
>>>>>>
>>>>>> The only reason the topic came up yesterday was in the context
>>>>>> of the fail-stop model and what types of error codes might be
>>>>>> returned by MPI before the official verdict was in that a
>>>>>> fail-stop had occurred. Several of us checked what our
>>>>>> implementations might do prior to that, and it could include
>>>>>> returning MPI_ERR_OTHER.  I could see how for someone writing a
>>>>>> fault tolerant MPI application, something more useful than this
>>>>>> rather ambiguous error code might be worth defining.
>>>>>>
>>>>>> Howard
>>>>>>
>>>>>>
>>>>>> Bronevetsky, Greg wrote:
>>>>>>>
>>>>>>> I like the idea of having an abstract model of failures that
>>>>>>> can approximate changes in system functionality due to
>>>>>>> failures. However, I think before we go too far with this we
>>>>>>> should consider the type of model we want to make. One option
>>>>>>> is to make a system model that has as its basic elements
>>>>>>> nodes, network links and other hardware components and
>>>>>>> identifies points in time when they stop functioning. The other
>>>>>>> option is to make it MPI-centric by talking about the status
>>>>>>> of ranks and point-to-point communication between them as
>>>>>>> well as communicators and collective communication over them.
>>>>>>> So in the first type of model we can talk about network
>>>>>>> resource exhaustion and in the latter we can talk about an
>>>>>>> intermittent inability to send messages over some or all
>>>>>>> communicators.
>>>>>>>
>>>>>>> I think that the MPI-centric model is a better option since
>>>>>>> it talks exclusively about entities that exist in MPI and
>>>>>>> ignores the physical phenomena that cause a given type of
>>>>>>> degradation in functionality.
>>>>>>>
>>>>>>> The other question we need to discuss is the types of
>>>>>>> problems we want to represent. We obviously care about
>>>>>>> fail-stop failures but we're not talking about resource
>>>>>>> exhaustion. Do we want to add error classes for transient
>>>>>>> errors and if so, what about performance slowdowns?
>>>>>>>
>>>>>>> Greg Bronevetsky Lawrence Livermore National Lab (925)
>>>>>>> 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message----- From:
>>>>>>>> mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
>>>>>>>> bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey Sent:
>>>>>>>> Wednesday, June 08, 2011 11:36 AM To: MPI 3.0 Fault
>>>>>>>> Tolerance and Dynamic Process Control working Group
>>>>>>>> Subject: [Mpi3-ft] General Network Channel Failure
>>>>>>>>
>>>>>>>> It was mentioned in the conversation today that
>>>>>>>> MPI_ERR_RANK_FAIL_STOP may not be the first error returned
>>>>>>>> by an MPI call. In particular the MPI call may return an
>>>>>>>> error symptomatic of a fail-stop process failure (e.g.,
>>>>>>>> network link failed - currently MPI_ERR_OTHER), before
>>>>>>>> eventually diagnosing the event as a process failure. This
>>>>>>>> 'space between' MPI_SUCCESS behavior and
>>>>>>>> MPI_ERR_RANK_FAIL_STOP behavior is not currently defined,
>>>>>>>> and probably should be, so that the application can cleanly move
>>>>>>>> from the set of semantics for one error class to another.
>>>>>>>>
>>>>>>>> The suggestion was to create a new general network error
>>>>>>>> class (e.g., MPI_ERR_COMMUNICATION or MPI_ERR_NETWORK -
>>>>>>>> MPI_ERR_COMM is taken) that can be returned when the operation cannot
>>>>>>>> complete due to network issues (which might be later
>>>>>>>> diagnosed as process failure and escalated to the
>>>>>>>> MPI_ERR_RANK_FAIL_STOP semantics). You
>>>>>>>> could also think about this error being used for network
>>>>>>>> resource exhaustion as well (something that Tony mentioned
>>>>>>>> during the last MPI Forum meeting). In which case retrying
>>>>>>>> at a later time or taking some other action before trying
>>>>>>>> again would be useful/expected.
>>>>>>>>
>>>>>>>> There are some issues with matching, and with the implications
>>>>>>>> for collective operations. If the network error is
>>>>>>>> sticky/permanent then once the error is returned it will
>>>>>>>> always be returned or escalated to fail-stop process
>>>>>>>> failure (or more generally to a 'higher/more severe/more
>>>>>>>> detailed' error class). A recovery proposal (similar to
>>>>>>>> what we are developing for process failure) would allow the
>>>>>>>> application to 'recover' the channel and continue
>>>>>>>> communicating on it.
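
A sketch of that 'return, retry, then escalate' pattern, purely for
illustration: the two MPIX_ERR_* constants below are placeholders for the
proposed error classes and are defined here only so the example compiles.

    #include <mpi.h>
    #include <unistd.h>

    /* Placeholders for the proposed classes; not real MPI constants. */
    #define MPIX_ERR_COMMUNICATION   (MPI_ERR_LASTCODE + 1)
    #define MPIX_ERR_RANK_FAIL_STOP  (MPI_ERR_LASTCODE + 2)

    /* Receive, retrying on a general network error until it either clears
     * or escalates to a fail-stop process failure. */
    int recv_with_escalation(void *buf, int count, MPI_Datatype type,
                             int src, int tag, MPI_Comm comm)
    {
        int err, eclass;

        for (;;) {
            err = MPI_Recv(buf, count, type, src, tag, comm, MPI_STATUS_IGNORE);
            if (err == MPI_SUCCESS)
                return MPI_SUCCESS;

            MPI_Error_class(err, &eclass);
            if (eclass == MPIX_ERR_COMMUNICATION) {
                sleep(1);   /* undiagnosed network trouble: back off and retry */
                continue;
            }
            /* MPIX_ERR_RANK_FAIL_STOP or anything else: hand the error to
             * the run-through recovery path instead of retrying blindly. */
            return err;
        }
    }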
>>>>>>>>
>>>>>>>>
>>>>>>>> The feeling was that this should be expanded into a full
>>>>>>>> proposal, separate from the Run-Through Stabilization
>>>>>>>> proposal. So we can continue with the RTS proposal, and
>>>>>>>> bring this forward when it is ready.
>>>>>>>>
>>>>>>>>
>>>>>>>> What do folks think about this idea?
>>>>>>>>
>>>>>>>> -- Josh
>>>>>>>>
>>>>>>>> -- Joshua Hursey Postdoctoral Research Associate Oak Ridge
>>>>>>>> National Laboratory http://users.nccs.gov/~jjhursey
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> -- Joshua Hursey Postdoctoral Research Associate Oak Ridge National
>>>> Laboratory http://users.nccs.gov/~jjhursey
>>>>
>>>
>>
>>
>>
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft


