[Mpi3-ft] General Network Channel Failure

Josh Hursey jjhursey at open-mpi.org
Wed Jun 15 10:08:29 CDT 2011


I agree that middleware will have to step up and help support an
application in handling failures. Further, the MPI specification for
handling faults should fully support middleware/libraries that want to
do this - in fact, part of this group's mission is to define
the minimal interface required to do so.

My hesitation about a query interface is that it seems to trend
towards not specifying anything in the MPI standard about expected
behavior after a failure - which could make it difficult or impossible
to use. If we only specify a query interface that the programmer uses
to ask what the MPI implementation can provide at the moment after a
failure, and then expect the application/middleware to adjust
dynamically to the set of operations available, then this becomes
difficult to program - even for middleware.
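
To make the concern concrete, here is a rough sketch of the pattern I
am worried about. Everything prefixed MPIX_FT_ below, along with the
helper my_tree_bcast(), is invented purely for illustration (none of
it exists in the standard or in any current proposal), and
MPI_ERRORS_RETURN is assumed to be set on the communicator:

  #include <mpi.h>

  /* Hypothetical sketch: MPIX_FT_Query_capability, the MPIX_FT_CAP_*
     constants, and my_tree_bcast() are invented names, not part of
     the standard or of any proposal text. */
  void bcast_with_query(void *buf, int count, MPI_Comm comm)
  {
      int rc = MPI_Bcast(buf, count, MPI_DOUBLE, 0, comm);
      if (rc == MPI_SUCCESS) return;

      int cap = 0;
      MPIX_FT_Query_capability(comm, MPIX_FT_CAP_BCAST, &cap);
      switch (cap) {
      case MPIX_FT_CAP_FULL:        /* collective still usable: retry it */
          MPI_Bcast(buf, count, MPI_DOUBLE, 0, comm);
          break;
      case MPIX_FT_CAP_PT2PT_ONLY:  /* rebuild the bcast from point-to-point */
          my_tree_bcast(buf, count, MPI_DOUBLE, 0, comm);
          break;
      default:                      /* capability not understood: give up */
          MPI_Abort(comm, rc);
      }
  }

Every nontrivial call site ends up carrying a check, a query, and a
switch like this, and the set of cases grows with every capability the
implementation chooses to expose.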

The space of possible responses from the implementation may be too
large to enumerate. A query interface would have to categorize every
possible response an MPI implementation could give about its current
capabilities (some may not be known until the operation is called, or
may change over time). For example, broadcast works after error X as
long as it is rooted at one of ranks 0-5 with datatype D and a count no
greater than Y. Or operations MPI_Foo and MPI_Bar work, but may not be
used together in the same epoch.

The MPI_ERR_CANNOT_CONTINUE error class (currently dropped from the
RTS proposal) was a gesture towards the idea of partially supporting
the defined semantics. If an MPI implementation could only provide
some, but not all, of the functionality required by the standard after
a particular failure, then it would return some variation of
MPI_ERR_CANNOT_CONTINUE for that operation - or for that operation with
the specified arguments. It was not ideal, and maybe the query
interface is a natural evolution of a solution to a similar problem.
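
For comparison, here is a minimal sketch of how an application might
have consumed that dropped error class. MPI_ERR_CANNOT_CONTINUE is
only from the earlier RTS draft, checkpoint_and_exit() is a
placeholder for whatever fallback the application chooses, and
MPI_ERRORS_RETURN is assumed to be set on the communicator:

  #include <mpi.h>

  /* Sketch only: MPI_ERR_CANNOT_CONTINUE comes from the earlier RTS
     draft (it is not in the standard) and checkpoint_and_exit() is a
     placeholder for the application's fallback path. */
  void allreduce_or_fallback(double *in, double *out, int n, MPI_Comm comm)
  {
      int rc = MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, comm);
      if (rc == MPI_SUCCESS) return;

      int errclass;
      MPI_Error_class(rc, &errclass);
      if (errclass == MPI_ERR_CANNOT_CONTINUE) {
          /* The implementation cannot provide the full semantics of this
             operation (with these arguments) after the failure. */
          checkpoint_and_exit(comm);
      }
      /* Other error classes would be handled by the rest of the
         recovery logic. */
  }

The application only has to recognize one error class; the cost is
that the class says nothing about which operations might still work -
which is exactly the gap a query interface is trying to fill.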


I guess I need some more specifics about what a query interface would
look like for a certain class of errors to better understand the use
cases. Do you want to start a wiki page with some details and
examples?

-- Josh


On Tue, Jun 14, 2011 at 1:10 PM, Kathryn Mohror <kathryn at llnl.gov> wrote:
> Hi Josh, all,
>
> On 6/13/2011 5:06 PM, Bronevetsky, Greg wrote:
>>
>> I like the idea of a query interface. This is primarily because I
>> believe that fault tolerance should be the responsibility of
>> middleware rather than applications. Writing scalable codes is
>> already hard and writing your own reliability solutions that are also
>> scalable and robust is almost certainly too much to ask for. However,
>> if we leave the job to middleware, we'll need to provide an interface
>> that is both portable and flexible.
>
> Yes, this is the idea I was going for -- that fault tolerance would be best
> handled by libraries/middleware rather than by applications directly. Then,
> applications don't need to focus on the complexities of dealing with a query
> interface for MPI-implementation-specific error handling. That said, I am
> not sure if this interface would belong in the MPI_T interface, since that
> was written with gathering and controlling performance information in mind.
> However, if the intent is to have the interface be used primarily by tools
> (and rogue power application writers), then maybe the tools chapter is the
> best place for a query/control interface for fault tolerance, possibly under
> a different namespace: MPI_FT?
>
> Kathryn
>
>>
>> If we go for an MPI-level interface we get portability but we also
>> get complexity because MPI provides developers with a large number of
>> capabilities (send to/receive from a rank, collectively communicate
>> on a communicator, read/write a file, etc.). A full API will need to
>> allow the application to perform queries to identify which
>> capabilities are available after a given fault. A decent API for this
>> might be to allow users to query for individual capabilities and also
>> allow for short-cut APIs. For example, the run-through API focuses on
>> fail-stop failures of one or more ranks. As such, if MPI reports that
>> a rank has failed, this implies that a fixed set of capabilities is
>> no longer available. We can extend this idea to other failure models
>> that virtualize sets of capabilities under a single name, meaning
>> that a problem with a single virtual capability (slow network or
>> corrupted file) implies a specific change to a large number of
>> capabilities (all sends/receives are slow, or reads/writes to just
>> this file are invalid).
>>
>> Further, failure models themselves do not need to be specified in the
>> spec. The reason this might make sense is if we want to model various
>> levels using different abstractions. A simple way to represent
>> network failures is to treat the network as a monolithic entity. In
>> this model MPI_NETWORK_ERROR means that no communication can take
>> place. A more precise abstraction might treat the network as a set of
>> communication islands and one island can fail without causing errors
>> for communication on other islands. An even more precise model may
>> model individual rank-to-rank connections and allow the user to
>> identify which connections are valid/invalid after a failure. All
>> these models are just different portable approximations of the real
>> system state and are useful for different applications. Right now I'm
>> nervous about choosing one to be in the spec while leaving the others
>> out. However, it should be pretty useful to force MPI to implement
>> some fairly detailed choice and leave the! higher-level ones to be
>> implemented outside the spec but provide a standard interface to
>> present them to users.
>>
>> Greg Bronevetsky Lawrence Livermore National Lab (925) 424-5756
>> bronevetsky at llnl.gov http://greg.bronevetsky.com
>>
>>
>>> -----Original Message-----
>>> From: mpi3-ft-bounces at lists.mpi-forum.org
>>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
>>> Sent: Monday, June 13, 2011 2:06 PM
>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>> Subject: Re: [Mpi3-ft] General Network Channel Failure
>>>
>>> Yeah, the 'system model' is probably not going to fly with the MPI
>>> Forum, mostly because we would then be in the business of defining
>>> what a 'node', 'NIC', 'RAM', etc. are - something the standard
>>> strives not to define, for future-proofing.
>>>
>>> So I like the MPI model a bit more. For resource exhaustion we do
>>> get a bit of wiggle room with the 'reliable communication'
>>> requirement in 3.7 "If the call causes some system resource to be
>>> exhausted, then it will fail and return an error code." And for
>>> memory in 8.2 "The function MPI_ALLOC_MEM may return an error code
>>> of class MPI_ERR_NO_MEM to indicate it failed because memory is
>>> exhausted."
>>>
>>> So even though MPI shies away from defining network resources and
>>> memory, we might be able to find some acceptable language within
>>> the confines of these few error statements. So I think we can talk
>>> about connections between pairs (P2P) and groups (collectives) of
>>> processes, and maybe even connections to files (I/O). For each
>>> error class that we want to define we need to go through the
>>> process of asking "so if this error code was returned from this
>>> operation, what would that mean to the programmer? How do we want
>>> them to react to it?" Taking resource exhaustion as an example, for
>>> point-to-point messages, if the application receives such an error
>>> then it may want to repost the message later. But if the same
>>> error is returned locally from a collective operation we need to
>>> determine if the user should repost the same collective or if the
>>> collective is guaranteed to return at all other processes with
>>> either success or some other error.
>>>
>>> I believe that the primitives that we specified for fail-stop
>>> process failure can likely be extended to support other classes of
>>> errors without too much modification. But we will have to see as we
>>> go along how true that becomes.
>>>
>>>
>>> For Greg, this is the thread of conversation in which we will
>>> likely want to explore transient failures and possibly even revive
>>> the discussion about performance degradation notification (probably
>>> performance notification is best saved for another thread). But
>>> thinking about modern failure detectors that provide an assurance
>>> range (accrual failure detectors are what I am thinking of
>>> specifically), the MPI implementation may want to return an error
>>> indicating something symptomatic of the eventual diagnosis of a
>>> process as either fail-stop or transient.
>>>
>>>
>>> To Kathryn's point about the MPI_T proposal: are you suggesting
>>> that we provide a query interface that the application can use to
>>> determine what types of errors can be handled and how the MPI
>>> implementation allows the application to continue after them? I
>>> think this is a good/necessary idea for tools support - so maybe as
>>> an extension to the MPI_T proposal. But I am hesitant to do so as
>>> the main mechanism for processing emerging errors. I think that
>>> applications would struggle with something that flexible, since it
>>> seems to imply that they would have to adapt not just to the error
>>> code, but to the error code plus the reported capabilities of the MPI
>>> implementation. So the programmer would need a
>>> check for the error code, a query for capabilities, and then a
>>> switch statement over all the possible actions they can take. I
>>> like the flexibility, but I think it becomes too much of a burden
>>> on the programmer.
>>>
>>> I guess the counter argument would be that the MPI implementations
>>> should be, today, documenting all expected errors that they could
>>> return to the user. The MPI_T-like interface just provides a
>>> programmatic way to access this information and react to it
>>> dynamically, instead of relying on documentation updates and a
>>> round of software updates. So this would fit nicely into the
>>> existing requirements for MPI implementations, though at some
>>> additional complexity to the end user.
>>>
>>> Interesting idea, if I understand it correctly. What do others
>>> think?
>>>
>>>
>>> -- Josh
>>>
>>>
>>> On Sun, Jun 12, 2011 at 12:09 PM, Kathryn Mohror <kathryn at llnl.gov>
>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> (I sent this earlier, but I don't think it went through because I
>>>> sent it from the wrong email address. I apologize if you get an
>>>> extra copy.)
>>>>
>>>> I know I haven't participated in this working group yet, so I may
>>>> be missing some context, but I couldn't resist putting my two
>>>> cents in!
>>>>
>>>> I think that an MPI-centric approach is best. Otherwise, you run
>>>> the risk of defining a model that doesn't fit with a particular
>>>> implementation or machine and get shot down when it's brought to
>>>> the forum. For example, you may remember the PERUSE performance
>>>> interface that assumed a model of MPI that implementers didn't approve,
>>>> because it didn't fit their implementation or was
>>>> difficult/expensive to support. Now, to replace PERUSE, we've got
>>>> the MPI_T interface which doesn't specify *anything* but appears
>>>> to be supported by the forum.
>>>>
>>>> I agree though that having more specific error information when
>>>> it's available would be very useful. You might consider taking an
>>>> approach similar to MPI_T -- allow MPI implementers to define any
>>>> specific error codes they can/want and then provide an interface
>>>> for decoding and interpreting the errors.
>>>>
>>>> Of course, this approach may not be useful for most applications
>>>> directly, but I imagine that a fault-tolerant MPI application or
>>>> a checkpoint/restart library could make use of the information,
>>>> assuming they could get at it.
>>>>
>>>> Kathryn
>>>>
>>>>
>>>>
>>>>
>>>> On 6/9/2011 8:20 AM, Howard Pritchard wrote:
>>>>>
>>>>> Hi Greg,
>>>>>
>>>>> I vote for an MPI-centric model.
>>>>>
>>>>> I also think that part of the job of MPI is to hide as much as
>>>>> possible things like 'exhaustion of network resources' and
>>>>> 'intermittent network failures'.  Indeed, the very first
>>>>> sentence in section 2.8 says "MPI provides the user with
>>>>> reliable message transmission".
>>>>>
>>>>> The only reason the topic came up yesterday was in the context
>>>>> of the fail-stop model and what types of error codes might be
>>>>> returned by MPI before the official verdict was in that a
>>>>> fail-stop had occurred. Several of us checked what our
>>>>> implementations might do prior to that, and it could include
>>>>> returning MPI_ERR_OTHER.  I could see how, for someone writing a
>>>>> fault-tolerant MPI application, something more useful than this
>>>>> rather ambiguous error code might be worth defining.
>>>>>
>>>>> Howard
>>>>>
>>>>>
>>>>> Bronevetsky, Greg wrote:
>>>>>>
>>>>>> I like the idea of having an abstract model of failures that
>>>>>> can approximate changes in system functionality due to
>>>>>> failures. However, I think before we go too far with this we
>>>>>> should consider the type of model we want to make. One option
>>>>>> is to make a system model that has as its basic elements
>>>>>> nodes, network links, and other hardware components, and
>>>>>> identifies points in time when they stop functioning. The other
>>>>>> option is to make it MPI-centric by talking about the status
>>>>>> of ranks and point-to-point communication between them as
>>>>>> well as communicators and collective communication over them.
>>>>>> So in the first type of model we can talk about network
>>>>>> resource exhaustion and in the latter we can talk about an
>>>>>> intermittent inability to send messages over some or all communicators.
>>>>>>
>>>>>> I think that the MPI-centric model is a better option since
>>>>>> it talks exclusively about entities that exist in MPI and
>>>>>> ignores the physical phenomena that cause a given type of
>>>>>> degradation in functionality.
>>>>>>
>>>>>> The other question we need to discuss is the types of
>>>>>> problems we want to represent. We obviously care about
>>>>>> fail-stop failures but we're not talking about resource
>>>>>> exhaustion. Do we want to add error classes for transient
>>>>>> errors and if so, what about performance slowdowns?
>>>>>>
>>>>>> Greg Bronevetsky Lawrence Livermore National Lab (925)
>>>>>> 424-5756 bronevetsky at llnl.gov http://greg.bronevetsky.com
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: mpi3-ft-bounces at lists.mpi-forum.org
>>>>>>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
>>>>>>> Sent: Wednesday, June 08, 2011 11:36 AM
>>>>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>>>>> Subject: [Mpi3-ft] General Network Channel Failure
>>>>>>>
>>>>>>> It was mentioned in the conversation today that
>>>>>>> MPI_ERR_RANK_FAIL_STOP may not be the first error returned
>>>>>>> by an MPI call. In particular the MPI call may return an
>>>>>>> error symptomatic of a fail-stop process failure (e.g.,
>>>>>>> network link failed - currently MPI_ERR_OTHER), before
>>>>>>> eventually diagnosing the event as a process failure. This
>>>>>>> 'space between' MPI_SUCCESS behavior and
>>>>>>> MPI_ERR_RANK_FAIL_STOP behavior is not currently defined,
>>>>>>> and probably should be, so that the application can cleanly move
>>>>>>> from the set of semantics for one error class to another.
>>>>>>>
>>>>>>> The suggestion was to create a new general network error
>>>>>>> class (e.g., MPI_ERR_COMMUNICATION or MPI_ERR_NETWORK -
>>>>>>> MPI_ERR_COMM is taken) that can be returned when the operation cannot
>>>>>>> complete due to network issues (which might be later
>>>>>>> diagnosed as process failure and escalated to the
>>>>>>> MPI_ERR_RANK_FAIL_STOP semantics). You
>>>>>>> could also think about this error being used for network
>>>>>>> resource exhaustion as well (something that Tony mentioned
>>>>>>> during the last MPI Forum meeting). In which case retrying
>>>>>>> at a later time or taking some other action before trying
>>>>>>> again would be useful/expected.
>>>>>>>
>>>>>>> There are some issues with matching, and the implications
>>>>>>> for collective operations. If the network error is
>>>>>>> sticky/permanent then once the error is returned it will
>>>>>>> always be returned or escalated to fail-stop process
>>>>>>> failure (or more generally to a 'higher/more severe/more
>>>>>>> detailed' error class). A recovery proposal (similar to
>>>>>>> what we are developing for process failure) would allow the
>>>>>>> application to 'recover' the channel and continue
>>>>>>> communicating on it.
>>>>>>>
>>>>>>>
>>>>>>> The feeling was that this should be expanded into a full
>>>>>>> proposal, separate from the Run-Through Stabilization
>>>>>>> proposal. So we can continue with the RTS proposal, and
>>>>>>> bring this forward when it is ready.
>>>>>>>
>>>>>>>
>>>>>>> What do folks think about this idea?
>>>>>>>
>>>>>>> -- Josh
>>>>>>>
>>>>>>> -- Joshua Hursey Postdoctoral Research Associate Oak Ridge
>>>>>>> National Laboratory http://users.nccs.gov/~jjhursey
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> -- Joshua Hursey Postdoctoral Research Associate Oak Ridge National
>>> Laboratory http://users.nccs.gov/~jjhursey
>>>
>>
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



