[Mpi3-ft] General Network Channel Failure

Josh Hursey jjhursey at open-mpi.org
Mon Jun 13 16:05:50 CDT 2011

Yeah, the 'system model' is probably not going to fly with the MPI
Forum, mostly because we would then be in the business of defining
what a 'node', 'NIC', 'RAM', etc. are - something the standard strives
not to define, for future-proofing.

So I like the MPI model a bit more. For resource exhaustion we do get
a bit of wiggle room with the 'reliable communication' requirement in
Section 3.7: "If the call causes some system resource to be exhausted,
then it will fail and return an error code." And for memory, in
Section 8.2: "The function MPI_ALLOC_MEM may return an error code of
class MPI_ERR_NO_MEM to indicate it failed because memory is
exhausted."
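
As a concrete (if simplistic) sketch of reacting to that second
clause - assuming the error handler has been switched to
MPI_ERRORS_RETURN so the error code actually comes back to the
caller - I am picturing something like:

#include <stdlib.h>
#include <mpi.h>

/* Minimal sketch: react to the MPI_ERR_NO_MEM class from 8.2.
 * Assumes MPI_ERRORS_RETURN instead of the default
 * MPI_ERRORS_ARE_FATAL, so the error code is returned to us. */
void *alloc_special_or_fallback(MPI_Aint size)
{
    void *buf = NULL;
    int rc = MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);
    if (rc != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPI_ERR_NO_MEM)
            buf = malloc(size); /* exhausted: fall back to plain malloc */
        else
            buf = NULL;         /* some other failure */
    }
    return buf;
}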

So even though MPI shies away from defining network resources and
memory, we might be able to find some acceptable language within the
confines of these few error statements. So I think we can talk about
connections between pairs (P2P) and groups (collectives) of processes,
and maybe even connections to files (I/O). For each error class that
we want to define, we need to go through the process of asking "if
this error code was returned from this operation, what would that mean
to the programmer? How do we want them to react to it?" Taking
resource exhaustion as an example: for point-to-point messages, if the
application receives such an error then it may want to repost the
message later (see the sketch below). But if the same error is
returned locally from a collective operation, we need to determine
whether the user should repost the same collective, or whether the
collective is guaranteed to return at all other processes with either
success or some other error.
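
For the point-to-point case, I am picturing something like the
following (MPIX_ERR_RESOURCE is a placeholder name I just made up for
whatever class we would define; nothing like it exists today):

#include <unistd.h>
#include <mpi.h>

/* Hypothetical sketch of 'repost the message later'.
 * MPIX_ERR_RESOURCE is an invented error class standing in for a
 * to-be-defined resource exhaustion class, and the backoff policy
 * is purely illustrative. */
int send_with_repost(void *buf, int count, MPI_Datatype type,
                     int dest, int tag, MPI_Comm comm, int max_tries)
{
    int rc = MPI_SUCCESS;
    for (int attempt = 0; attempt < max_tries; ++attempt) {
        rc = MPI_Send(buf, count, type, dest, tag, comm);
        if (rc == MPI_SUCCESS)
            return MPI_SUCCESS;
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass != MPIX_ERR_RESOURCE)
            return rc;           /* not exhaustion: report it up */
        usleep(1000 << attempt); /* back off, then repost */
    }
    return rc;
}

The collective question is exactly whether a loop like this is even
well defined when the operation is a collective instead of MPI_Send.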

I believe that the primitives that we specified for fail-stop process
failure can likely be extended to support other classes of errors
without too much modification. But we will have to see as we go along
how true that becomes.

For Greg, this is the thread of conversation in which we will likely
want to explore transient failures, and possibly even revive the
discussion about performance degradation notification (though
performance notification is probably best saved for another thread).
Thinking about modern failure detectors that provide an assurance
range (accrual failure detectors are what I am thinking of
specifically), MPI may want to return an error indicating something
symptomatic of the eventual diagnosis of a process as either
fail-stop or transient.
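
(For reference, the 'assurance' an accrual detector provides is a
continuous suspicion level rather than a binary verdict. A rough
sketch, under the simplifying assumption of exponentially distributed
heartbeat inter-arrival times:

#include <math.h>

/* Sketch of an accrual failure detector's suspicion level (phi),
 * assuming heartbeat inter-arrivals are exponential with mean
 * 'mean_interval'; 'elapsed' is the time since the last heartbeat.
 * phi grows the longer we wait, so MPI could map low phi to a
 * transient/network error class and high phi to an eventual
 * fail-stop diagnosis. */
double phi_suspicion(double elapsed, double mean_interval)
{
    double p_later = exp(-elapsed / mean_interval); /* P(still alive) */
    return -log10(p_later); /* = elapsed / (mean_interval * ln 10) */
}

Real detectors fit the distribution from observed heartbeat history
instead of assuming it.)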

To Kathryn's point about the MPI_T proposal: you are suggesting that
we provide a query interface that the application can use to determine
what types of errors can be handled, and how the MPI implementation
allows the application to continue after them? I think this is a
good/necessary idea for tools support - so maybe as an extension to
the MPI_T proposal. But I am hesitant to adopt it as the main
mechanism for processing emerging errors. I think that applications
would struggle with something that flexible, since it seems to imply
that they would have to adapt not to the error code alone, but to the
error code plus the reported capabilities of the MPI implementation.
So the programmer would need a check for the error code, a query for
capabilities, and then a switch statement for all the possible actions
they could take. I like the flexibility, but I think it becomes too
much of a burden on the application developer.
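
Concretely, I am picturing every error site looking something like
this (both MPIX_Error_get_capability and the MPIX_CAP_* constants are
names I invented for this sketch; no such interface exists):

#include <mpi.h>

/* Hypothetical sketch of the per-error burden described above: a
 * check, a capability query, then a switch. The MPIX_* query and
 * constants are invented for illustration only. */
void react_to_error(MPI_Comm comm, int rc, int *repost)
{
    int cap;
    *repost = 0;
    MPIX_Error_get_capability(comm, rc, &cap); /* invented query */
    switch (cap) {
    case MPIX_CAP_RETRY_SAFE:
        *repost = 1; /* implementation says reposting is safe */
        break;
    case MPIX_CAP_COMM_RECOVER:
        /* application-specific communicator recovery goes here */
        break;
    default:
        MPI_Abort(comm, rc); /* no usable guidance */
    }
}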

I guess the counter-argument would be that MPI implementations
should, today, be documenting all expected errors that they could
return to the user. The MPI_T-like interface just provides a
programmatic way to access this information and react to it
dynamically, instead of relying on documentation updates and a round
of software updates. So this would fit nicely into the existing
requirements for MPI implementations, though at some additional
complexity for the end user.

Interesting idea, if I understand it correctly. What do others think?

-- Josh

On Sun, Jun 12, 2011 at 12:09 PM, Kathryn Mohror <kathryn at llnl.gov> wrote:
> Hi all,
> (I sent this earlier, but I don't think it went through because I sent it
> from the wrong email address. I apologize if you get an extra copy.)
> I know I haven't participated in this working group yet, so I may be
> missing some context, but I couldn't resist putting my two cents in!
> I think that an MPI-centric approach is best. Otherwise, you run the
> risk of defining a model that doesn't fit a particular implementation
> or machine, and getting shot down when it's brought to the forum. For
> example, you may remember the PERUSE performance interface, which
> assumed a model of MPI that implementers didn't approve of, because it
> didn't fit their implementations or was difficult/expensive to
> support. Now, to replace PERUSE, we've got the MPI_T interface, which
> doesn't specify *anything* but appears to be supported by the forum.
> I agree though that having more specific error information when it's
> available would be very useful. You might consider taking an approach
> similar to MPI_T -- allow MPI implementers to define any specific error
> codes they can/want and then provide an interface for decoding and
> interpreting the errors.
> Of course, this approach may not be useful for most applications
> directly, but I imagine that a fault-tolerant MPI application or a
> checkpoint/restart library could make use of the information, assuming
> they could get at it.
> Kathryn
> On 6/9/2011 8:20 AM, Howard Pritchard wrote:
>> Hi Greg,
>> I vote for an MPI-centric model.
>> I also think that part of the job of MPI is to hide as much
>> as possible things like 'exhaustion of network resources'
>> and 'intermittent network failures'.  Indeed, the very first
>> sentence in section 2.8 says "MPI provides the user with
>> reliable message transmission".
>> The only reason the topic came up yesterday was in the
>> context of the fail-stop model and what types of error
>> codes might be returned by MPI before the official
>> verdict was in that a fail-stop had occurred.  Several of
>> us checked what our implementations might do prior to
>> that, and it could include returning MPI_ERR_OTHER.  I
>> could see how for someone writing a fault tolerant MPI
>> application, something more useful than this rather ambiguous
>> error code might be worth defining.
>> Howard
>> Bronevetsky, Greg wrote:
>>> I like the idea of having an abstract model of failures that can
>>> approximate changes in system functionality due to failures. However, I
>>> think before we go too far with this we should consider the type of model we
>>> want to make. One option is to make a system model that has as its basic
>>> elements nodes, network links, and other hardware components, and that
>>> identifies points in time when they stop functioning. The other option is
>>> to make it
>>> MPI-centric by talking about the status of ranks and point-to-point
>>> communication between them as well as communicators and collective
>>> communication over them. So in the first type of model we can talk about
>>> network resource exhaustion and in the latter we can talk about an
>>> intermittent inability to send messages over some or all communicators.
>>> I think that the MPI-centric model is a better option since it talks
>>> exclusively about entities that exist in MPI and ignores the physical
>>> phenomena that cause a given type of degradation in functionality.
>>> The other question we need to discuss is the types of problems we want to
>>> represent. We obviously care about fail-stop failures, but we're not talking
>>> about resource exhaustion. Do we want to add error classes for transient
>>> errors, and if so, what about performance slowdowns?
>>> Greg Bronevetsky
>>> Lawrence Livermore National Lab
>>> (925) 424-5756
>>> bronevetsky at llnl.gov
>>> http://greg.bronevetsky.com
>>>> -----Original Message-----
>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
>>>> bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
>>>> Sent: Wednesday, June 08, 2011 11:36 AM
>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>> Subject: [Mpi3-ft] General Network Channel Failure
>>>> It was mentioned in the conversation today that MPI_ERR_RANK_FAIL_STOP
>>>> may not be the first error returned by an MPI call. In particular, the MPI
>>>> call may return an error symptomatic of a fail-stop process failure (e.g.,
>>>> a failed network link - currently MPI_ERR_OTHER) before eventually
>>>> diagnosing the event as a process failure. This 'space between'
>>>> MPI_SUCCESS behavior and MPI_ERR_RANK_FAIL_STOP behavior is not currently
>>>> defined, and probably should be, so that the application can cleanly move
>>>> from the set of semantics for one error class to another.
>>>> The suggestion was to create a new general network error class that can
>>>> be returned when the operation cannot complete due to network issues
>>>> (which might later be diagnosed as process failure and escalated to the
>>>> MPI_ERR_RANK_FAIL_STOP semantics). You could also think about this error
>>>> being used for network resource exhaustion (something that Tony mentioned
>>>> during the last MPI Forum meeting), in which case retrying at a later
>>>> time, or taking some other action before trying again, would be
>>>> useful/expected.
>>>> There are some issues with matching, and implications for collective
>>>> operations. If the network error is sticky/permanent, then once the error
>>>> is returned it will always be returned, or escalated to fail-stop process
>>>> failure (or, more generally, to a 'higher/more severe/more detailed' error
>>>> class). A recovery proposal (similar to what we are developing for process
>>>> failure) would allow the application to 'recover' the channel and continue
>>>> communicating on it.
>>>> The feeling was that this should be expanded into a full proposal,
>>>> separate from the Run-Through Stabilization proposal. So we can continue
>>>> with the RTS proposal, and bring this forward when it is ready.
>>>> What do folks think about this idea?
>>>> -- Josh
>>>> --
>>>> Joshua Hursey
>>>> Postdoctoral Research Associate
>>>> Oak Ridge National Laboratory
>>>> http://users.nccs.gov/~jjhursey

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
