[Mpi3-ft] "Error" to "Fault" changes

Josh Hursey jjhursey at open-mpi.org
Tue Jun 28 17:11:08 CDT 2011

First, thanks for taking a first pass at this. Your changes got me
thinking about what you/we mean by 'faults' versus 'errors' and how
they should be used in the document. This led me down the path of
researching terminology to see if I can find precise, agreed-upon
definitions of both.

I found a bunch of papers, but the three cited at the bottom do a good
job of summarizing what the IEEE seems to have agreed upon. It is a bit
difficult to define just one piece without all of the supporting
definitions (e.g., what is a 'system' or 'service'), but I took a stab
at it below:

An error is a deviation from the expected, correct operation of the
system (e.g., MPI library, MPI operation). Errors are caused by
faults in one or more components of the system (e.g., memory
corruption, network link failure).

A failure is when the intended function of the system (e.g., MPI
library, MPI operation) cannot be delivered because of one or more

A couple of interesting quotes from the papers:
 [2] "Fault tolerance means to avoid service failures in the presence
of faults."
 [3] "Fault Tolerance is carried out by error processing, which may be
automatic or operator-assisted."

So I am going back and forth on whether I like the changes from
'errors' to 'faults' in this document, since faults may be latent
until they are encountered, at which point they become errors. If the
error leads to an unexpected state then it is termed a failure. So in
these changes, are we referring to the 'fault', which the user may not
have access to, or the 'error' that results from the fault, which the
user may be affected by?
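
To make the distinction concrete, here is a toy sketch of the
fault -> error -> failure chain as I read it from the definitions
above. It is not MPI code and not taken from the papers; every name
in it (Memory, DetectedError, inject_fault, service) is hypothetical
and only there for illustration.

```python
class DetectedError(Exception):
    """Hypothetical: raised when a latent fault is activated by an access."""


class Memory:
    """Hypothetical component that may contain latent faults."""

    def __init__(self, cells):
        self.cells = dict(cells)
        self.latent_faults = set()  # faulty addresses nobody has touched yet

    def inject_fault(self, addr):
        # The fault now exists, but it stays latent until the cell is read.
        self.latent_faults.add(addr)

    def read(self, addr):
        # Reading a faulty cell activates the fault: this is the error.
        if addr in self.latent_faults:
            raise DetectedError(f"corrupted cell {addr}")
        return self.cells[addr]


def service(primary, addr, replica=None):
    """Return ("ok", value) if the service is delivered, else ("failure", None)."""
    try:
        return ("ok", primary.read(addr))
    except DetectedError:
        # Error processing: try a redundant copy, if the caller supplied one.
        if replica is not None:
            try:
                return ("ok", replica.read(addr))
            except DetectedError:
                pass
        # The error could not be masked, so the service fails.
        return ("failure", None)


mem = Memory({0: 10, 1: 20})
mem.inject_fault(1)

latent = service(mem, 0)                                   # fault never touched
failed = service(mem, 1)                                   # error, no recovery
masked = service(mem, 1, replica=Memory({0: 10, 1: 20}))   # error is masked
```

In this sketch the caller of service() only ever sees the outcome of
the error (masked or not); the fault itself stays hidden inside
Memory, which is the distinction I am trying to get at.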

For a terminology change this big throughout the document, we should
have a solid foundation to base it on. But I keep going back and forth
in my mind trying to figure out which is the 'right' way: what is
there vs. what is being proposed.

Sorry Darius, this is not really the type of feedback that you were
probably expecting.

What do others think about all this?

-- Josh

[1] "Defect, Fault, Error,..., or Failure?", 1997

[2] "Basic concepts and taxonomy of dependable and secure computing", 2004.


On Mon, Jun 27, 2011 at 5:59 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> Here are the changes I made.  I've labeled them with ticket999 until we decide how to proceed.  I've included a diff and a pdf. In the pdf, just search for ticket999.
> Please let me know what you think.
> Thanks,
> -d
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
