[Mpi3-ft] "Error" to "Fault" changes

Darius Buntinas buntinas at mcs.anl.gov
Tue Jun 28 17:39:43 CDT 2011

Actually, this is the kind of feedback I hoped for. :-)  I agree that we need to get it right.

I believe that we can replace "error" with "error code" (or "error class," as appropriate) almost everywhere in the changed document (the one I sent out).  That might make it easier to understand wrt the "error" definition below.

On the other hand, there were places in the standard where "exception" was used.  We can define an exception as a detected error.  So:  A fault might result in an error.  If MPI detects an error, it will raise an exception.
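
As a rough illustration of that chain, here is a minimal Python sketch. It is not MPI code: Link, send, DetectedError, and the error class name ERR_TRUNCATE are all made up for the example, and the "detection" is just a comparison of what was sent with what arrived.

```python
from dataclasses import dataclass


class DetectedError(Exception):
    """An 'exception' in the sense above: an error that has been detected."""

    def __init__(self, error_class: str) -> None:
        super().__init__(error_class)
        self.error_class = error_class


@dataclass
class Link:
    """Hypothetical component; `faulty` models a fault that may stay latent."""

    faulty: bool = False

    def transmit(self, payload: bytes) -> bytes:
        # The fault only turns into an error if it actually changes the result.
        return payload[:-1] if self.faulty else payload


def send(link: Link, payload: bytes) -> int:
    """Hypothetical operation: returns the number of bytes delivered."""
    delivered = link.transmit(payload)
    if delivered != payload:
        # The error is detected here, so an exception is raised.
        raise DetectedError("ERR_TRUNCATE")  # hypothetical error class
    return len(delivered)
```

Note that a faulty Link asked to send an empty payload returns normally: the fault is present, but it does not result in an error, so nothing is raised.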

Does that help any?


On Jun 28, 2011, at 5:11 PM, Josh Hursey wrote:

> First, thanks for taking a first pass at this. Your changes got me
> thinking about what you/we mean by 'faults' versus 'errors' and how
> they should be used in the document. This led me down the path of
> researching terminology to see if I can find precise, agreed-upon
> definitions of both.
> I found a bunch of papers, but the three cited at bottom do a good job
> of summarizing what the IEEE seems to have agreed upon. It is a bit
> difficult to define just one piece without all of the supporting
> definitions (e.g., what is a 'system' or 'service'), but I took a stab
> at it below:
> Error:
> ------
> An error is a deviation from the expected behavior or correct operation
> of the system (e.g., MPI library, MPI operation). Errors are caused by
> faults in one or more components of the system (e.g., memory
> corruption, network link failure).
> Failure:
> --------
> A failure is when the intended function of the system (e.g., MPI
> library, MPI operation) cannot be delivered because of one or more
> errors.
> A couple of interesting quotes from the papers:
> [2] "Fault tolerance means to avoid service failures in the presence
> of faults."
> [3] "Fault Tolerance is carried out by error processing, which may be
> automatic or operator-assisted."
> So I am going back and forth on whether I like the changes from
> 'errors' to 'faults' in this document, since faults may be latent
> until they are encountered, at which point they become errors. If the
> error leads to an unexpected state, then it is termed a failure. So in
> these changes, are we referring to the 'fault', which the user may not
> have access to, or the 'error' that results from the fault, which the
> user may be affected by?
> For a terminology change this large throughout the document, we should
> have a solid foundation to base it on. But I keep going
> back and forth in my mind trying to figure out which is the 'right'
> way - what is there vs. what is being proposed.
> Sorry Darius, this is not really the type of feedback that you were
> probably expecting.
> What do others think about all this?
> -- Josh
> ----------------
> [1] "Defect, Fault, Error,..., or Failure?", 1997
> http://dx.doi.org/10.1109/TR.1997.693776
> [2] "Basic concepts and taxonomy of dependable and secure computing", 2004.
> http://dx.doi.org/10.1109/TDSC.2004.2
> [3] http://dx.doi.org/10.1109/FTCSH.1995.532603
> ----------------
> On Mon, Jun 27, 2011 at 5:59 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>> Here are the changes I made.  I've labeled them with ticket999 until we decide how to proceed.  I've included a diff and a pdf. In the pdf, just search for ticket999.
>> Please let me know what you think.
>> Thanks,
>> -d
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey