[Mpi3-ft] "Error" to "Fault" changes
Josh Hursey
jjhursey at open-mpi.org
Wed Jun 29 08:46:02 CDT 2011
Going through the changes, some of the changes sound good and others
I'm not sure about. Instead of iterating through them on the mailing
list, let's make sure to go through the changes on the call today. I
think talking through them will help. We need to deal with any issues
in the RTS proposal first (but I don't expect that to take long) then
we can devote the rest of the time to this proposal.
A few notes:
* If we like the definitions of Error and Failure presented in this
thread, then we should refine those and replace the definitions in the
current document before it ships.
* This is a nice wording: P23,L8: Whenever possible, MPI calls return
an error code if [an error]a fault occurred during the call.
* I like the 'an error is raised' wording. I think that is clear. The
notion of an MPI Exception is not so clear to me. If we are going to
use 'exception' we should clearly differentiate it from 'error' and
'fault' - more so than the standard currently does on P302,L29.
* I'm concerned that on P22,L46 and P23,L1 we are changing the terms
being defined. This seems pretty disruptive. I would almost prefer
defining 'program/resource error' and 'program/resource fault'
separately here so that the reader understands the differentiation.
From the reading it may be useful to remember that a fault at one
level of the system may be an error at a higher level of the system -
since the definition of a system is recursive to the base case of an
atomic component. Not sure if that helps, but it is one of the aspects
of the language that I am getting hung up on.
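To make that recursive point concrete, here is a minimal sketch in Python. It is only an illustration of the terminology, not proposed standard text and not how any MPI implementation works. The Component class, its has_fault and masks_errors flags, and the three example layers are all assumptions made up for this example; has_fault stands for a fault that has already been activated, not a latent one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A 'system' in the recursive sense: atomic, or built from parts."""
    name: str
    parts: List["Component"] = field(default_factory=list)
    has_fault: bool = False      # hypothetical: an activated fault in this component
    masks_errors: bool = False   # hypothetical: this level tolerates its errors

    def in_error(self) -> bool:
        # An error exists at this level if a local fault is active or a part
        # has failed. The failure of a part is a fault as seen from here.
        return self.has_fault or any(p.has_failed() for p in self.parts)

    def has_failed(self) -> bool:
        # A failure is an error that this level could not mask.
        return self.in_error() and not self.masks_errors


# Three illustrative levels, with the atomic component at the bottom.
link = Component("network link", has_fault=True)
transport = Component("transport layer", parts=[link])
mpi_masking = Component("MPI library (masking)", parts=[transport], masks_errors=True)
mpi_plain = Component("MPI library (not masking)", parts=[transport])

for lib in (mpi_masking, mpi_plain):
    print(lib.name, "| error:", lib.in_error(), "| failure:", lib.has_failed())
```

In this toy model the same link fault is an error at every level above it, but it only becomes a failure of the library when no level masks it. That is the distinction I keep tripping over when deciding which word the user-facing text should use.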
-- Josh
On Tue, Jun 28, 2011 at 6:39 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>
> Actually, this is the kind of feedback I hoped for. :-) I agree that we need to get it right.
>
> I believe that we can replace "error" with "error code" (or "error class," as appropriate) almost everywhere in the changed document (the one I sent out). That might make it easier to understand wrt the "error" definition below.
>
> On the other hand, there were places in the standard where "exception" was used. We can define an exception as a detected error. So: A fault might result in an error. If MPI detects an error, it will raise an exception.
>
> Does that help any?
>
> -d
>
> On Jun 28, 2011, at 5:11 PM, Josh Hursey wrote:
>
>> First, thanks for taking a first pass at this. Your changes got me
>> thinking about what you/we mean by 'faults' versus 'errors' and how
>> they should be used in the document. This led me down the path of
>> researching terminology to see if I can find precise, agreed upon
>> definitions of both.
>>
>> I found a bunch of papers, but the three cited at bottom do a good job
>> of summarizing what the IEEE seems to have agreed upon. It is a bit
>> difficult to define just one piece without all of the supporting
>> definitions (e.g., what is a 'system' or 'service'), but I took a stab
>> at it below:
>>
>>
>> Error:
>> ------
>> An error is the deviation of expected behavior from correct operation
>> of the system (e.g., MPI library, MPI operation). Errors are caused by
>> faults in one or more components of the system (e.g., memory
>> corruption, network link failure).
>>
>> Failure:
>> --------
>> A failure is when the intended function of the system (e.g., MPI
>> library, MPI operation) cannot be delivered because of one or more
>> errors.
>>
>>
>> A couple of interesting quotes from the papers:
>> [2] "Fault tolerance means to avoid service failures in the presence
>> of faults."
>> [3] "Fault Tolerance is carried out by error processing, which may be
>> automatic or operator-assisted."
>>
>>
>> So I am going back and forth on whether I like the changes from
>> 'errors' to 'faults' in this document, since faults may be latent
>> until they are encountered, at which point they become errors. If the
>> error leads to an unexpected state then it is termed a failure. So in
>> these changes are we referring to the 'fault', which the user may not
>> have access to, or the 'error' that results from the fault, which the
>> user may be affected by?
>>
>> For this big of a terminology change throughout the document we should
>> have a solid foundation that we are basing it off of. But I keep going
>> back and forth in my mind trying to figure out which is the 'right'
>> way - what is there vs. what is being proposed.
>>
>> Sorry Darius, this is not really the type of feedback that you were
>> probably expecting.
>>
>> What do others think about all this?
>>
>> -- Josh
>>
>> ----------------
>> [1] "Defect, Fault, Error,..., or Failure?", 1997
>> http://dx.doi.org/10.1109/TR.1997.693776
>>
>> [2] "Basic concepts and taxonomy of dependable and secure computing", 2004.
>> http://dx.doi.org/10.1109/TDSC.2004.2
>>
>> [3] "DEPENDABLE COMPUTING AND FAULT TOLERANCE : CONCEPTS AND TERMINOLOGY", 1995.
>> http://dx.doi.org/10.1109/FTCSH.1995.532603
>> ----------------
>>
>>
>> On Mon, Jun 27, 2011 at 5:59 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>>>
>>> Here are the changes I made. I've labeled them with ticket999 until we decide how to proceed. I've included a diff and a pdf. In the pdf, just search for ticket999.
>>>
>>> Please let me know what you think.
>>>
>>> Thanks,
>>> -d
>>>
>>>
>>> _______________________________________________
>>> mpi3-ft mailing list
>>> mpi3-ft at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>
>>
>>
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>>
>
>
--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey