[Mpi3-ft] list of opaque objects and other MPI entities on the list

Supalov, Alexander alexander.supalov at intel.com
Thu Dec 6 08:47:41 CST 2012


Thanks. That makes sense, as long as what the standard leaves unspecified may become well specified, and if necessary user controllable, in a particular implementation (e.g., CRC on/off, link timeout, node timeout, and so on).

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Aurélien Bouteiller
Sent: Thursday, December 06, 2012 3:23 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] list of opaque objects and other MPI entities on the list

Alexander, 

We had this exact discussion yesterday. The short highlights are the following: 
* We model process failures only, but other failure modes can be handled by converting them into fail-stop failures, on an implementation-decided basis.

* Transient failures can be of two kinds: 
  * A processor is slow/unresponsive for a time, but comes back. The decision of when to declare this a process failure is left to the implementation, but once it has done so, the process will be ignored/disposed of, even if the failure later proves transient. 
  * A processor cannot be accessed by group of processes A, but can still be accessed by group of processes B (network partition). In that case, some processes in group A will report a process failure to the application. It is then the responsibility of the user to propagate this error detection (most likely through revoke) if the application's communication pattern cannot tolerate such a condition (or to continue business as usual if the pattern permits, but MPI cannot know that). 
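The "sticky" conversion of a transient suspicion into a permanent fail-stop decision described above can be sketched as a small state machine. This is plain C for illustration only, not part of any MPI API; the state names and tick-based timeout are hypothetical:

```c
#include <stdbool.h>

/* Illustrative detector states -- not part of any MPI API. */
typedef enum { PEER_ALIVE, PEER_SUSPECT, PEER_DEAD } peer_state;

/* One detector step for a single peer. The timeout is an implementation
 * choice; PEER_DEAD is absorbing: once declared, the peer stays dead
 * even if it later responds again. */
peer_state update_peer(peer_state s, bool heartbeat_seen,
                       int silent_ticks, int timeout_ticks)
{
    if (s == PEER_DEAD)
        return PEER_DEAD;      /* fail-stop decision is irrevocable */
    if (heartbeat_seen)
        return PEER_ALIVE;     /* transient: the peer came back in time */
    if (silent_ticks >= timeout_ticks)
        return PEER_DEAD;      /* implementation-chosen point of no return */
    return PEER_SUSPECT;
}
```

Note that the absorbing PEER_DEAD state is what makes the detector's answers consistent over time, which matters more to the application than the exact timeout value.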

We think that we should intentionally refrain from forcing a particular behavior/timeout on implementors in the case of transient errors. A "perfect" failure detector is impossible, both in theory and in practice. We decided it is better to report false positives (declare some live processes dead) than the contrary (which leads to application deadlock). The options for providing failure detection (and discriminating between transient and fail-stop failures) are numerous; some implementations may have access to better (hardware-based) detection mechanisms than others, and specifying something too strong about the quality of failure detection could impose an unacceptable performance penalty in some cases. 
Also, we see no reason for the user to be involved in setting such timeouts (in fact, that would hurt portability: the user could set values that work well on machine A but are terrible on machine B, leading to processes being declared dead en masse). 


A related issue arises with routed point-to-point communication. A route may be broken because an intermediate hop node is dead. It is then the responsibility of the implementation to silently reroute around failures; the implementation should not return errors to the user as long as it can reasonably reach a non-dead process. If a process becomes definitely unreachable, it will be reported as dead, and the application will have to assume so during recovery. Again, we refrained from mandating such behavior, as it can be expensive: an implementation is free to return an error as soon as a route breaks, but a high-quality implementation should strive to avoid reporting false positives. 
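The "reroute around failures, report a failure only when nothing is left" policy above can be sketched with a toy route table (plain C; the data layout and function name are illustrative, not any real MPI internals):

```c
#include <stdbool.h>

#define MAX_HOPS 3

/* Illustrative route selection: return the index of the first route
 * whose intermediate hops are all alive, or -1 if every route is
 * broken (only then would the destination be reported as failed).
 * Hops are node ids; -1 marks the end of a route's hop list. */
int pick_route(int routes[][MAX_HOPS], int nroutes, const bool dead[])
{
    for (int r = 0; r < nroutes; r++) {
        bool usable = true;
        for (int h = 0; h < MAX_HOPS && routes[r][h] >= 0; h++) {
            if (dead[routes[r][h]]) {   /* this hop node has failed */
                usable = false;
                break;
            }
        }
        if (usable)
            return r;                   /* silently reroute here */
    }
    return -1;                          /* definitely unreachable */
}
```

An implementation returning an error on the first broken route corresponds to checking only `routes[0]`; the high-quality behavior is the full scan.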


Byzantine errors are not treated at all by the group; we think handling them is either impossible or not the standard's business. 
* If a Byzantine process performs random actions (such as sending unexpected messages at random to processes that are not expecting them), all bets are off: the matching will be wrong, and MPI cannot tell. This is known to be a largely intractable problem in theory, except in very limited cases (under assumptions such as incorruptible code memory and self-stabilization). 
* If we restrict ourselves to message content corruption, CRC/checksumming can be done by the MPI implementation, and providing this feature requires no API changes (the implementation computes checksums silently and retransmits internally). Some applications may tolerate a degraded signal, but determining the "quality" of the signal is application dependent, and such an application can already be written in MPI-2 (computing user-level checksums, sending the buffers over unprotected MPI, and checking on receipt). 
* Memory corruption cannot be handled by MPI; it is the application's responsibility to detect and resolve it. MPI is in the business of exchanging messages, not fixing broken memory. 
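The user-level checksum approach mentioned above can be sketched in plain C. The Adler-32-style algorithm and function names are an illustrative choice (any CRC would do); in a real MPI-2 application the sender would transmit the checksum alongside the payload, e.g. appended to the buffer passed to MPI_Send, and the receiver would verify after MPI_Recv:

```c
#include <stddef.h>
#include <stdint.h>

/* Adler-32-style checksum over a message payload (illustrative choice,
 * not a standard-mandated algorithm). */
uint32_t msg_checksum(const uint8_t *buf, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

/* Receiver-side verification: returns 1 if the payload matches the
 * checksum the sender transmitted, 0 if it was corrupted in flight. */
int msg_verify(const uint8_t *buf, size_t len, uint32_t sent_sum)
{
    return msg_checksum(buf, len) == sent_sum;
}
```

This is exactly the kind of protection an application can layer on top of unprotected MPI without any change to the MPI API.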


===

So, to answer in short: the exact "moment" at which a transient error stops being considered transient and becomes a fail-stop error is intentionally left undefined. The implementation decides, based on its knowledge of the machine's hardware features and the availability (or not) of a quality failure detector. We only mandate that it decide "at some point in the future", so that no operation deadlocks forever; again, the exact time allowed before deciding is intentionally unspecified, as it will probably change with every hardware generation. 

Regards,
Aurelien 

On 6 Dec. 2012, at 03:48, "Supalov, Alexander" <alexander.supalov at intel.com> wrote:

> Hi guys,
>  
> What is considered a process failure? That is, how long after the first unsuccessful communication attempt with a process is it considered dead? I am asking because we see some big machines having transient link failures that may appear as node failures for a while. Will user-controllable timeouts be sufficient to define a process failure?
>  
> Best regards.
>  
> Alexander
>  
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Howard Pritchard
> Sent: Thursday, December 06, 2012 2:55 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: [Mpi3-ft] list of opaque objects and other MPI entities on the list
>  
> Hi Folks,
>  
> Here's the list of mpi opaque objects and a few additional constructs for
> consideration of states in the presence of process failures:
>  
> communicators - Aurélien, Wesley
> groups -  Rich G.
> data types - Sayantan
> RMA windows - Howard
> files (file handles) - Darius B.  
> info object  - Darius
> error handler - Darius
> message obj. - David S.
> request - Manjo
> status - Manjo
> op  - Darius
> port (mpi-2 dynamic) - David S.
> user buffers attached to MPI for bsends - Sayantan
>  
> We need to define the lifecycle of each object in the case of no process
> failures, and in the case when one or more process failures occur while
> the object exists.
>  
> Intel GmbH
> Dornacher Strasse 1
> 85622 Feldkirchen/Muenchen, Deutschland
> Sitz der Gesellschaft: Feldkirchen bei Muenchen
> Geschaeftsfuehrer: Christian Lamprechter, Hannes Schwaderer, Douglas Lusk
> Registergericht: Muenchen HRB 47456
> Ust.-IdNr./VAT Registration No.: DE129385895
> Citibank Frankfurt a.M. (BLZ 502 109 00) 600119052
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375
