[Mpi3-ft] list of opaque objects and other MPI entities on the list

Supalov, Alexander alexander.supalov at intel.com
Thu Dec 6 02:48:22 CST 2012


Hi guys,

What is considered a process failure? That is, how long after the first unsuccessful attempt to communicate with a process is that process considered dead? I ask because we see some big machines with transient link failures that may look like node failures for a while. Will user-controllable timeouts be sufficient to define a process failure?
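
To make the timeout question concrete, here is a minimal sketch in C of what a user-controllable timeout could look like. It is only an illustration of the idea, not a proposal and not any MPI API: the names peer_monitor, peer_monitor_init, peer_monitor_record and peer_monitor_is_dead are hypothetical, and timestamps are assumed to be non-negative seconds from some monotonic clock. A peer is declared dead only if every attempt has failed for at least the timeout since the first unsuccessful attempt; any success in between (a transient link failure that heals) resets the clock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-peer record; not part of MPI or of any proposal. */
typedef struct {
    double first_failure_time; /* time of first unsuccessful attempt, < 0 if none */
    double timeout;            /* user-controllable grace period, in seconds */
} peer_monitor;

/* Start with no suspicion and a user-chosen timeout. */
void peer_monitor_init(peer_monitor *m, double timeout_seconds)
{
    assert(timeout_seconds > 0.0);
    m->first_failure_time = -1.0;
    m->timeout = timeout_seconds;
}

/* Record the outcome of one communication attempt made at time `now`. */
void peer_monitor_record(peer_monitor *m, bool success, double now)
{
    if (success) {
        m->first_failure_time = -1.0;   /* link recovered: clear suspicion */
    } else if (m->first_failure_time < 0.0) {
        m->first_failure_time = now;    /* first failure: start the clock */
    }
    /* Later failures leave the original start time untouched. */
}

/* The peer counts as dead once failures have persisted for the timeout. */
bool peer_monitor_is_dead(const peer_monitor *m, double now)
{
    return m->first_failure_time >= 0.0 &&
           (now - m->first_failure_time) >= m->timeout;
}
```

With a 5 second timeout, a failure at t=1 followed by a success at t=4 never marks the peer dead, while failures starting at t=11 with no success in between mark it dead from t=16 onwards. Whether such a purely local rule is sufficient is exactly the open question.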

Best regards.

Alexander

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Howard Pritchard
Sent: Thursday, December 06, 2012 2:55 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: [Mpi3-ft] list of opaque objects and other MPI entities on the list

Hi Folks,

Here's the list of MPI opaque objects and a few additional constructs whose
states we need to consider in the presence of process failures:

communicators - Aourelian, Wesley
groups - Rich G.
data types - Sayantan
RMA windows - Howard
files (file handles) - Darius B.
info object - Darius
error handler - Darius
message obj. - David S.
request - Manjo
status - Manjo
op - Darius
port (MPI-2 dynamic) - David S.
user buffers attached to MPI for bsends - Sayantan

We need to define the lifecycle of each object in the case of no process
failures, and in the case when one or more process failures occur while the
object exists.
