[Mpi3-ft] MPI 3.0 FT WG: Error notification
Narasimhan, Kannan
kannan.narasimhan at hp.com
Fri Aug 1 10:53:48 CDT 2008
Erez,
Is the error in a virtual connection (and communicator state) in the context of a rank? Maybe my question is more applicable to collective operations -- but if VC4 (Rank0->Rank4 connection) is flagged with an error state during a Send call from Rank0 to Rank4, are you suggesting that this VC state be made consistent across ranks (i.e., synchronized)?
Thanx!
Kannan
-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Erez Haba
Sent: Wednesday, July 30, 2008 11:27 AM
To: MPI 3.0 Fault Tolerance
Subject: [Mpi3-ft] FW: MPI 3.0 FT WG: Error notification
FYI,
I did not post this to the wiki pages yet.
-----Original Message-----
From: Erez Haba
Sent: Tuesday, July 29, 2008 11:20 AM
To: 'Narasimhan, Kannan'
Subject: RE: MPI 3.0 FT WG: Error notification
Hi Kannan,
I didn't plan to define the error semantics for the collectives (although we should). At first I think I'd focus on defining the error semantics as they are today: when an error should be returned (and when the error handler should be called). The purpose is to better define error handling for libraries, to allow them to recover on their own turf.
The idea is to define the error semantics in a way that is compatible with today's programs.
Here are the few rules I have in mind:
1. errors are returned per call site, and associated with the call context
2. introduce the concept of a virtual connection (VC)
3. a VC can move to an error state once it is unable to communicate with its peer
4. a VC moving to an error state causes all the communicators associated with it to move to an error state
5. a VC error state affects only calls associated with that VC
6. a communicator error state affects only receives with MPI_ANY_SOURCE
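The rules can be sketched as a small toy model. This is plain Python, not MPI: the `Communicator` and `Process` classes, the `fail_vc` and `call_ok` names, and the `ANY_SOURCE` constant are all made up for illustration, and the model captures only one reading of the rules as seen locally from rank 0.

```python
ANY_SOURCE = -1  # hypothetical stand-in for MPI_ANY_SOURCE in this toy model


class Communicator:
    """Toy communicator: a name plus the set of peer ranks it can reach."""

    def __init__(self, name, peers):
        self.name = name
        self.peers = set(peers)
        self.in_error = False


class Process:
    """Local view of one rank: its VCs and the communicators that use them."""

    def __init__(self, communicators):
        self.communicators = list(communicators)
        self.failed_vcs = set()  # peers whose VC is in the error state

    def fail_vc(self, peer):
        # The VC enters the error state, and so does every communicator
        # associated with it.
        self.failed_vcs.add(peer)
        for comm in self.communicators:
            if peer in comm.peers:
                comm.in_error = True

    def call_ok(self, comm, peer):
        """Return True if a send/receive naming `peer` on `comm` would succeed."""
        if peer == ANY_SOURCE:
            # Communicator error state affects only wildcard receives.
            return not comm.in_error
        # A VC error affects only calls that use that VC.
        return peer not in self.failed_vcs


world = Communicator("world", peers=[1, 2, 3, 4])
rank0 = Process([world])
rank0.fail_vc(4)

print(rank0.call_ok(world, 3))           # True: the VC to rank 3 is healthy
print(rank0.call_ok(world, 4))           # False: the VC to rank 4 is in error
print(rank0.call_ok(world, ANY_SOURCE))  # False: the communicator is in error
```

The model keeps the VC state and the communicator state separate on purpose, so that a call naming a healthy peer is never affected by a failure elsewhere.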
Examples,
Example 1: (running on rank 0)
MPI_Irecv(source=4, &request4)
//
// error in the communication with rank4 detected, the communicator moves to
// an error state, but the send call returns success
//
MPI_Send(dest=3)
//
// the error is associated with request4 and thus returned in the MPI_Wait call
//
MPI_Wait(request4)
Example 2: (running on rank 0)
MPI_Irecv(source=any, &request)
//
// error in the communication with rank4 detected, the communicator moves to
// an error state, but the send call returns success
//
MPI_Send(dest=3)
//
// The communicator is set to the error state; any receives with MPI_ANY_SOURCE
// are set to the error state and thus return an error in their wait call
//
MPI_Wait(request)
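Examples 1 and 2 both rely on the error being attached to the pending request and surfacing only when that request is waited on. A minimal sketch of that bookkeeping, again as a toy Python model rather than MPI: the `Rank` and `Request` classes and their methods are hypothetical, the model assumes a single communicator that contains rank 4, and `wait` only reports the recorded status instead of blocking.

```python
ANY_SOURCE = -1  # hypothetical stand-in for MPI_ANY_SOURCE


class Request:
    """Toy nonblocking-receive request that remembers its own error."""

    def __init__(self, source):
        self.source = source
        self.failed = False


class Rank:
    def __init__(self):
        self.pending = []
        self.failed_peers = set()

    def irecv(self, source):
        req = Request(source)
        self.pending.append(req)
        return req

    def detect_failure(self, peer):
        # The error is recorded on the requests it affects, not raised here.
        self.failed_peers.add(peer)
        for req in self.pending:
            if req.source in (peer, ANY_SOURCE):
                req.failed = True

    def send(self, dest):
        # A send fails only if its own destination is unreachable.
        return "error" if dest in self.failed_peers else "success"

    def wait(self, req):
        # Toy behaviour: report the recorded status; a real wait would block.
        self.pending.remove(req)
        return "error" if req.failed else "success"


rank0 = Rank()
request4 = rank0.irecv(source=4)
request_any = rank0.irecv(source=ANY_SOURCE)
request2 = rank0.irecv(source=2)
rank0.detect_failure(4)

print(rank0.send(dest=3))       # success: unrelated to rank 4
print(rank0.wait(request4))     # error: the request named rank 4
print(rank0.wait(request_any))  # error: wildcard receive on the failed communicator
print(rank0.wait(request2))     # success: this request named a healthy peer
```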
Example 3: (running on rank 0)
//
// error in the communication with rank4 detected, the communicator moves to
// an error state, but the send call returns success
//
MPI_Send(dest=3)
//
// The communicator is set to the error state; any receives with MPI_ANY_SOURCE
// are set to the error state and thus return an error.
//
MPI_Recv(source=any)
//
// VC 4 is set to the error state, any usage of that VC will return an error.
//
MPI_Recv(source=4)
MPI_Send(dest=4)
Here, continue with examples using multiple communicators.
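As a rough sketch of how the multi-communicator case might play out under the rules listed earlier, the toy Python below tracks which communicators a failed VC drags into the error state. The communicator names (`comm_a`, `comm_b`, `comm_c`), their memberships, and the helper functions are all hypothetical, and this is only one possible reading of the rules, not MPI code.

```python
ANY_SOURCE = -1  # hypothetical stand-in for MPI_ANY_SOURCE

# Hypothetical layout seen from rank 0: which peers each communicator reaches.
communicators = {
    "comm_a": {1, 2, 3, 4},
    "comm_b": {3, 4},
    "comm_c": {1, 2},
}


def comms_in_error(communicators, failed_peers):
    """Communicators associated with at least one failed VC."""
    return {name for name, peers in communicators.items() if peers & failed_peers}


def call_fails(comm, peer, communicators, failed_peers):
    """Would a call on `comm` naming `peer` report an error in this model?"""
    if peer == ANY_SOURCE:
        return comm in comms_in_error(communicators, failed_peers)
    return peer in failed_peers


failed = {4}  # the VC to rank 4 has entered the error state

print(sorted(comms_in_error(communicators, failed)))             # ['comm_a', 'comm_b']
print(call_fails("comm_b", 3, communicators, failed))            # False
print(call_fails("comm_b", ANY_SOURCE, communicators, failed))   # True
print(call_fails("comm_c", ANY_SOURCE, communicators, failed))   # False
```

In this reading, a wildcard receive on `comm_c` is unaffected because rank 4 is not a member, while a receive from rank 3 on `comm_b` still succeeds even though `comm_b` itself is in the error state.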
Make sense?
Let me know what you think. I'll post that (and more) to the FT wiki pages.
Thanks,
.Erez
-----Original Message-----
From: Narasimhan, Kannan [mailto:kannan.narasimhan at hp.com]
Sent: Tuesday, July 29, 2008 4:04 AM
To: Erez Haba
Cc: Narasimhan, Kannan
Subject: MPI 3.0 FT WG: Error notification
Hi Erez,
Following up on our discussion in the MPI 3.0 FT Working group meeting --- Did U get to start on the error semantics/notification for a sample MPI collective call? Let me know if you want to bounce some options/ideas on this topic -- we can hash it out over email/phone call before this Friday's WG confcall.
Thanx!
Kannan