[Mpi3-ft] FW: MPI 3.0 FT WG: Error notification

Erez Haba erezh at MICROSOFT.com
Wed Jul 30 11:26:30 CDT 2008


FYI,
I did not post this to the wiki pages yet.

-----Original Message-----
From: Erez Haba
Sent: Tuesday, July 29, 2008 11:20 AM
To: 'Narasimhan, Kannan'
Subject: RE: MPI 3.0 FT WG: Error notification

Hi Kannan,

I didn't plan to define the error semantics for the collectives (although we should). At first I think I'd focus on defining the error semantics as they are today, that is, defining when an error should be returned (and when the error handler should be called). The purpose is to better define error handling for libraries, to allow them to recover on their own turf.


The idea is to define the error semantics in a way that is compatible with today's programs.

Here are the few rules I have in mind:

1. errors are returned per call site, and are associated with the call context
2. introduce the concept of a virtual connection (VC)
3. a VC moves to an error state once it is unable to communicate with its peer
4. a VC moving to an error state causes all the communicators associated with it to move to an error state
5. a VC error state affects only calls associated with that VC
6. a communicator error state affects only receives with MPI_ANY_SOURCE
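
As a rough illustration of how these rules interact, here is a minimal sketch in C of the state they imply. It is only a model under stated assumptions, not part of any MPI implementation or proposal text: the `ft_*` names, the fixed-size tables, and the use of a global process index to identify a peer are all hypothetical, and `FT_ANY_SOURCE` merely stands in for MPI_ANY_SOURCE. Pending requests are not modeled; the status function answers what a call naming a given peer would complete with.

```c
#include <assert.h>
#include <stdbool.h>

#define FT_MAX_PEERS  8     /* hypothetical limits, chosen for the sketch */
#define FT_MAX_COMMS  4
#define FT_ANY_SOURCE (-1)  /* stand-in for MPI_ANY_SOURCE */

enum ft_status { FT_OK = 0, FT_ERR = 1 };

/* Hypothetical model state: one virtual connection (VC) per peer process. */
typedef struct {
    bool vc_failed[FT_MAX_PEERS];                  /* VC error state         */
    bool comm_uses_vc[FT_MAX_COMMS][FT_MAX_PEERS]; /* VC/communicator link   */
    bool comm_in_error[FT_MAX_COMMS];              /* communicator state     */
} ft_model;

/* Start with no associations and nothing in the error state. */
void ft_init(ft_model *m)
{
    *m = (ft_model){0};
}

/* Associate the VC to `peer` with communicator `comm`. */
void ft_attach(ft_model *m, int comm, int peer)
{
    assert(comm >= 0 && comm < FT_MAX_COMMS);
    assert(peer >= 0 && peer < FT_MAX_PEERS);
    m->comm_uses_vc[comm][peer] = true;
}

/* The VC to `peer` can no longer communicate: mark it failed and move
 * every communicator associated with it to the error state. */
void ft_vc_fail(ft_model *m, int peer)
{
    assert(peer >= 0 && peer < FT_MAX_PEERS);
    m->vc_failed[peer] = true;
    for (int c = 0; c < FT_MAX_COMMS; c++) {
        if (m->comm_uses_vc[c][peer]) {
            m->comm_in_error[c] = true;
        }
    }
}

/* Status that a call on `comm` naming `peer` would complete with.
 * A named peer is affected only by its own VC state; a wildcard
 * receive is affected only by the communicator state. */
enum ft_status ft_call_status(const ft_model *m, int comm, int peer)
{
    assert(comm >= 0 && comm < FT_MAX_COMMS);
    if (peer == FT_ANY_SOURCE) {
        return m->comm_in_error[comm] ? FT_ERR : FT_OK;
    }
    assert(peer >= 0 && peer < FT_MAX_PEERS);
    return m->vc_failed[peer] ? FT_ERR : FT_OK;
}
```

In this model, if communicator 0 is attached to peers 3 and 4 and `ft_vc_fail(&m, 4)` is called, a call naming peer 3 still reports `FT_OK`, while a call naming peer 4 or `FT_ANY_SOURCE` on communicator 0 reports `FT_ERR`. A second communicator that is not attached to peer 4 stays out of the error state.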

Examples,

Example 1: (running on rank 0)

MPI_Irecv(source=4, &request4)

//
// an error in the communication with rank 4 is detected; the communicator
// moves to an error state, but the send to rank 3 returns success
//
MPI_Send(dest=3)

//
// the error is associated with request4 and thus returned in the MPI_Wait call
//
MPI_Wait(request4)


Example 2: (running on rank 0)

MPI_Irecv(source=any, &request)

//
// an error in the communication with rank 4 is detected; the communicator
// moves to an error state, but the send to rank 3 returns success
//
MPI_Send(dest=3)

//
// The communicator is set to the error state; any receives with MPI_ANY_SOURCE
// are set to the error state and thus return an error in their wait call
//
MPI_Wait(request)


Example 3: (running on rank 0)

//
// an error in the communication with rank 4 is detected; the communicator
// moves to an error state, but the send to rank 3 returns success
//
MPI_Send(dest=3)

//
// The communicator is set to the error state; any receives with MPI_ANY_SOURCE
// are set to the error state and thus return an error.
//
MPI_Recv(source=any)

//
// VC 4 is set to the error state; any use of that VC will return an error.
//
MPI_Recv(source=4)
MPI_Send(dest=4)


Here, continue with examples using multiple communicators.

Make sense?
Let me know what you think; I'll post this (and more) to the FT wiki pages.

Thanks,
.Erez

-----Original Message-----
From: Narasimhan, Kannan [mailto:kannan.narasimhan at hp.com]
Sent: Tuesday, July 29, 2008 4:04 AM
To: Erez Haba
Cc: Narasimhan, Kannan
Subject: MPI 3.0 FT WG: Error notification

Hi Erez,

Following up on our discussion in the MPI 3.0 FT Working group meeting --- Did U get to start on the error semantics/notification for a sample MPI collective call? Let me know if you want to bounce some options/ideas on this topic -- we can hash it out over email/phone call before this Friday's WG confcall.

Thanx!
Kannan



