[Mpi3-ft] MPI 3.0 FT WG: Error notification

Erez Haba erezh at MICROSOFT.com
Fri Aug 8 02:29:40 CDT 2008


I put the two suggestions from this paper on the FT wiki pages, at:

https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/error_report_rules
https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/comm_integrity

linked from the FT page
https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/FaultToleranceWikiPage

thanks,
.Erez

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Erez Haba
Sent: Thursday, August 07, 2008 6:51 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] MPI 3.0 FT WG: Error notification

Hi all,

Attached is an update on the suggestion below. This is still a very early draft in which I try to capture some of the ideas and discussions we're having. I expect to add more content to this document as we understand the problem better. I think this is still very basic for FT, and we still have a long way to go.


Thanks,
.Erez


-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Erez Haba
Sent: Wednesday, July 30, 2008 9:27 AM
To: MPI 3.0 Fault Tolerance
Subject: [Mpi3-ft] FW: MPI 3.0 FT WG: Error notification

FYI,
I did not post this to the wiki pages yet.

-----Original Message-----
From: Erez Haba
Sent: Tuesday, July 29, 2008 11:20 AM
To: 'Narasimhan, Kannan'
Subject: RE: MPI 3.0 FT WG: Error notification

Hi Kannan,

I didn't plan to define the error semantics for the collectives (although we should). At first, I think I'd focus on defining the error semantics as they are today: when an error should be returned (and when the error handler should be called). The purpose is to better define error handling for libraries, to allow them to recover on their own turf.


The idea is to define the error semantics in a way that is compatible with today's programs.

Here are the few rules I have in mind,

1. Errors are returned per call site and are associated with the call context.
2. Introduce the concept of a virtual connection (VC).
3. A VC can move to an error state once it is unable to communicate with its peer.
4. A VC moving to an error state causes all the communicators associated with it to move to an error state.
5. A VC error state affects only calls associated with that VC.
6. A communicator error state affects only receives with MPI_ANY_SOURCE.

Examples,

Example 1: (running on rank 0)

MPI_Irecv(source=4, &request4)

//
// error in the communication with rank4 detected, the communicator moves to
// an error state, but the send call returns success
//
MPI_Send(dest=3)

//
// the error is associated with request4 and thus returned in the MPI_Wait call
//
MPI_Wait(request4)


Example 2: (running on rank 0)

MPI_Irecv(source=any, &request)

//
// error in the communication with rank4 detected, the communicator moves to
// an error state, but the send call returns success
//
MPI_Send(dest=3)

//
// The communicator is set to the error state; any receives with MPI_ANY_SOURCE
// are set to the error state and thus return an error in their wait call
//
MPI_Wait(request)


Example 3: (running on rank 0)

//
// error in the communication with rank4 detected, the communicator moves to
// an error state, but the send call returns success
//
MPI_Send(dest=3)

//
// The communicator is set to the error state; any receives with MPI_ANY_SOURCE
// are set to the error state and thus return an error.
//
MPI_Recv(source=any)

//
// VC 4 is set to the error state, any usage of that VC will return an error.
//
MPI_Recv(source=4)
MPI_Send(dest=4)
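
The rules and the three examples above can be sketched as a small state model. This is only an illustrative Python sketch, not MPI code and not part of the proposal: the names `Process`, `Communicator`, `vc_failed`, `irecv`, `wait` and the `ANY_SOURCE` constant are hypothetical, the 8-rank communicator is an assumption, and "success" only means that the model reports no error for that call (a real receive would still block until a message arrives).

```python
ANY_SOURCE = -1  # hypothetical stand-in for MPI_ANY_SOURCE in this model


class Communicator:
    """Toy communicator: the set of peer ranks it spans plus an error flag."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.in_error = False


class Process:
    """Toy view of one rank: its communicators and its failed VCs."""

    def __init__(self, comms):
        self.comms = list(comms)
        self.failed_vcs = set()  # peers whose virtual connection is in error

    def vc_failed(self, peer):
        # The VC to `peer` moves to the error state ...
        self.failed_vcs.add(peer)
        # ... and so does every communicator associated with that VC.
        for comm in self.comms:
            if peer in comm.peers:
                comm.in_error = True

    def send(self, comm, dest):
        # A VC error affects only calls that use that VC.
        return "error" if dest in self.failed_vcs else "success"

    def recv(self, comm, source):
        if source == ANY_SOURCE:
            # A communicator error affects only wildcard receives.
            return "error" if comm.in_error else "success"
        return "error" if source in self.failed_vcs else "success"

    def irecv(self, comm, source):
        # The request remembers its call context (communicator and source).
        return (comm, source)

    def wait(self, request):
        # The error is reported at the call site that completes the request.
        comm, source = request
        return self.recv(comm, source)


# Replay of Example 1 from rank 0's point of view.
world = Communicator(range(8))
rank0 = Process([world])
request4 = rank0.irecv(world, 4)
rank0.vc_failed(4)             # communication with rank 4 breaks
print(rank0.send(world, 3))    # success
print(rank0.wait(request4))    # error
```

In this model a receive from a specific healthy rank (for example rank 3) still reports no error even though the communicator is in the error state, which is how the sketch reads the last two rules together.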


Here, continue with examples using multiple communicators.

Make sense?
Let me know what you think; I'll post that (and more) to the FT wiki pages.

Thanks,
.Erez

-----Original Message-----
From: Narasimhan, Kannan [mailto:kannan.narasimhan at hp.com]
Sent: Tuesday, July 29, 2008 4:04 AM
To: Erez Haba
Cc: Narasimhan, Kannan
Subject: MPI 3.0 FT WG: Error notification

Hi Erez,

Following up on our discussion in the MPI 3.0 FT Working group meeting --- Did you get to start on the error semantics/notification for a sample MPI collective call? Let me know if you want to bounce some options/ideas on this topic -- we can hash it out over email/phone call before this Friday's WG confcall.

Thanx!
Kannan

_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft




