[Mpi3-ft] Asynchronous error handling
rlgraham at ornl.gov
Mon Jun 2 18:09:02 CDT 2008
I like your suggestion. How about we adopt what the CIFTS project is
using as our model(s)? These are methods already in use in other contexts, and
they have been proven to be useful.
It seems there are several items that would need to be addressed, such as:
- Is reliable delivery guaranteed?
- Is notification unique, i.e., can we have an error code returned from
and a callback also be generated, both for the (subscribed) error
- What happens with errors that impact correctness, but that have not been
Let's schedule a long chunk of time (about 4 hours) for our working group
to meet and talk at our next meeting. We will follow up on what we discuss
this coming
On 5/26/08 11:51 AM, "Greg Bronevetsky" <bronevetsky1 at llnl.gov> wrote:
> > On the telecon today we agreed to have our next telecon on 6/6 focus on
> > how we may handle asynchronous error notification within MPI. The working
> > assumption is that we will still have return error codes, but also make use
> > of asynchronous notification. We need to
> > - Clearly define the boundary between these two different error
> >   notification mechanisms, i.e., when we use one and when the other
> > - Define the precise mechanism for asynchronous error notification
> > This e-mail is intended to jump start discussion in preparation for the next
> I'll throw something out here. I wasn't around for the initial
> discussions, so some of this may fly in the face of something that
> people have already decided is obviously wrong. Either way, it's a
> start. You may commence with the tomato throwing.
> The idea for this proposal is a publish-subscribe model where the
> spec defines the default publish-subscribe relations but allows MPI
> implementations to define new events and defaults, and allows
> applications to cancel/add new event subscriptions. I like this model
> mostly because it is the one being used by the CIFTS project, which I
> suspect will have an important role to play in MPI application fault
> tolerance. Since we won't be able to list all the possible errors
> that may occur, we'll need to define the possible error types and
> describe the error notification properties of these broad
> types, rather than individual events. Implementations may then put
> each real error into any type that is deemed appropriate.
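The subscription model described above could be sketched roughly as follows. This is only an illustration, not an MPI or CIFTS API: the class `SubscriptionTable`, its methods, and the two error type names are hypothetical, and the proposal leaves the real list of error types open.

```python
from collections import defaultdict

# Hypothetical error types; the proposal does not fix the actual list.
PROCESS_FAILURE = "process_failure"
STATE_CORRUPTION = "state_corruption"


class SubscriptionTable:
    """Tracks which ranks are subscribed to which error type."""

    def __init__(self, defaults):
        # defaults: error type -> ranks subscribed by default
        # (standing in for the relations the spec would define)
        self._subs = defaultdict(set)
        for error_type, ranks in defaults.items():
            self._subs[error_type].update(ranks)

    def subscribe(self, rank, error_type):
        # Application adds a new subscription.
        self._subs[error_type].add(rank)

    def cancel(self, rank, error_type):
        # Application cancels a subscription (no-op if absent).
        self._subs[error_type].discard(rank)

    def subscribers(self, error_type):
        return sorted(self._subs[error_type])


table = SubscriptionTable({PROCESS_FAILURE: [0, 1]})
table.subscribe(2, PROCESS_FAILURE)  # add a subscription
table.cancel(0, PROCESS_FAILURE)     # cancel a default subscription
```

Here subscriptions are keyed by error type rather than by individual event, matching the idea that implementations sort each real error into one of the broad types.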
> Every error will have a defined detection set, which is the set of
> processes that by default subscribe to being notified of this event.
> For example, if a given process fails, any process that tries to
> receive a message from this process is definitely within its
> detection set. However, if the failed process is a receiver in a
> broadcast, we may or may not choose to include the other broadcast
> receivers in the detection set (probably not). Each process is
> subscribed to all error events that happen in the process, as long as
> the errors don't cause the process itself to fail.
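One way to picture the default detection set for a process failure is as a simple filter over posted receives. The sketch below assumes a bookkeeping structure (`pending_receives`) that maps each rank to the sources it has posted named receives for. That structure and the function name `detection_set` are illustrative assumptions, not part of any MPI implementation.

```python
def detection_set(failed_rank, pending_receives, alive_ranks):
    """Return the ranks notified by default when failed_rank fails.

    pending_receives: rank -> set of source ranks it has posted named
    receives for (hypothetical bookkeeping, assumed for this sketch).
    """
    return sorted(
        rank
        for rank in alive_ranks
        if rank != failed_rank
        and failed_rank in pending_receives.get(rank, set())
    )


# Ranks 0 and 2 are waiting on rank 3; rank 1 is not.
pending = {0: {3}, 1: {2}, 2: {3, 1}}
waiting_on_3 = detection_set(3, pending, alive_ranks=[0, 1, 2])
```

The broadcast case mentioned above is deliberately left out, since whether other broadcast receivers belong in the set is still undecided.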
> For each failure event type we will define the latest point in time
> when each process within the event's detection set will be notified.
> For example, if process p fails, all other processes must be notified
> no later than their next receive call that must receive from p (i.e.
> receives with MPI_ANY_SOURCE don't qualify). For errors that cause
> process state to be corrupted, we may want to inform other processes
> no later than the first point in time when their state becomes
> dependent on the corruption. The MPI implementation may deliver the
> event at this latest point using the synchronous error API or at any
> earlier point in time using the asynchronous API.
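The "latest point" rule for a failed process p might look like the following toy model. It assumes a hypothetical `Endpoint` class, uses `ANY_SOURCE` as a stand-in for MPI_ANY_SOURCE, and uses an exception as a stand-in for a synchronous error code. None of these names come from an existing library.

```python
ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE


class ProcessFailedError(Exception):
    """Stand-in for a synchronous error code returned by a receive."""


class Endpoint:
    def __init__(self):
        self.known_failed = set()  # failures the runtime has detected
        self.notified = set()      # failures already reported to the app

    def notify_early(self, rank):
        # Asynchronous path: report at any point before the deadline.
        self.notified.add(rank)

    def recv(self, source):
        # Deadline: a receive that names a failed source must report it.
        if source != ANY_SOURCE and source in self.known_failed:
            self.notified.add(source)
            raise ProcessFailedError(f"rank {source} has failed")
        return "payload"  # placeholder for a delivered message


ep = Endpoint()
ep.known_failed.add(3)
wildcard_result = ep.recv(ANY_SOURCE)  # wildcard does not force a report
try:
    ep.recv(3)
    reported = False
except ProcessFailedError:
    reported = True
```

The wildcard receive completes without reporting the failure, while the receive that names rank 3 is the latest point at which the failure can still be reported.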
> The synchronous API will be a direct extension of the current error
> reporting API. The asynchronous API will take the form of an events
> queue that may be explicitly polled by the application to see if
> there are any pending events. Applications will also be able to
> register a callback function that will automatically be called by MPI
> whenever a new event arrives. Furthermore, processes may subscribe to
> events emanating from other processes as they see fit. For example,
> the application may designate one or more processes as error monitors
> and these processes would register themselves to listen to all other
> processes and take appropriate corrective measures if something goes wrong.
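A minimal sketch of the asynchronous side, with both explicit polling and a registered callback, is given below. `EventQueue`, `deliver`, `poll`, and `register_callback` are hypothetical names chosen for illustration. Events are modeled as plain dictionaries.

```python
from collections import deque


class EventQueue:
    def __init__(self):
        self._pending = deque()
        self._callback = None

    def register_callback(self, fn):
        # fn is called with each event that arrives from now on.
        self._callback = fn

    def deliver(self, event):
        # Called by the (hypothetical) runtime when an event arrives.
        if self._callback is not None:
            self._callback(event)
        else:
            self._pending.append(event)

    def poll(self):
        # Return the oldest pending event, or None if there is none.
        return self._pending.popleft() if self._pending else None


queue = EventQueue()
queue.deliver({"type": "process_failure", "rank": 3})
polled = queue.poll()  # explicit polling picks up the pending event

seen = []
queue.register_callback(seen.append)
queue.deliver({"type": "process_failure", "rank": 5})  # goes to callback
```

This sketch delivers each event exactly once, either to the callback or to the queue. Whether an event should be reported through more than one path is an open design question, and a monitor process would simply be a rank that registers a callback for events from all other ranks.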
> Greg Bronevetsky
> Post-Doctoral Researcher
> 1028 Building 451
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky1 at llnl.gov