[Mpi3-ft] Transactional Messages
Richard Graham
rlgraham at ornl.gov
Fri Feb 22 21:51:03 CST 2008
Actually, it is most likely that MPI implementations that don't try to deal
with dropped messages can't even detect that such events have occurred. I
would expect such implementations to detect a problem with failed
communications only if the low-level library they use to implement the
communications, such as an OS-bypass library, returns an error when they
try to post a communication, or if the run-time used by MPI detects a
failed process and propagates this information to the rest of the
processes in the application.
The ONLY layer that can handle any sort of recovery from a live
communications failure (i.e. without some sort of checkpoint/restart, with
or without message logging) is the MPI implementation itself. The
application reposting a send can't get around the lost data, because of
the MPI message ordering requirements, unless the implementation relies
entirely on another library to satisfy the MPI ordering requirements (i.e.
it does not generate some sort of message sequence number) and the lost
message is the last one that was sent. MPI is not allowed to attempt any
matching if there is a gap in the sequence numbers.
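
To make the ordering argument concrete, here is a toy model of it. This is
a sketch only, not MPI code: the names Receiver, arrive and demo are
hypothetical, and it assumes an implementation that stamps each message
from a sender with a sequence number and that gives a reposted send a fresh
number rather than reusing the lost one.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Toy receiver that matches one sender's messages strictly in order."""
    next_seq: int = 0                              # next number eligible for matching
    pending: dict = field(default_factory=dict)    # arrived, but blocked by a gap
    delivered: list = field(default_factory=list)  # matched, in matching order

    def arrive(self, seq, payload):
        self.pending[seq] = payload
        # Matching proceeds only while there is no gap in the sequence.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


def demo():
    rx = Receiver()
    rx.arrive(0, "a")
    # The message with sequence number 1 ("b") is dropped in transit.
    rx.arrive(2, "c")
    stalled = list(rx.delivered)

    # Assumption: an application-level repost is a new send, so it gets a
    # new sequence number (3). The gap at 1 is still there.
    rx.arrive(3, "b")
    after_repost = list(rx.delivered)

    # Only a retransmission of the original number 1, which the
    # implementation alone can do, closes the gap.
    rx.arrive(1, "b")
    after_retransmit = list(rx.delivered)
    return stalled, after_repost, after_retransmit


if __name__ == "__main__":
    print(demo())
```

In this model the application-level repost never unblocks matching; only
the retransmission of the original sequence number does, after which the
reposted copy is delivered as an extra message.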
Rich
On 2/22/08 10:22 PM, "Greg Bronevetsky" <bronevetsky1 at llnl.gov> wrote:
>
>
>> >I've read the Transactional Messages proposal and I am a little confused
>> >here. Is there a reason why we believe that message faults themselves
>> >should be handled by the application layer instead of the MPI library?
>> >Using the latter model allows one to reduce the error conditions
>> >percolated up to the user to revolve around loss of the actual
>> >connection to a process (or the actual process itself).
>
> Actually, one aspect of the proposal is that I made sure not to
> define message faults at a low level. They may be any low-level
> problems that the implementation cannot efficiently deal with on its
> own and that are best represented to the application as message
> drops. One example of this may be process failures. Although we will
> probably want to define a special notification mechanism to expose
> those failures to the application, we will also need a way to expose
> the failures of any communication that involves the process. Another
> example may be simplified MPI implementations that do not have
> facilities for resending messages because the probability of an error
> is rather low and performance is too important. In fact, applications
> that can tolerate message drops may explicitly choose those MPI
> implementations for the performance gains.
>
> Greg Bronevetsky
> Post-Doctoral Researcher
> 1028 Building 451
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky1 at llnl.gov
> _______________________________________________
> Mpi3-ft mailing list
> Mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>