<HTML>
<HEAD>
<TITLE>Re: [Mpi3-ft] Transactional Messages</TITLE>
</HEAD>
<BODY>
<FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'>Actually, it is most likely that MPI implementations that don&#8217;t try to deal with<BR>
dropped messages can&#8217;t even detect that such an event has occurred. I would<BR>
expect such implementations to be able to detect a problem with failed<BR>
communications only if the low-level library they use to implement the<BR>
communications, such as some OS-bypass library, returns an error when<BR>
trying to post some sort of communication, or if the run-time used by MPI<BR>
detects a failed process and propagates this information to the rest of the<BR>
processes in the application.<BR>
<BR>
The ONLY layer that can handle any sort of recovery from a live communications failure<BR>
(i.e., without some sort of checkpoint/restart, with or without message logging) is the<BR>
MPI implementation itself. The application reposting a send can&#8217;t get around the<BR>
lost data, because of the MPI message-ordering requirements, unless the implementation<BR>
relies entirely on another library to satisfy those ordering requirements (i.e., it does not<BR>
generate some sort of message sequence number) and the message lost is the last one<BR>
that was sent. MPI is not allowed to attempt any matching if there is a gap in the<BR>
sequence numbers.<BR>
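To make the ordering argument concrete, here is a toy Python sketch (not real MPI code; the OrderedReceiver class and its sequence-number scheme are invented for illustration) of how an implementation-internal sequence number blocks matching behind a gap, and why an application-level repost cannot fill that gap:<BR>

```python
# Toy model (NOT real MPI): a receiver that enforces in-order matching
# by sequence number, the way an MPI implementation might internally.
class OrderedReceiver:
    def __init__(self):
        self.next_seq = 0     # next sequence number eligible for matching
        self.delivered = []   # messages handed to the application
        self.pending = {}     # out-of-order arrivals, keyed by seq

    def arrive(self, seq, payload):
        """A message arrives off the wire; match only in sequence order."""
        self.pending[seq] = payload
        # Drain every message now contiguous with what was delivered.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

# Sender emits seq 0, 1, 2, but seq 1 is dropped by the network.
rx = OrderedReceiver()
rx.arrive(0, "a")
rx.arrive(2, "c")        # arrives, but cannot be matched: gap at seq 1
print(rx.delivered)      # ['a'] -- "c" is stuck behind the gap

# An application-level "repost" is a *new* send and gets a fresh seq (3),
# so it cannot fill the gap; only the layer that owns the sequence
# numbers can retransmit seq 1 and unblock matching.
rx.arrive(3, "b-reposted")
print(rx.delivered)      # still ['a']
rx.arrive(1, "b")        # implementation-level retransmission fills the gap
print(rx.delivered)      # ['a', 'b', 'c', 'b-reposted']
```

The sketch shows why recovery has to live in the layer that generates the sequence numbers: everything above it can only add new messages to the tail of the sequence, never repair the middle.<BR>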
<BR>
<BR>
Rich<BR>
<BR>
<BR>
On 2/22/08 10:22 PM, "Greg Bronevetsky" <bronevetsky1@llnl.gov> wrote:<BR>
<BR>
</SPAN></FONT><BLOCKQUOTE><FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'><BR>
<BR>
>I've read the Transactional Messages proposal and I am a little confused<BR>
>here. Is there a reason why we believe that message faults themselves<BR>
>should be handled by the application layer instead of the MPI library?<BR>
>Using the latter model allows one to reduce the error conditions<BR>
>percolated up to the user to revolve around loss of the actual<BR>
>connection to a process (or the actual process itself).<BR>
<BR>
Actually, one aspect of the proposal is that I made sure not to<BR>
define message faults at a low level. They may be any low-level<BR>
problems that the implementation cannot efficiently deal with on its<BR>
own and that are best represented to the application as message<BR>
drops. One example of this may be process failures. Although we will<BR>
probably want to define a special notification mechanism to expose<BR>
those failures to the application, we will also need a way to expose<BR>
the failures of any communication that involves the process. Another<BR>
example may be simplified MPI implementations that do not have<BR>
facilities for resending messages because the probability of an error<BR>
is rather low and performance is too important. In fact, applications<BR>
that can tolerate message drops may explicitly choose those MPI<BR>
implementations for the performance gains.<BR>
<BR>
Greg Bronevetsky<BR>
Post-Doctoral Researcher<BR>
1028 Building 451<BR>
Lawrence Livermore National Lab<BR>
(925) 424-5756<BR>
bronevetsky1@llnl.gov<BR>
_______________________________________________<BR>
Mpi3-ft mailing list<BR>
Mpi3-ft@lists.mpi-forum.org<BR>
<a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft</a><BR>
<BR>
</SPAN></FONT></BLOCKQUOTE><FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'><BR>
</SPAN></FONT>
</BODY>
</HTML>