<HTML>

<HEAD>

<TITLE>Re: [Mpi3-ft] Transactional Messages</TITLE>

</HEAD>

<BODY>

<FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'>Actually, it is most likely that MPI implementations that don’t try to deal with<BR>

 dropped messages can’t even detect that such events have occurred.  For<BR>

 such implementations I would expect them to be able to detect a problem with<BR>

 failed communications only if the low-level library they use to implement<BR>

 the communications, such as some OS bypass library, returns an error when<BR>

 trying to post some sort of communication, or if the run-time used by MPI<BR>

 detects a failed process and propagates this information to the rest of the<BR>

 processes in the application.<BR>

<BR>

The ONLY layer that can handle any sort of recovery from a live communications failure<BR>

 (i.e. without some sort of checkpoint/restart, with or without message logging) is the<BR>

 MPI implementation itself.  The application reposting a send can’t get around the<BR>

 lost data, because of the MPI message ordering requirements, unless the implementation<BR>

 totally relies on another library to satisfy the MPI ordering requirements (i.e. it does not<BR>

 generate some sort of message sequence number) and the message lost is the last one<BR>

 that was sent.  MPI is not allowed to attempt any matching if there is a gap in the<BR>

 sequence numbers.<BR>
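<BR>
To make the sequence-number argument concrete, here is a minimal Python sketch. It is not taken<BR>
 from any MPI implementation.  It assumes the implementation stamps each message from a given<BR>
 sender with a consecutive sequence number; OrderedReceiver, on_arrival, and the payload strings<BR>
 are hypothetical names made up for illustration.<BR>
<PRE>
```python
# Sketch only: OrderedReceiver and its methods are hypothetical names,
# not part of MPI or of any MPI implementation.

class OrderedReceiver:
    """Hands messages from one sender to matching strictly in sequence order."""

    def __init__(self):
        self.expected_seq = 0   # next sequence number eligible for matching
        self.held = {}          # arrived but not yet eligible: seq -> payload
        self.matched = []       # payloads already handed to the matching logic

    def on_arrival(self, seq, payload):
        """Record an arrival, then match as far as the sequence allows."""
        self.held[seq] = payload
        # Matching proceeds only while the next expected number is present;
        # a gap stops it, no matter how many later messages have arrived.
        while self.expected_seq in self.held:
            self.matched.append(self.held.pop(self.expected_seq))
            self.expected_seq += 1


receiver = OrderedReceiver()
receiver.on_arrival(0, "a")
# Message 1 is lost on the wire and never arrives.
receiver.on_arrival(2, "c")
# The application reposts the lost data, but (by assumption) the library
# stamps the repost with a fresh number, 3, so the gap at 1 remains.
receiver.on_arrival(3, "b again")

print(receiver.matched)        # ['a']
print(sorted(receiver.held))   # [2, 3]
```
</PRE>
In this sketch only a retransmission that carries the original number 1 would release messages<BR>
 2 and 3, and that number is controlled by the implementation, not by the application.<BR>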

<BR>

<BR>

Rich<BR>

<BR>

<BR>

On 2/22/08 10:22 PM, "Greg Bronevetsky" &lt;bronevetsky1@llnl.gov&gt; wrote:<BR>

<BR>

</SPAN></FONT><BLOCKQUOTE><FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'><BR>

<BR>

>I've read the Transactional Messages proposal and I am a little confused<BR>

>here.  Is there a reason why we believe that message faults themselves<BR>

>should be handled by the application layer instead of the MPI library?<BR>

>Using the latter model allows one to reduce the error conditions<BR>

>percolated up to the user to revolve around loss of the actual<BR>

>connection to a process (or the actual process itself).<BR>

<BR>

Actually, one aspect of the proposal is that I made sure not to<BR>

define message faults at a low level. They may be any low-level<BR>

problems that the implementation cannot efficiently deal with on its<BR>

own and that are best represented to the application as message<BR>

drops. One example of this may be process failures. Although we will<BR>

probably want to define a special notification mechanism to expose<BR>

those failures to the application, we will also need a way to expose<BR>

the failures of any communication that involves the process. Another<BR>

example may be simplified MPI implementations that do not have<BR>

facilities for resending messages because the probability of an error<BR>

is rather low and performance is too important. In fact, applications<BR>

that can tolerate message drops may explicitly choose those MPI<BR>

implementations for the performance gains.<BR>

<BR>

Greg Bronevetsky<BR>

Post-Doctoral Researcher<BR>

1028 Building 451<BR>

Lawrence Livermore National Lab<BR>

(925) 424-5756<BR>

bronevetsky1@llnl.gov<BR>


<BR>

</SPAN></FONT></BLOCKQUOTE><FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'><BR>

</SPAN></FONT>

</BODY>

</HTML>