<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Hello Rich,<br>
<br>
I thought it was also agreed that if process A communicates with a failed
process B<br>
that has been restarted by another process C, and this is the first
communication<br>
from A to B since the restart of B, A would receive the equivalent of an
ECONNRESET error.<br>
This was in the context of a case where option 5 below is not being
used by the app.<br>
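<br>
To make that behavior concrete, here is a minimal sketch in plain C (no MPI calls).<br>
It assumes each process carries an incarnation counter that is bumped on restart;<br>
the names proc_table, restart_proc and model_send are hypothetical, made up for<br>
illustration only, and are not part of any agreed interface.<br>

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MODEL_NPROCS 3

/* Hypothetical bookkeeping: not an MPI structure. */
struct proc_table {
    unsigned incarnation[MODEL_NPROCS];        /* current incarnation of each rank */
    unsigned seen[MODEL_NPROCS][MODEL_NPROCS]; /* seen[a][b]: incarnation of b last observed by a */
};

void table_init(struct proc_table *t)
{
    memset(t, 0, sizeof *t);
}

/* Rank `restarter` restarts rank `failed`; the restarter already knows
 * about the new incarnation, so it will not see a reset later. */
void restart_proc(struct proc_table *t, int failed, int restarter)
{
    assert(failed >= 0 && failed < MODEL_NPROCS);
    assert(restarter >= 0 && restarter < MODEL_NPROCS);
    t->incarnation[failed]++;
    t->seen[restarter][failed] = t->incarnation[failed];
}

/* Returns 0 on success, or ECONNRESET on the first communication from
 * `src` to `dst` after `dst` was restarted by some other process. */
int model_send(struct proc_table *t, int src, int dst)
{
    assert(src >= 0 && src < MODEL_NPROCS);
    assert(dst >= 0 && dst < MODEL_NPROCS);
    if (t->seen[src][dst] != t->incarnation[dst]) {
        t->seen[src][dst] = t->incarnation[dst]; /* report the reset only once */
        return ECONNRESET;
    }
    return 0;
}
```

With A = 0, B = 1 and C = 2: after restart_proc(&t, 1, 2), the first model_send(&t, 0, 1)<br>
returns ECONNRESET and the following one returns 0.<br>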
<br>
Howard<br>
<br>
Richard Graham wrote:
<blockquote cite="mid:C524020B.26DA7%25rlgraham@ornl.gov" type="cite">
<title>Summary of today's meeting</title>
<font face="Calibri, Verdana, Helvetica, Arial"><span
style="font-size: 11pt;">Here is a summary of what I think that we
agreed to today. Please correct any errors, and add what I am missing.<br>
<br>
</span></font>
<ul>
<li><font face="Calibri, Verdana, Helvetica, Arial"><span
style="font-size: 11pt;">We need to be able to restore MPI_COMM_WORLD
(and its derivatives) to a usable state when a process fails.
</span></font></li>
<li><font face="Calibri, Verdana, Helvetica, Arial"><span
style="font-size: 11pt;">Restoration may involve having MPI_PROC_NULL
replace the lost process, or may replace the lost process with a new
process (how this would happen has not been specified).
</span></font></li>
<li><font face="Calibri, Verdana, Helvetica, Arial"><span
style="font-size: 11pt;">Processes communicating directly with the
failed processes will be notified via a returned error code about the
failure.
</span></font></li>
<li><font face="Calibri, Verdana, Helvetica, Arial"><span
style="font-size: 11pt;">When a process is notified of the failure,
comm_repair() must be called. Comm_repair() is not a collective call,
and is what will initiate the communicator repair associated with the
failed process.
</span></font></li>
<li><font face="Calibri, Verdana, Helvetica, Arial"><span
style="font-size: 11pt;">If a process wants to be notified of process
failure even if it is not communicating directly with this process, it
must register for this notification.
</span></font></li>
<li><font face="Calibri, Verdana, Helvetica, Arial"><span
style="font-size: 11pt;">We don’t have enough information to know how
to continue with support for checkpoint/restart.
</span></font></li>
<li><font face="Calibri, Verdana, Helvetica, Arial"><span
style="font-size: 11pt;">We need to discuss what needs to be done with
respect to failure of collective communications.<br>
</span></font></li>
</ul>
<font face="Calibri, Verdana, Helvetica, Arial"><span
style="font-size: 11pt;"><br>
There are several issues that came up with respect to these, which will
be detailed later on.<br>
<br>
Rich<br>
</span></font>
</blockquote>
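<br>
The flow in the quoted summary (an error code returned to the process that<br>
communicates directly with the failed process, followed by a non-collective<br>
comm_repair() call) could be sketched as below. This is a plain C model, not MPI:<br>
the summary does not give a signature for comm_repair(), so ft_comm_repair, ft_send,<br>
FT_ERR_PROC_FAILED and the slot states are all hypothetical. The only MPI behavior<br>
it borrows is that a send to MPI_PROC_NULL completes successfully without doing anything.<br>

```c
#include <assert.h>

#define MODEL_COMM_SIZE 4

/* Hypothetical return codes and slot states, for illustration only. */
enum { FT_SUCCESS = 0, FT_ERR_PROC_FAILED = 1 };
enum slot_state { SLOT_ALIVE, SLOT_FAILED, SLOT_NULL };

struct model_comm {
    enum slot_state slot[MODEL_COMM_SIZE];
};

/* Direct communication with a failed rank returns an error code. */
int ft_send(const struct model_comm *c, int dst)
{
    assert(dst >= 0 && dst < MODEL_COMM_SIZE);
    if (c->slot[dst] == SLOT_FAILED)
        return FT_ERR_PROC_FAILED;
    /* SLOT_NULL behaves like MPI_PROC_NULL: the send is a successful no-op. */
    return FT_SUCCESS;
}

/* Non-collective repair: called only by the process that was notified.
 * Each failed slot is either replaced by a new process or by a null slot. */
void ft_comm_repair(struct model_comm *c, int replace_with_new)
{
    for (int i = 0; i < MODEL_COMM_SIZE; i++) {
        if (c->slot[i] == SLOT_FAILED)
            c->slot[i] = replace_with_new ? SLOT_ALIVE : SLOT_NULL;
    }
}

/* Send, and if notified of a failure, repair the communicator and retry once. */
int send_with_repair(struct model_comm *c, int dst, int replace_with_new)
{
    int rc = ft_send(c, dst);
    if (rc == FT_ERR_PROC_FAILED) {
        ft_comm_repair(c, replace_with_new);
        rc = ft_send(c, dst);
    }
    return rc;
}
```

Whether a retry after repair is appropriate is an application decision; the model<br>
only shows that the communicator is usable again once the repair has run.<br>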
<br>
<br>
<pre class="moz-signature" cols="72">--
Howard Pritchard
Cray Inc.
</pre>
</body>
</html>