Actually the following thread might be more useful for this discussion:<div> <a href="http://lists.mpi-forum.org/mpi3-ft/2011/12/0940.php">http://lists.mpi-forum.org/mpi3-ft/2011/12/0940.php</a></div><div><br></div><div>The example did not come out well in the archives, so below is the diagram again (hopefully that will work):</div>
<div><div>So the process stack looks like:</div><pre>P0                 P1
---------------    ----------------
Dup(comm[X-1])     Dup(comm[X-1])
MPI_Allreduce()    MPI_Allreduce()
Dup(comm[X])        -> Error
 -> Error</pre>
<div><br></div><div>So should P1 be required to call Dup(comm[X])?</div><div><br></div><div>-- Josh</div><br><div class="gmail_quote">On Wed, Jan 25, 2012 at 5:08 PM, Josh Hursey <span dir="ltr">&lt;<a href="mailto:jjhursey@open-mpi.org">jjhursey@open-mpi.org</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">The current proposal states the following for MPI object creation functions (e.g., MPI_Comm_create, MPI_Win_create, MPI_File_open):<div>
-------------------</div><div>All participating communicator(s) must be collectively active before calling any communicator creation operation. Otherwise, the communicator creation operation will uniformly raise an error code of the class <span>MPI_ERR_PROC_FAIL_STOP</span>.</div>
<div><br></div><div>If a process failure prevents the uniform creation of the communicator then the communicator construction operation must ensure that the communicator is not created, and all alive participating processes will raise an error code of the class <span>MPI_ERR_PROC_FAIL_STOP</span>. Communicator construction operations will match across the notification of a process failure. As such, all alive processes must call the communicator construction operations the same number of times regardless of whether the emergent process failure makes the call irrelevant to the application.</div>
<div>-------------------</div><div><br></div><div>So there are three points here:</div><div> (1) That the communicator must be 'collectively active' before calling the operation,</div><div> (2) Uniform creation of the communication object, and</div>
<div> (3) Creation operations match across process failure.</div><div><br></div><div>Point (2) seems to be necessary so that all processes only ever see a communication object that is consistent across all processes. This implies a fault tolerant agreement protocol (group membership).</div>
<div><br></div><div>There was a question about why point (3) is necessary. We (Darius, UTK, and I) discussed this on 12/12/2011, and I posted my notes on this to the list:</div><div> <a href="http://lists.mpi-forum.org/mpi3-ft/2011/12/0938.php" target="_blank">http://lists.mpi-forum.org/mpi3-ft/2011/12/0938.php</a></div>
<div>Looking back at my written notes, they don't have much more than that to add to the discussion.</div>
<div><br></div><div>So the problem with (3) seemed to arise out of the failure handlers, though I am not convinced that they are strictly to blame in this circumstance. It seems that the agreement protocol might be factoring into the discussion as well, since it is strongly synchronizing: if not all processes call the operation, how does it know when to bail out? The peer processes are (a) calling that operation, (b) going to call it but have not yet, or (c) never going to call it because they independently decided not to, based on a locally reported process failure.</div>
<div><br></div><div>It seems that the core problem has to do with when to break out of the collective creation operation, and when to restore matching.</div><div><br></div><div>So should we reconsider the restriction on (3)? More to the point, is it safe to not require (3)?</div>
<span class="HOEnZb"><font color="#888888">
<div><br></div><div>-- Josh</div><div><div><br></div>-- <br>Joshua Hursey<br>
Postdoctoral Research Associate<br>Oak Ridge National Laboratory<br><a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><br>
</div>
</font></span></blockquote></div>
</div>
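<div><br></div><div>To make the matching question in the two-process example at the top of this message concrete, here is a rough toy model in Python. It is not MPI code and says nothing about how an implementation would detect the situation. The helper names (creation_calls, unmatched_creations) and the idea of representing each process as a list of call names are assumptions made purely for illustration. The model only counts communicator creation calls per process and reports the ones that no peer call lines up with, which is the property point (3) is meant to rule out.</div>
<pre>
```python
# Toy model, not MPI: each process is a list of the collective calls it makes,
# in program order. Creation calls are matched across processes by position.

def creation_calls(trace):
    """Return only the communicator creation calls from one process trace."""
    return [op for op in trace if op.startswith("Dup")]


def unmatched_creations(traces):
    """Map each process to the creation calls that have no peer to match.

    The i-th creation call of a process is matched only if every other
    process also makes an i-th creation call.
    """
    per_proc = {p: creation_calls(t) for p, t in traces.items()}
    common = min(len(calls) for calls in per_proc.values())
    return {p: calls[common:] for p, calls in per_proc.items()}


# Case 1 (hypothetical): P1 decides locally to skip Dup(comm[X]).
skip = {
    "P0": ["Dup(comm[X-1])", "MPI_Allreduce()", "Dup(comm[X])"],
    "P1": ["Dup(comm[X-1])", "MPI_Allreduce()"],
}

# Case 2: both processes call Dup(comm[X]), as point (3) would require.
both = {
    "P0": ["Dup(comm[X-1])", "MPI_Allreduce()", "Dup(comm[X])"],
    "P1": ["Dup(comm[X-1])", "MPI_Allreduce()", "Dup(comm[X])"],
}

print(unmatched_creations(skip))  # {'P0': ['Dup(comm[X])'], 'P1': []}
print(unmatched_creations(both))  # {'P0': [], 'P1': []}
```
</pre>
<div>In the first case P0 is left with a creation call that nothing on P1 matches, so P0 would need some other way to know when to bail out. In the second case both calls line up, and both processes can be told about the failure from inside the same operation.</div>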