Yeah I think that is more of a clarification on (A) - if I understand your email correctly.<div>- If the input communicator is -not- collectively active</div><div> - Communicator creation operations will fail in error at all alive calling processes</div>
<div> - So the creation operations never work around emerging failed processes</div><div>- If the input communicator is collectively active</div><div> - Communicator creation operations will return successfully, unless</div>
<div> - There is an emerging 'new' failure that makes the communicator collectively inactive, in which case the operation will either</div><div> - Return successfully everywhere if it has passed the decision point in the protocol (at least one alive process returned 'success' already)</div>
<div> - Return an error everywhere if it is before the decision point in the protocol. Then the decision is 'return an error' and everyone will decide to return an error.</div><div>- Once collectively inactive, subsequent communicator creation operations will return in error immediately</div>
<div> - MPI_Comm_validate is used to reenable the collectives, including communicator creation operations.</div><div> - When MPI_Comm_validate switches a communicator from collectively inactive to collectively active, it creates a 'cut' in the message matching. This means that collective operations (including communicator creation operations) initiated before the MPI_Comm_validate (under these circumstances) do not need to match across the logical 'cut'. So the user does -not- need to post a matching MPI_Bcast or MPI_Comm_dup after the MPI_Comm_validate if a new failure caused the cut to occur. -- But this gets into more of the issues of (3).</div>
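<div><br></div><div>As a rough illustration of the semantics listed above, here is a toy model in plain C (no MPI calls). It is only a sketch under my own assumptions: model_comm, model_create, model_validate and the MODEL_* constants are names made up for this example and are not part of MPI or of the proposal text. It tracks one communicator as seen uniformly by all alive processes.</div>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes for this model; not real MPI constants. */
enum { MODEL_SUCCESS = 0, MODEL_ERR_PROC_FAIL_STOP = 1 };

/* When (if at all) a new failure emerges during a creation operation. */
typedef enum {
    MODEL_NO_FAILURE,
    MODEL_FAIL_BEFORE_DECISION,
    MODEL_FAIL_AFTER_DECISION
} model_failure;

/* Toy view of one communicator, identical at every alive process. */
typedef struct {
    bool collectively_active; /* false while a failure is not yet validated */
    int  cuts;                /* number of matching 'cuts' created so far */
} model_comm;

/* Models a communicator creation operation (uniform result everywhere). */
int model_create(model_comm *c, model_failure f)
{
    assert(c != NULL);
    if (!c->collectively_active)
        return MODEL_ERR_PROC_FAIL_STOP;   /* inactive: error immediately */
    if (f == MODEL_NO_FAILURE)
        return MODEL_SUCCESS;
    /* A new failure makes the communicator collectively inactive. */
    c->collectively_active = false;
    /* Past the decision point the operation still succeeds everywhere. */
    return (f == MODEL_FAIL_AFTER_DECISION) ? MODEL_SUCCESS
                                            : MODEL_ERR_PROC_FAIL_STOP;
}

/* Models MPI_Comm_validate: re-enables collectives, cutting only on a switch. */
void model_validate(model_comm *c)
{
    assert(c != NULL);
    if (!c->collectively_active) {
        c->collectively_active = true;
        c->cuts++;
    }
}
```

<div>In this model a creation call that fails after the decision point still returns success, but the next creation call returns an error until model_validate is called.</div>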
<div><br></div><div>How do those semantics sound to folks?</div><div><br></div><div>-- Josh</div><div><br><div class="gmail_quote">On Thu, Feb 2, 2012 at 5:28 PM, Sur, Sayantan <span dir="ltr"><<a href="mailto:sayantan.sur@intel.com">sayantan.sur@intel.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div lang="EN-US" link="blue" vlink="purple">
<div>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Hi Josh,<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">I agree on (1).<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Regarding (2), I’m wondering if another case C is feasible. What do you think?<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">( C ) if the input communicator is not required to be collectively active: MPI_Comm_split returns with error everywhere (all live processes) if it discovers
any existing or emerging failure. i.e. it does not try to ‘work around’ either type of failure existing or emerging.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">i.e. MPI_Comm_split automatically returns “valid” communicators, if the original communicator had no failures. In case of failure, it simply returns error.
In this case, MPI_Comm_validate() needs to be called in the error path, not in the error free path.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Thanks.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">===</span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Sayantan Sur, Ph.D.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d">Intel Corp.<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1f497d"><u></u> <u></u></span></p>
<div style="border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt">
<div>
<div style="border:none;border-top:solid #b5c4df 1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">From:</span></b><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> <a href="mailto:mpi3-ft-bounces@lists.mpi-forum.org" target="_blank">mpi3-ft-bounces@lists.mpi-forum.org</a> [mailto:<a href="mailto:mpi3-ft-bounces@lists.mpi-forum.org" target="_blank">mpi3-ft-bounces@lists.mpi-forum.org</a>]
<b>On Behalf Of </b>Josh Hursey<br>
<b>Sent:</b> Wednesday, February 01, 2012 2:26 PM<br>
<b>To:</b> MPI 3.0 Fault Tolerance and Dynamic Process Control working Group<br>
<b>Subject:</b> Re: [Mpi3-ft] Matching MPI communication object creation across process failure<u></u><u></u></span></p>
</div>
</div><div><div class="h5">
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Some notes regarding this thread from the teleconf:<u></u><u></u></p>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">There are three components to the discussion of how communicator creation operations should behave:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> - (1) Uniform creation of the object (created everywhere or nowhere)<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> - (2) Requires the input communicator to be collectively active<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> - (3) Matches across emerging process failure<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">(1) seems to be something we all agree on.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">(2) we discussed a bit more on the call. Consider an MPI_Comm_split operation where the application wants to divide the communicator in half.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> - (A) If the input communicator is required to be collectively active then the MPI_Comm_split will return an error if a process fails (making the input communicator collectively inactive) and the object cannot be uniformly created. So MPI_Comm_split
does not 'work around' emerging failure, but errors out at the first sign of the failure before the 'decision point' in the protocol. The 'decision point' is the point in time where the communicator is designated as created.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> - (B) If the input communicator is -not- required to be collectively active then the MPI_Comm_split must be able to work around existing and emerging failure to agree upon the group membership of the output communicator(s). The resulting
communicators may be of unintended size since the MPI_Comm_split is working around the errors. So the user is forced to call MPI_Comm_validate after the MPI_Comm_split on the input communicator to make sure that all processes exited the MPI_Comm_split with
acceptable communicators (if such a distinction is important).<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">So in (A) the following program is correct:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">-------------------------<u></u><u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">if( comm_rank < (comm_size/2) ) {<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> color = 0;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">} else {<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> color = 1;<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">}<u></u><u></u></p>
</div>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">do {<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> ret = MPI_Comm_split( comm, color, key, &new_comm );<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> if( ret == MPI_ERR_PROC_FAIL_STOP ) {<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> /* Re-enable collectives over this communicator */<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> MPI_Comm_validate( comm, &failed_grp );<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> MPI_Group_free( &failed_grp );<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> /* Adjust colors to account for failed processes */<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> }<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> /* If there was a process failure, then try again */<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">} while( ret == MPI_ERR_PROC_FAIL_STOP );<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">-------------------------<u></u><u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">In this example MPI_Comm_split will either return successfully with the exact groups asked for, or will return an error if a new process failure is encountered before the communicators are created. One nice thing to point out is that MPI_Comm_validate()
is only called in the error path, and not in the failure free path.<u></u><u></u></p>
</div>
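<div>
<p class="MsoNormal">One possible way to fill in the "adjust colors" step of the loop above is sketched below in plain C. This is an assumption, not something the proposal specifies: the helper name half_split_color and the alive[] array (which the application would have to derive from the failed group returned by MPI_Comm_validate) are hypothetical. It assigns color 0 to the first half of the surviving ranks and color 1 to the rest.<u></u><u></u></p>
</div>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: choose a split color for `rank` so that the
 * surviving processes are divided in half by their order among the
 * alive ranks. alive[i] is true if rank i has not been reported failed.
 * Returns 0 or 1, or -1 if `rank` itself is marked as failed.
 */
int half_split_color(const bool *alive, int size, int rank)
{
    assert(alive != NULL && rank >= 0 && rank < size);
    if (!alive[rank])
        return -1;

    int n_alive = 0; /* total number of surviving ranks */
    int pos = 0;     /* index of `rank` among the surviving ranks */
    for (int i = 0; i < size; i++) {
        if (alive[i]) {
            if (i < rank)
                pos++;
            n_alive++;
        }
    }
    /* Same rule as the original example, applied to survivors only. */
    return (pos < n_alive / 2) ? 0 : 1;
}
```

<div>
<p class="MsoNormal">With no failures this reduces to the original comm_rank &lt; (comm_size/2) test.<u></u><u></u></p>
</div>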
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">In (B) it is less likely that MPI_Comm_split will return in error since it is working around existing and emerging process failure. Because it is working around emerging failure, it is possible that the resulting communicators are of a
size that is not acceptable to the application. So the application will need to do some collective, and call MPI_Comm_validate to ensure that collective completed everywhere before deciding to use the new communicators.<u></u><u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">-------------------------<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">...<u></u><u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">} while( ret == MPI_ERR_PROC_FAIL_STOP );<u></u><u></u></p>
</div>
</div>
<div>
<p class="MsoNormal">/* Collective operation to check if communicators are of acceptable length */<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">/* MPI_Comm_validate() to make sure the collective above completed successfully everywhere */<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">/* If acceptable then continue */<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">/* If unacceptable, then destroy them and call MPI_Comm_split again with new colors */<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">-------------------------<u></u><u></u></p>
</div>
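<div>
<p class="MsoNormal">The "collective operation to check" step in (B) could look roughly like a logical AND over each process's local verdict. The plain C sketch below only models that reduction in a single process. The name all_acceptable, the new_sizes[] array (the size of the new communicator each alive process ended up in) and the min_size threshold are assumptions made for illustration. The follow-up MPI_Comm_validate is not modeled here.<u></u><u></u></p>
</div>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a collective logical AND: every alive
 * process contributes "my new communicator has at least min_size
 * members", and the result is true only if all of them agree.
 */
bool all_acceptable(const int *new_sizes, size_t n, int min_size)
{
    assert(new_sizes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (new_sizes[i] < min_size)
            return false; /* one unacceptable communicator vetoes all */
    }
    return true;
}
```

<div>
<p class="MsoNormal">If the result is false, the application would destroy the new communicators and split again, as in the outline above.<u></u><u></u></p>
</div>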
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">So (B) pushes the user to call MPI_Comm_validate in the failure free code path, and do some additional checking. Option (A) keeps the MPI_Comm_validate in the failure full/error path.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">So it seems that requiring the input communicator to be collectively active (option A) is somewhat better.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Regarding matching across failure (3), this seems to be implied by (1), since the uniformity of returning is provided by an agreement protocol. The question is how difficult would it be to not require matching across failure (3) if we know
that the creation call is allowed to exit in error if it encounters an error before agreement (Option 2.A above)? I am fairly sure that we can implement this safely, but I would like to dwell on it a bit more to be sure.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">What do others think about these points?<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">-- Josh<u></u><u></u></p>
</div>
</div>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<p class="MsoNormal">On Tue, Jan 31, 2012 at 12:35 PM, Josh Hursey <<a href="mailto:jjhursey@open-mpi.org" target="_blank">jjhursey@open-mpi.org</a>> wrote:<u></u><u></u></p>
<p class="MsoNormal">Actually the following thread might be more useful for this discussion:<u></u><u></u></p>
<div>
<p class="MsoNormal"> <a href="http://lists.mpi-forum.org/mpi3-ft/2011/12/0940.php" target="_blank">http://lists.mpi-forum.org/mpi3-ft/2011/12/0940.php</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">The example did not come out well in the archives, so below is the diagram again (hopefully that will work):<u></u><u></u></p>
</div>
<div>
<div>
<p class="MsoNormal">So the process stack looks like:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">P0 P1<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">--------------- ----------------<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Dup(comm[X-1]) Dup(comm[X-1])<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">MPI_Allreduce() MPI_Allreduce()<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Dup(comm[X]) -> Error<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> -> Error<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">So should P1 be required to call Dup(comm[X])?<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><span style="color:#888888"><u></u> <u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="color:#888888">-- Josh<u></u><u></u></span></p>
</div>
<div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
<div>
<p class="MsoNormal">On Wed, Jan 25, 2012 at 5:08 PM, Josh Hursey <<a href="mailto:jjhursey@open-mpi.org" target="_blank">jjhursey@open-mpi.org</a>> wrote:<u></u><u></u></p>
<p class="MsoNormal">The current proposal states that MPI object creation functions (e.g., MPI_Comm_create, MPI_Win_create, MPI_File_open):<u></u><u></u></p>
<div>
<p class="MsoNormal">-------------------<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">All participating communicator(s) must be collectively active before calling any communicator creation operation. Otherwise, the communicator creation operation will uniformly raise an error code of the class MPI_ERR_PROC_FAIL_STOP.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">If a process failure prevents the uniform creation of the communicator then the communicator construction operation must ensure that the communicator is not created, and all alive participating processes will raise an error code of the
class MPI_ERR_PROC_FAIL_STOP. Communicator construction operations will match across the notification of a process failure. As such, all alive processes must call the communicator construction operations the same number of times regardless of whether the emergent
process failure makes the call irrelevant to the application.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">-------------------<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">So there are three points here:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> (1) That the communicator must be 'collectively active' before calling the operation,<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> (2) Uniform creation of the communication object, and<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> (3) Creation operations match across process failure.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Point (2) seems to be necessary so that all processes only ever see a communication object that is consistent across all processes. This implies a fault tolerant agreement protocol (group membership).<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">There was a question about why point (3) is necessary. We (Darius, UTK, and I) discussed this on 12/12/2011, and I posted my notes on this to the list:<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> <a href="http://lists.mpi-forum.org/mpi3-ft/2011/12/0938.php" target="_blank">
http://lists.mpi-forum.org/mpi3-ft/2011/12/0938.php</a><u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Looking back at my written notes, they don't have much more than that to add to the discussion.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">So the problem with (3) seemed to arise out of the failure handlers, though I am not convinced that they are strictly to blame in this circumstance. It seems that the agreement protocol might be factoring into the discussion as well since
it is strongly synchronizing: if not all processes call the operation, how does it know when to bail out? The peer processes are (a) calling that operation, (b) going to call it but have not yet, or (c) will never call it because they decided independently not
to, based on a locally reported process failure.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">It seems that the core problem has to do with when to break out of the collective creation operation, and when to restore matching.<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">So should we reconsider the restriction on (3)? More to the point, is it safe to not require (3)?<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><span style="color:#888888"><u></u> <u></u></span></p>
</div>
<div>
<p class="MsoNormal"><span style="color:#888888">-- Josh<u></u><u></u></span></p>
</div>
<div>
<div>
<p class="MsoNormal"><span style="color:#888888"><u></u> <u></u></span></p>
</div>
<p class="MsoNormal"><span style="color:#888888">-- <br>
Joshua Hursey<br>
Postdoctoral Research Associate<br>
Oak Ridge National Laboratory<br>
<a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><u></u><u></u></span></p>
</div>
</div>
<p class="MsoNormal"><br>
<br clear="all">
<u></u><u></u></p>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
</div>
</div>
</div>
</div>
<p class="MsoNormal"><br>
<br clear="all">
<u></u><u></u></p>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
</div>
</div></div></div>
</div>
</div>
</blockquote></div><br></div><br clear="all"><div><br></div>-- <br>Joshua Hursey<br>
Postdoctoral Research Associate<br>Oak Ridge National Laboratory<br><a href="http://users.nccs.gov/~jjhursey" target="_blank">http://users.nccs.gov/~jjhursey</a><br>