[Mpi3-ft] Nonblocking Process Creation and Management

Josh Hursey jjhursey at open-mpi.org
Thu Apr 29 10:17:30 CDT 2010

Thanks for the feedback, more below.

On Apr 27, 2010, at 11:53 PM, Solt, David George wrote:

> This document says:
> 	"This call starts a nonblocking variant of MPI_COMm_SPAWN. It is
> 	erroneous [...] associated with the MPI_ICOMM_SPAWN operation.
> 	[...] is marked for cancellation using MPI_CANCEL, then it must be
> 	the case that either the operation completed ...."
> Maybe the first paragraph was meant for deletion?

Thanks for catching that. Fixed now.

>  It conflicts with the 2nd one.
> 	"It is a valid behavior for the MPI_COMM_ACCEPT call to
> 	timeout with accepting connections, and should not be considered
> 	an error."
> I think 'with' should be 'without' in the above text.

Yes, it should be 'without'. Good catch, thanks.

> I haven't been focused on any discussions about cancel, so apologies  
> if my concerns have already been discussed.  We have only  
> implemented non-blocking MPI_Icomm_accept for a  singleton accepting  
> from another singleton so far since that's what we needed.  As I try  
> to expand this to the general case I'm getting concerned about the  
> difficulty of canceling any of these non-blocking collectives.

Yeah, it is going to be a point of concern, if only because we are
expanding the use of MPI_Cancel to include these collective operations
and not others (previously, cancel was defined only for point-to-point
communication).

> I doubt that we can rely on non-blocking collectives to implement
> MPI_Icomm_join, MPI_Icomm_accept, etc. since the non-blocking  
> collective group has decided they can't be cancelled (at least not  
> before the point where the operation becomes un-cancellable).  If  
> implementers internally allow cancelling collectives then why not  
> expose cancel of NB collectives to the users?  Has this group talked  
> to the NB collectives group about why cancel was not allowed for NB  
> collectives.  I am aware that the need to cancel an outstanding  
> MPI_Comm_accept was a motivator for this whole strategy.  We may  
> want some restriction, such as MPI_Cancel can only be done at the  
> root of an MPI_Icomm_accept or connect.

Applying MPI_Cancel more broadly to the entire class of NB collectives
is difficult, since many of the collectives have a 'leave-early'
semantic, which makes it unclear what cancel could mean once some of
the processes have already left the collective (maybe the operation
just becomes non-cancelable at that point). I agree that we should
talk with the collectives group about this a bit more. I hope we will
get a chance to do so next week at the Forum.

The one interesting aspect of the collective operations in the
dynamics chapter is that they are all communicator-creation
collectives. So they have an implicit barrier at the bottom, where
the explicit or implicit 'root' of the collective decides whether or
not to commit/create the new communicator. So the root can decide
when/if the communicator-creation call should be canceled while
creating it.

So we discussed whether only the root should be allowed to call
cancel. After discussion at the last meeting and in the teleconfs, we
felt that it was relatively easy for the MPI implementation to support
MPI_Cancel from any participating process. A non-root process
participating in the MPI_Comm_connect (for example) could call
MPI_Cancel, which relays a message to the root of the collective
indicating that rank X requested that the operation be canceled. The
root then decides whether or not to cancel the operation and sends
out the decision to all participating processes (which are waiting
for a decision to determine whether they have created a communicator
or not).
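To make the control flow concrete, here is a small Python simulation of
that relay-and-decide scheme. This is a sketch only, NOT MPI code: the
function and parameter names below are invented for illustration and
are not part of the proposal or of any MPI API.

```python
# Toy simulation (NOT MPI code) of the root-decides cancel protocol
# sketched above. All names here are illustrative assumptions.

def run_accept(nprocs, root, cancel_requests, root_grants_cancel):
    """Simulate one nonblocking-accept-style operation across nprocs ranks.

    cancel_requests:    set of ranks that called 'cancel' before completion
                        (non-root requests are relayed to the root).
    root_grants_cancel: whether the root honors a cancel request.
    Returns a dict mapping each rank to its outcome: 'cancelled' or 'created'.
    """
    assert 0 <= root < nprocs  # the root alone makes the global decision

    # Any cancel call, local or relayed, is only acted upon at the root.
    root_saw_request = len(cancel_requests) > 0

    # The root decides once, then "broadcasts" the decision, so every
    # participant observes the same result: either all ranks end up with
    # the new communicator or none do.
    decision = "cancelled" if (root_saw_request and root_grants_cancel) else "created"
    return {rank: decision for rank in range(nprocs)}

# Rank 2 (a non-root) requests cancellation and the root grants it; every
# rank then reports 'cancelled', analogous to MPI_Test_cancelled being true.
outcome = run_accept(nprocs=4, root=0, cancel_requests={2}, root_grants_cancel=True)
```

The key design point this models is that agreement is reached at the
implicit barrier the operation already has, so the cancel decision rides
on an exchange the implementation must perform anyway.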

So that was the line of reasoning that brought us to the current  
MPI_Cancel semantics for the collective operations in the dynamics  
chapter. I don't know if this line of reasoning will apply more  
broadly to the collectives chapter for NB collectives.

> Is the following code legal MPI code?
> {
> 	MPI_Init(..);
> 	....
> 	if (rank == size-1) {
> 		MPI_Icomm_accept(......, 0, comm, &newcomm, &req);
> 		MPI_Cancel(&req);
> 		MPI_Wait(&req, &status);
> 	}
> 	MPI_Finalize();
> }

Yes, but with one modification (add MPI_Test_cancelled):
	if (rank == size-1) {
		MPI_Icomm_accept(......, 0, comm, &newcomm, &req);
		MPI_Cancel(&req);  /* Local operation (relayed to the root as needed) */
		MPI_Wait(&req, &status);
		MPI_Test_cancelled(&status, &flag);
		if (flag) {
			/* Successfully canceled */
		} else {
			/* Cancel failed; the accept completed the connection */
		}
	}

> If it is legal, I think cancel is going to add a huge amount of  
> overhead.  If it is not legal, then things become easier, but I  
> think the additional overhead of ensuring consensus between ranks  
> will outweigh any performance gains we think these routines will  
> provide.

We already need to ensure consensus among the ranks in the 'comm'
communicator in order to commit/create the communicator in the first
place (if only for the CID allocation). So this is already part of
the blocking versions of these calls. I don't see how adding cancel
adds any more overhead, since at the bottom of the operation we are
just sending either 'success' or 'cancelled' (or 'failed') to all
participating processes.

However, I might be misunderstanding the problem that you are trying
to highlight in your example. If so, can you elaborate a bit more?

> I'm against the "non-blocking version of everything" approach.  I  
> understand the desire for orthogonality, but this is MPI Standard  
> bloating to me.  I am pretty much bound by my funding management to
> vote for any proposal that includes MPI_Icomm_accept or any other  
> functionality we need.  However, I really don't like it.  Does  
> anyone want MPI_Iunpublish_name?   What % of applications call  
> MPI_Unpublish_name?  What % of their execution time is spent in  
> MPI_Unpublish_name?  I think we should look at each call and  
> consider each individually rather than just make non-blocking  
> versions of all of them.

I agree that we should consider each of the functions individually to
determine whether it has merit as a nonblocking operation. If we can
think of a reasonable use case for a function, then I am inclined to
add it to the standard. However, I do not agree with the argument
that a function or feature should be excluded from consideration just
because most applications, or fewer than some percentage of them, do
not currently use it. Take MPI_Comm_accept(), for example: many
applications that may have wanted to use it chose not to because it
did not have a nonblocking version (leading to non-standard
nonblocking implementations of it). It is impossible for us to
quantify the number of users that avoid a function because it is
perceived as deficient. What we can do is this: if we can convince
ourselves that there is a real use case for the function, then we
should be open to including it; if we cannot, then we should leave it
out of the standard for now.

That is a long way around to saying that, overall, I agree with you:
we should consider the usefulness of these functions before pressing
them into the standard. In the first draft of this proposal I created
a nonblocking version of every function that I could convince myself
had a use case (particularly those functions that rely on an external
resource for completion). But I am open to trimming the proposal if
folks feel that some of the nonblocking functions are unnecessary.
Slightly unrelated: I am, however, not open to pushing for the
deprecation of any functions here, as that should be considered in a
completely separate proposal.

As a side note, the reason for the nonblocking 'publish' and
'unpublish' is that these operations may (often) rely on an external
name server. The name server may be slow, so having the nonblocking
versions limits the impact of a slow server on the application. The
'timeout' keyword has a similar purpose, but is primarily intended
for the blocking versions. So the question is whether we can convince
ourselves that there is a use case in which an application would
prefer the nonblocking version over a blocking+timeout version.

Thanks for the feedback :)

-- Josh

> Dave S.
> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org 
> ] On Behalf Of Josh Hursey
> Sent: Wednesday, April 21, 2010 9:44 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] Nonblocking Process Creation and Management
> I updated the Nonblocking Process Creation and Management proposal on
> the wiki:
>   https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/Async-proc-mgmt
> The new version reflects conversations over the past couple months
> about the role of MPI_Cancel in the various nonblocking interfaces,
> and some touchups on the timeout language.
> I think the proposal is pretty stable at the moment. If you have
> any issues with the current proposal let me know either on the list or
> the teleconf.
> Thanks,
> Josh
> On Jan 12, 2010, at 5:03 PM, Josh Hursey wrote:
>> I extended and cleaned up the Nonblocking Process Creation and
>> Management proposal on the wiki:
>>  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/Async-proc-mgmt
>> I added the rest of the nonblocking interface proposals, and touched
>> up some of the language. I do not have an implementation yet, but
>> will work on that next. There are a few items that I need to refine
>> a bit still (e.g., MPI_Cancel, mixing of blocking and nonblocking),
>> but this should give us a foundation to start from.
>> I would like to talk about this next week during our working group
>> slot at the MPI Forum meeting.
>> Let me know what you think, and if you see any problems.
>> Thanks,
>> Josh
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
