[Mpi3-ft] Review comments

Joshua Hursey jjhursey at open-mpi.org
Thu Feb 17 09:19:26 CST 2011


Comments in-line below.

On Feb 16, 2011, at 8:58 PM, Adam T. Moody wrote:

> Another iteration of feedback (see inlined comments below).
> -Adam
> 
> Joshua Hursey wrote:
> 
>> Thanks for the notes. I'm working on adding them to the document, but I may not get it uploaded until Monday. More notes below.
>> 
>> On Feb 8, 2011, at 1:15 PM, Moody, Adam T. wrote:
>> 
>> 
>> 
>>> Hi Josh,
>>> Nice work putting all of this together.  Here are some comments after reviewing the latest stabilization text.
>>> -Adam
>>> 
>>> 1) In 2.4, define terms "unrecognized failure", "recognized failure" and "failed rank" up front.  To new readers, it's not clear later on what "failed rank" means.
>>> 
>>> 
>> 
>> I agree that this would help.
>> 
>> 
>> 
>>> 2)  In 2.8, change "the application" to "a process" (semantics) and "terminates" to "terminated" (typo):
>>> 
>>> "If the application receives notification of a process failure, the application can be assured that the specified process (identified by rank and generation) is terminates and is no longer participating in the job."
>>> 
>>> One may misinterpret "the application" to mean "all processes in the application", which is incorrect.  Maybe replace this with something like the following:
>>> 
>>> "When a process receives notification that another process has failed, it may continue under the assumption that the failed process is no longer participating in the job.  Even in cases where the failed process may not have terminated, the MPI implementation will ensure that no data is delivered to or from the failed process."
>>> 
>>> 
>> 
>> I think that the revision is more precise.
>> 
>> 
>> 
>>> 3) Examples of 3.10
>>> 3a) In the first example, there is no status object for MPI_Send.
>>> 
>>> 
>> 
>> Got it. A few of the other examples in this section have the same problem (cut-and-paste errors). I'll fix them both here and in the User's Document.
>> 
>> 
>> 
>>> 3b) Some of the ANY_SOURCE examples are non-deterministic, e.g., some of the receive calls could return successfully because they also match the send from rank 3.  In this case, the explicit receive for rank 3 will block and the application will not be notified of the rank 2 failure.
>>> 
>>> 
>> 
>> Good catch. Since it is for illustration purposes only, I may drop the last recv from peer=3, and then add an example of how MPI_Comm_validate() can be used to determine that the failed rank was 2.
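For illustration, a rough sketch of what that validate-based check might look like. Note this is pseudocode against the proposed interface: the MPI_Comm_validate() signature, the MPI_Rank_info fields, and the MPI_ERR_FAIL_STOP error class used below are assumptions from the current draft, not settled API.

    MPI_Status status;
    int rc, incount = 1, outcount;
    MPI_Rank_info failed[1];

    rc = MPI_Recv(buf, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm, &status);
    if (MPI_ERR_FAIL_STOP == rc) {
        /* Ask the MPI library which ranks it knows to have failed */
        MPI_Comm_validate(comm, incount, &outcount, failed);
        /* failed[0] now identifies the failed rank (e.g., rank 2) */
    }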
>> 
>> 
>> 
>> 
>>> 3c) To again emphasize the local nature of process failure, it would be good to note before the examples that a failure denoted in the code represents the time at which the underlying MPI library of the calling process has been notified of the failure of the specified rank.  Or instead of "/*** Rank 2 failed ***/" change the comment to read "/*** MPI library of Rank 0 is notified that Rank 2 failed ***/" (or something like that).
>>> 
>>> 
>>> 
>> 
>> I'll add a note to this effect at the front of the examples section(s). That will help keep the pseudo-ish code from getting too verbose.
>> 
>> 
>> 
>> 
>>> 4) Question about whether to leave receive buffer elements for failed processes to be "undefined" or "not modified".  Allowing them to be undefined permits certain optimizations in MPI, however, it requires the application to keep a list of failed procs, whereas with not modified, this behaves more like PROC_NULL does today and the app can rely on MPI to keep track of the failed processes (by initializing the receive buffer with NULL data).
>>> 
>>> 
>> 
>> Per the MPI Forum FT WG meeting, I'll add a discussion point to the wiki so we can keep talking about it on the calls.
>> 
>> 
>> 
>>> 5) Question about whether to require all processes to continue calling all collectives even when a failure has occurred during an earlier collective.
>>> 
>>> 
>> 
>> Currently all processes are not required to continue calling all collectives after a failure has occurred. Such a requirement would run counter to a user's intuition about handling error codes in functions, thus setting an odd precedent in the error handling model. If there is a strong case for why this would be needed then we should consider it further, but I need a bit more convincing. (Maybe you can start a new thread on this topic for further discussion.)
>> 
>> 
> I'm leaning toward the semantics of the current proposal.  I haven't
> fully thought this through, but I didn't want to drop this item from
> discussion.

It's a good point, and may need to be clarified.

> 
>> 
>> 
>>> 6) In COMM_VALIDATE, want to set outcount to actual number needed in cases when incount is too small?  Then app knows to call again with correct size for incount (unless there is another failure in between, in which case, the app can iterate again).
>>> 
>>> 
>> 
>> I'll spin up a separate email thread about this topic for further discussion.
>> 
>> 
>> 
>>> Suggestion that the two variables could even be combined into a single INOUT.
>>> 
>>> 
>> 
>> There is precedent to have them as separate variables (MPI_WAITSOME, MPI_TESTSOME), so that is why we set the interface up like this. I guess I need more convincing that this is a necessary feature of the interface.
>> 
>> 
> Following current precedent is good enough for me.

Cool.

> 
>> 
>> 
>>> 7) In GROUP_VALIDATE, I take it that the process must list the ranks it wants to know about in the RANK_INFO list.  However, do we have the user specify ranks via semi-opaque structures in other MPI functions, or should we just have the user specify ranks in a list of ints and make the RANK_INFO objects pure output?
>>> 
>>> 
>> 
>> Ranks are contiguous starting from 0 in the group, so the ranks returned in the MPI_Rank_info objects reference the ranks in the group passed to the function. In other words, the set of ranks is specified when the caller acquires the group.
>> 
>> Maybe I misunderstood your question.
>> 
>> 
> No, I missed the point on this function.  For some reason, I was
> thinking that, like GROUP_VALIDATE_RANK, the caller specified the set of
> ranks it wanted to know about.  After re-reading, I understand, so
> nevermind.

Ok

> 
>> 
>> 
>>> 8) Do we really want / need to provide group query functions, since users can not clear failures in a group anyway?
>>> 
>>> 
>> 
>> This is useful primarily for File Handles and Windows where you can access the group associated with those handles, but not the original communicator used to create them (and by extension when creating sub-groups for One-sided epochs).
>> 
>> So even though they cannot 'recognize' the failed process, it may be enough to know that they are either active or failed.
>> 
>> 
> Instead of using GROUP_VALIDATE calls, I vote for using the WIN_VALIDATE
> / FILE_VALIDATE calls for this.  I opt for removing the group-specific
> calls unless we find that we really need them elsewhere.

So when re-reading the text I came to a similar realization - that we need WIN and FILE versions of the COMM_VALIDATE and COMM_VALIDATE_RANK functions since groups do not distinguish between recognized and unrecognized failures.

I kept the Group specific options in there for now since I am not completely convinced they are not useful. Since groups are flying around throughout the standard, it might be useful for an application to check for alive/failed ranks in a group object that it created some time earlier. It is not a strong use case, so we might decide to remove them before the final proposal if we convince ourselves sufficiently. I'll keep them in there for now, so we don't lose track of them.

> 
>> 
>> 
>>> 9) Want to require communicator creation calls to complete successfully everywhere?  Can we instead just treat them as normal collectives by surrounding them with comm_validate_all checks?
>>> 
>>> 
>> 
>> So we need the communicator creation calls to provide strict guarantees since we don't want to return an invalid object somewhere and a valid object elsewhere. It is mostly a usability argument, but also one about the complexity of the interface: if we allow mixed 'valid' and 'invalid' versions of the communicator to exist at the same time, what mechanisms do we provide to the user to resolve this problem?
>> 
>> 
> I see your point.

The MPI_Comm_validate_all() email thread is related to this topic (for those folks following this thread, but not the other).

> 
>> Internally there are a few ways to implement this. The easiest technique is to surround the old communicator creation call with calls to MPI_Comm_validate_all(), and loop on failure. However, there are better-performing ways to implement this operation. By pushing the requirement into the MPI implementation, we should be able to optimize this. But it does add more overhead to communicator creation calls.
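For reference, that simple wrap-and-retry technique might look something like the following pseudocode. MPI_Comm_validate_all() is the proposed collective, and its argument list here is an assumption:

    MPI_Comm newcomm;
    int rc, num_failed;

    do {
        rc = MPI_Comm_dup(comm, &newcomm);
        if (MPI_SUCCESS != rc) {
            /* Collectively agree on the current set of failed
             * processes, then retry the creation call */
            MPI_Comm_validate_all(comm, &num_failed);
        }
    } while (MPI_SUCCESS != rc);

An implementation is free to do better than this, but the loop conveys the guarantee: either everyone gets a valid newcomm, or no one does.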
>> 
>> 
>> 
>>> 10) Checks for ERR_FAIL_STOP in irecv and isend calls in library examples, but I think earlier we say that errors will not be returned for non-blocking start calls.
>>> 
>>> 
>> 
>> Good catch. I'll fix.
>> 
>> 
>> 
>>> 11) MPI_ERRHANDLER_FREE -- error if freeing a predefined errhandler.  At first, this seemed to be a good idea to me, but there is some inconsistency here, especially if you provide the user a way to compare errhandlers for equality.  For example, what if the errhandler returned by GET_ERRHANDLER is a predefined errhandler?  According to GET_ERRHANDLER, we always need to free the returned handle, but now we're saying we can't free predefined handlers.  It seems ugly to require the user to check whether the handler is a predefined handler before calling ERRHANDLER_FREE.  A similar problem: do we define what happens to MPI_COMM_FREE(MPI_COMM_WORLD)?
>>> 
>>> 
>> 
>> Interesting. I think that since the errhandler returned from Get_errhandler() must be freed, this implies that it is valid to pass predefined error handlers to errhandler_free().
>> So the following is valid (taken from Open MPI source comment, referencing MPI-2 errata):
>> int main() {
>>     MPI_Errhandler errhdl;
>>     MPI_Init(NULL, NULL);
>>     MPI_Comm_get_errhandler(MPI_COMM_WORLD, &errhdl);
>>     MPI_Errhandler_free(&errhdl);
>>     MPI_Finalize();
>>     return 0;
>> }
>> 
>> But what happens if the user frees the handle twice (should we return MPI_ERR_ARG when the ref. count is 1?)
>> int main() {
>>     MPI_Errhandler errhdl;
>>     MPI_Init(NULL, NULL);
>>     MPI_Comm_get_errhandler(MPI_COMM_WORLD, &errhdl);
>>     MPI_Errhandler_free(&errhdl); /* Success */
>>     MPI_Errhandler_free(&errhdl); /* MPI_ERR_ARG */
>>     MPI_Finalize();
>>     return 0;
>> }
>> 
>> So I say we change this to something like: it is valid to pass a predefined error handler to the errhandler free function only if it was previously returned by a call to get_errhandler. The errhandler_free() call will return MPI_ERR_ARG if the handler has already been deallocated, or, in the case of predefined error handlers, if the number of references would be reduced such that the predefined error handler itself would be freed.
>> 
>> 
>> To the best of my knowledge, I don't think that the behavior is clearly defined for MPI_Comm_free(MPI_COMM_WORLD/SELF/NULL) either. Can someone double check that?
>> 
>> 
> I did find this on p 13, lines 18-20:
> 
> "MPI provides certain predefined opaque objects and predefined, static
> handles to these objects.  The user must not free such objects.  In C++,
> this is enforced by declaring the handles to these predefined objects to
> be static const."
> 
> So this says that a user should not call MPI_Comm_free(MPI_COMM_WORLD).
> Similarly, MPI_Errhandler_free(MPI_ERRORS_ARE_FATAL) is not allowed.  On
> the otherhand, users must free objects created by an MPI call.  So I
> take this to mean that users must free a handler object returned by
> GET_ERRHANDLER, regardless of what it represents.  If the handler is set
> to a predefined handle like MPI_ERRORS_ARE_FATAL, it is ok to free the
> returned object.  Page 277, lines 35-38 also specify this:
> 
> "MPI_{COMM,WIN,FILE}_GET_ERRHANDLER behave as if a new error handler
> object is created.  That is, once the error handler is no longer needed,
> MPI_ERRHANDLER_FREE should be called with the error handler returned from
> MPI_ERRHANDLER_GET or MPI_{COMM,WIN,FILE}_GET_ERRHANDLER to mark the
> error handler for deallocation.  This provides behavior similar to that
> of MPI_COMM_GROUP and MPI_GROUP_FREE."
> 
> All of that said, I don't think we need to add the text you have listed
> under MPI_ERRHANDLER_FREE.

I added the following bit of text in the current proposal:
------------
If {{{MPI_ERRHANDLER_FREE}}} is called with a predefined error handler, it returns successfully unless the operation would result in the deallocation of the predefined error handler, in which case it will return with an error code in the class {{{MPI_ERR_ARG}}}.

  ''Rationale'': The error handler returned by {{{MPI_COMM_GET_ERRHANDLER}}} must be freed with a call to {{{MPI_ERRHANDLER_FREE}}}. The error handler returned might have been a predefined error handler, in which case it should be valid to call {{{MPI_ERRHANDLER_FREE}}} on this predefined object.
------------

I did forget to add the example from my earlier email, which might be useful.

But, per your comment, I wonder if we need anything at all. I'll add a discussion point since I would like to bring it up on the call to get a larger number of eyes on it.

> 
>> Additionally, MPI_ERRHANDLER_GET is deprecated, but is used in a few places. So we should fix this (I'll look into a ticket).
>> 
>> 
>> 
>>> 12) MPI_ERRHANDLER_COMPARE -- do we need this?
>>> 
>>> 
>> 
>> Originally we figured that this would be useful for a library to determine whether the currently set error handler is MPI_ERRORS_ARE_FATAL, MPI_ERRORS_RETURN, or something else. Since the error handler is an opaque object, it is not valid for the user to directly compare the values. So if the ability to check this value is important, then I don't see another way to provide this functionality to the user.
>> 
>> 
> Hmm, I see your point, but this still feels strange to me.  I'd like to
> avoid it if we can find a better way.  If we drop it, then the
> application will need some other method to inform the library that the
> error handler is set to be non-fatal.  I don't see a clean way to do
> this, though.

Is your problem with it that it may be difficult to use this function to compare non-predefined error handlers? We have similar functionality for comparison with other opaque objects, like groups - but for groups the use case is pretty well established.


> 
>>> 13) Many of the error classes seem to be a superset of what is needed in this proposal, e.g., MPI_ERR_CANNOT_CONTINUE, MPI_ERR_FN_CANNOT_CONTINUE, etc.  I'd move these to text in a separate proposal.
>>> 
>>> 
>> 
>> There is a discussion point as to the need for the CANNOT_CONTINUE error classes in the document. They are used in the I/O and One-sided chapters so that is why they are in this document, and not in a separate proposal.
>> 
>> 
>> 
>>> 14) When specifying a rank value in the context of an intercommunicator, that rank always refers to a process in the remote group.  Thus, for an intercommunicator, COMM_VALIDATE_CLEAR cannot be used to clear ranks in the local group.  Is there any problem there?  Maybe not since you can't send pt-2-pt to processes in the local group?  What about collectives?
>>> 
>>> 
>> 
>> Interesting point. I'll have to think about this one a bit more. We might need the ability to specify local and remote groups for validation. Let me mull it over a bit more and see if I can come up with some motivating examples for further discussion (unless you have some handy).
>> 
>> My inclination is to say that the rank arguments specify the remote rank. The local rank information can be obtained by accessing and querying the local group returned by MPI_Comm_group. Since you cannot use P2P communication with the local group, locally clearing a failure in the local group is not useful. For collective operations, MPI_Comm_validate will return a total count of just those failures in the remote group.
>> 
>> All in all, there needs to be some clarification here, and probably an example.
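As a starting point for such an example, in pseudocode: the rank arguments to the validate calls would refer to the remote group, while local-group state is reached through the group object itself. MPI_Group_validate() is the proposed group query here, and its argument list is an assumption:

    MPI_Group local_grp;
    int num_failed;

    /* For an intercommunicator, MPI_Comm_group returns the local group */
    MPI_Comm_group(intercomm, &local_grp);
    /* Query the alive/failed state of ranks in the local group */
    MPI_Group_validate(local_grp, &num_failed);
    MPI_Group_free(&local_grp);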
>> 
>> 
>> 
>>> 15) I'd move the timeout discussion to a separate proposal page.  It seems to be orthogonal.
>>> 
>>> 
>> 
>> Yeah, I might re-extract it, and try to move it forward as a separate proposal again.
>> 
>> 
> BTW, I think it's much simpler and should be easier to progress this
> forward.

Yeah, the proposal that it is currently attached to has some rough points (like collective cancel) that need more polishing. Nothing insurmountable, but some work left to do.

I am actually looking for another advocate to help me push it over the goal line since most of my time is devoted to the run-through and recovery proposals. So if you know of anyone ;)

> 
>> 
>> 
>>> 16) For one-sided, why not handle put / gets to recognized failed procs like communication with PROC_NULL?
>>> 
>>> 
>> 
>> Since I was unsure if we want to be able to recognize failure in a window, I just left the interactions as if they were always communicating with an unrecognized failure. If we add the ability to collectively validate the window, then a Put/Get to a recognized failure would behave like the target was MPI_PROC_NULL.
>> 
>> 
> Since we have WIN_VALIDATE calls, you could update the proposal to treat
> a put / get as like a put / get to PROC_NULL.

I added a line to the front matter of that section instead of repeating it for put/get/accumulate:
 "Communication with recognized failed processes in the group associated with the current epoch on the window will have MPI_PROC_NULL semantics."

> 
>> As a side note, Put/Get are meant to be non-blocking, so I modeled them somewhat after MPI_Isend/MPI_Irecv. Unlike the non-blocking P2P operations, I thought it might be useful if Put/Get immediately return an error if they are communicating with a known failed process instead of having to wait for the epoch to finish. But because it is a change from the MPI_Isend/MPI_Irecv semantics, maybe we should change it back. Do others have thoughts on this point?
>> 
>> 
> The one big difference I see is that an error at the epoch boundary
> cannot be associated with an individual put / get operation.  With
> pt-2-pt, we have request and status objects that correspond to each
> communication call.

Yeah. This is really a shortcoming of the existing One-sided interfaces. I added a note in "11.4 Synchronization Calls" highlighting this issue. What we have is really a workaround, because the existing interface does not provide enough information to associate errors with the particular calls (unless those calls immediately return an error, which is not always possible). So instead of fixing the calls, I worked around them - mostly because new RMA interfaces are coming in, and I was hoping that they would address this issue there.

That being said, I have not looked at the new RMA proposal yet to see if we need to push this point or not. Is there anyone that has the time to look at it before the next MPI Forum meeting so we can work more closely with them on this?


Thanks,
Josh

> 
>> 
>> 
>>> And since we can't use MPI_COMM_VALIDATE_ALL on a window, how about something like a MPI_WIN_VALIDATE_ALL instead?
>>> 
>>> 
>> 
>> There is a note in there about adding a MPI_WIN_VALIDATE_ALL function, and if the group would think it useful for an application. It's under the "11.6.3 Window Validation (New Section)" heading.
>> 
>> 
>> Thanks for the feedback, keep it coming.
>> 
>> -- Josh
>> 
>> 
>> 
>> 
>>> _______________________________________________
>>> mpi3-ft mailing list
>>> mpi3-ft at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>> 
>>> 
>>> 
>> 
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>> 
>> 
>> 
>> 
> 
> 

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey




