[Mpi3-ft] Review comments
Joshua Hursey
jjhursey at open-mpi.org
Fri Feb 11 15:39:23 CST 2011
Thanks for the notes. I'm working on adding them to the document, but I may not get it uploaded until Monday. More notes below.
On Feb 8, 2011, at 1:15 PM, Moody, Adam T. wrote:
> Hi Josh,
> Nice work putting all of this together. Here are some comments after reviewing the latest stabilization text.
> -Adam
>
> 1) In 2.4, define terms "unrecognized failure", "recognized failure" and "failed rank" up front. To new readers, later on it's not clear what "failed rank" means.
I agree that this would help.
>
> 2) In 2.8, change "the application" to "a process" (semantics) and "terminates" to "terminated" (typo):
>
> "If the application receives notification of a process failure, the application can be assured that the specified process (identified by rank and generation) is terminates and is no longer participating in the job."
>
> One may misinterpret "the application" to mean "all processes in the application", which is incorrect. Maybe replace this with something like the following:
>
> "When a process receives notification that another process has failed, it may continue under the assumption that the failed process is no longer participating in the job. Even in cases where the failed process may not have terminated, the MPI implementation will ensure that no data is delivered to or from the failed process."
I think that the revision is more precise.
>
> 3) Examples of 3.10
> 3a) In the first example, there is no status object for MPI_Send.
Got it. A few of the other examples in this section have the same problem (cut-and-paste errors). I'll fix them both here and in the User's Document.
> 3b) Some of the ANY_SOURCE examples are non-deterministic, e.g., some of the receive calls could return successfully because they also match the send from rank 3. In this case, the explicit receive for rank 3 will block and the application will not be notified of the rank 2 failure.
Good catch. Since it is for illustration purposes only, I may drop the last recv from peer=3 and then add an example of how MPI_Comm_validate() can be used to figure out that the failed rank was 2 (roughly along the lines of the sketch below).
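A rough sketch of what that revised example for rank 0 might look like. Note that MPI_Comm_validate(), the MPI_Rank_info fields, and MPI_ERR_FAIL_STOP are all from the current draft, so the exact signatures may still change:

#include <stdio.h>
#include <mpi.h>

int main(void) {
    int rank, buf, rc, outcount, i;
    MPI_Status status;
    MPI_Rank_info failed[4];    /* proposed structure from the draft */

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (0 == rank) {
        /* May return an error once the local MPI library is notified
         * that rank 2 failed. */
        rc = MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                      MPI_COMM_WORLD, &status);
        if (MPI_ERR_FAIL_STOP == rc) {
            /* Ask the library which ranks it knows to have failed. */
            MPI_Comm_validate(MPI_COMM_WORLD, 4, &outcount, failed);
            for (i = 0; i < outcount; ++i)
                printf("Rank %d reported as failed\n", failed[i].rank);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}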
> 3c) To again emphasize the local nature of process failure, it would be good to note before the examples that a failure denoted in the code represents the time at which the underlying MPI library of the calling process has been notified of the failure of the specified rank. Or, instead of "/*** Rank 2 failed ***/", change the comment to read "/*** MPI library of Rank 0 is notified that Rank 2 failed ***/" (or something like that).
>
I'll add a note to this effect at the front of the examples section(s). That will help keep the pseudo-ish code from getting too verbose.
> 4) Question about whether to leave receive buffer elements for failed processes "undefined" or "not modified". Allowing them to be undefined permits certain optimizations in MPI; however, it requires the application to keep a list of failed procs, whereas with "not modified" the behavior is more like PROC_NULL today and the app can rely on MPI to keep track of the failed processes (by initializing the receive buffer with NULL data).
Per the MPI Forum FT WG meeting, I'll add a discussion point to the wiki so we can keep talking about it on the calls.
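Just to capture the usage pattern behind the "not modified" option (purely a sketch; which semantic we pick is exactly the open question), the app could do something like:

#include <mpi.h>

int main(void) {
    int rank;
    double val = -1.0;          /* sentinel chosen by the application */
    MPI_Status status;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (0 == rank) {
        /* Under "not modified" semantics, if rank 1 has failed then no
         * data is delivered and the buffer keeps its sentinel value,
         * much like a receive from MPI_PROC_NULL today. */
        MPI_Recv(&val, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
        if (-1.0 == val) {
            /* Failure detected without the app tracking failed ranks. */
        }
    } else if (1 == rank) {
        val = 3.14;
        MPI_Send(&val, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}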
>
> 5) Question about whether to require all processes to continue calling all collectives even when a failure has occurred during an earlier collective.
Currently all processes are not required to continue calling all collectives after a failure has occurred. Such a requirement would run counter to a user's intuition about handling error codes in functions, thus setting an odd precedent in the error handling model. If there is a strong case for why this would be needed then we should consider it further, but I need a bit more convincing. (Maybe you can start a new thread on this topic for further discussion.)
>
> 6) In COMM_VALIDATE, want to set outcount to actual number needed in cases when incount is too small? Then app knows to call again with correct size for incount (unless there is another failure in between, in which case, the app can iterate again).
I'll spin up a separate email thread about this topic for further discussion.
> Suggestion that the two variables could even be combined into a single INOUT.
There is precedent to have them as separate variables (MPI_WAITSOME, MPI_TESTSOME), so that is why we set the interface up like this. I guess I need more convincing that this is a necessary feature of the interface.
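For reference, the retry pattern described in (6) would look roughly like the sketch below (assuming, as suggested, that outcount reports the total number of failed ranks known to the library even when incount is smaller; MPI_Rank_info is the proposed structure):

#include <stdlib.h>
#include <mpi.h>

/* Hypothetical helper: grow the MPI_Rank_info array until it can hold
 * every failure MPI_Comm_validate() reports.  The caller passes
 * *failed == NULL (or a previously allocated array). */
static int get_failed_ranks(MPI_Comm comm, MPI_Rank_info **failed, int *nfailed) {
    int incount = 0, outcount = 0;

    MPI_Comm_validate(comm, 0, &outcount, NULL);   /* query the count */
    while (outcount > incount) {
        /* Grow and retry; a new failure may arrive between calls. */
        incount = outcount;
        *failed = realloc(*failed, incount * sizeof(MPI_Rank_info));
        MPI_Comm_validate(comm, incount, &outcount, *failed);
    }
    *nfailed = outcount;
    return MPI_SUCCESS;
}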
>
> 7) In GROUP_VALIDATE, I take it that the process must list the ranks it wants to know about in the RANK_INFO list. However, do we have the user specify ranks via semi-opaque structures in other MPI functions, or should we just have the user specify ranks in a list of ints and make the RANK_INFO objects pure output?
Ranks are contiguous starting from 0 within the group, so the ranks returned in the MPI_Rank_info objects reference the ranks of the group passed to the function. In other words, the ranks are fixed once the user acquires the group.
Maybe I misunderstood your question.
>
> 8) Do we really want / need to provide group query functions, since users can not clear failures in a group anyway?
This is useful primarily for File Handles and Windows where you can access the group associated with those handles, but not the original communicator used to create them (and by extension when creating sub-groups for One-sided epochs).
So even though the user cannot 'recognize' the failed processes in such a group, it may be enough to know whether each process is active or failed.
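For example, a library holding only a window handle could do something like the following (MPI_Group_validate() and MPI_Rank_info are the proposed interfaces and may change; MPI_Win_get_group() is the existing MPI-2 call):

#include <mpi.h>

/* Hypothetical helper: report how many processes in a window's group
 * are currently known (locally) to have failed. */
static int count_failed_in_window(MPI_Win win) {
    MPI_Group grp;
    MPI_Rank_info failed[16];   /* assumes at most 16 failures of interest */
    int outcount = 0;

    MPI_Win_get_group(win, &grp);                   /* existing MPI-2 call */
    MPI_Group_validate(grp, 16, &outcount, failed); /* proposed call */
    MPI_Group_free(&grp);

    return outcount;            /* > 0 means some participants have failed */
}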
>
> 9) Want to require communicator creation calls to complete successfully everywhere? Can we instead just treat them as normal collectives by surrounding them with comm_validate_all checks?
We need the communicator creation calls to provide strict guarantees because we do not want to return an invalid object at some processes and a valid object at others. It is mostly a usability argument, but also one about the complexity of the interface: if we allow mixed 'valid' and 'invalid' versions of the communicator to exist at the same time, what mechanisms would we provide to the user to resolve that inconsistency?
Internally there are a few ways to implement this. The easiest technique is to surround the old communicator creation call with calls to MPI_Comm_validate_all() and loop on failure (see the sketch below), though there are better-performing ways to implement the operation. By pushing the requirement into the MPI implementation, we should be able to optimize this, but it does add more overhead to communicator creation calls.
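A sketch of that easiest technique, using MPI_Comm_dup() as the creation call (the MPI_Comm_validate_all() signature here is the one from the current draft and may change):

#include <mpi.h>

/* Hypothetical wrapper: duplicate a communicator with agreement that no
 * new failure occurred during the creation call itself. */
static int dup_with_agreement(MPI_Comm comm, MPI_Comm *newcomm) {
    int rc, before = 0, after = 0;

    do {
        MPI_Comm_validate_all(comm, &before);  /* agree on the failed set  */
        rc = MPI_Comm_dup(comm, newcomm);      /* the creation call itself */
        MPI_Comm_validate_all(comm, &after);   /* any failure in between?  */
    } while (MPI_SUCCESS != rc || after != before);

    return MPI_SUCCESS;
}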
>
> 10) Checks for ERR_FAIL_STOP in irecv and isend calls in library examples, but I think earlier we say that errors will not be returned for non-blocking start calls.
Good catch. I'll fix.
>
> 11) MPI_ERRHANDLER_FREE -- error if freeing a predefined errhandler. At first, this seemed to be a good idea to me, but there is some inconsistency here, especially if you provide the user a way to compare errhandlers for equality. For example, what if the errhandler returned by GET_ERRHANDLER is a predefined errhandler? According to GET_ERRHANDLER, we always need to free the returned handle, but now we're saying we can't free predefined handlers. It seems ugly to require the user to check whether the handler is a predefined handler before calling ERRHANDLER_FREE. A similar problem: do we define what happens to MPI_COMM_FREE(MPI_COMM_WORLD)?
Interesting. I think that since the errhandler returned from Get_errhandler() must be freed, this implies that it is valid to pass predefined error handlers to errhandler_free().
So the following is valid (taken from an Open MPI source comment, referencing the MPI-2 errata):

#include <mpi.h>

int main(void) {
    MPI_Errhandler errhdl;
    MPI_Init(NULL, NULL);
    /* The returned (predefined) error handler must be freed by the user. */
    MPI_Comm_get_errhandler(MPI_COMM_WORLD, &errhdl);
    MPI_Errhandler_free(&errhdl);
    MPI_Finalize();
    return 0;
}
But what happens if the user frees the handle twice (should we return MPI_ERR_ARG when the ref. count is 1?)
#include <mpi.h>

int main(void) {
    MPI_Errhandler errhdl;
    MPI_Init(NULL, NULL);
    MPI_Comm_get_errhandler(MPI_COMM_WORLD, &errhdl);
    MPI_Errhandler_free(&errhdl); /* Success */
    MPI_Errhandler_free(&errhdl); /* MPI_ERR_ARG */
    MPI_Finalize();
    return 0;
}
So I say we change this to something like: it is valid to pass a predefined error handler to the errhandler free function only if it was previously returned by a call to get_errhandler. The errhandler_free() call will return MPI_ERR_ARG if the handler has already been deallocated or, in the case of a predefined error handler, if freeing it would reduce the reference count to the point where the predefined error handler itself would be deallocated.
To the best of my knowledge, I don't think that the behavior is clearly defined for MPI_Comm_free(MPI_COMM_WORLD/SELF/NULL) either. Can someone double check that?
Additionally, MPI_ERRHANDLER_GET is deprecated, but it is still used in a few places, so we should fix this (I'll look into a ticket).
>
> 12) MPI_ERRHANDLER_COMPARE -- do we need this?
Originally we figured that this would be useful for a library to determine whether the currently set error handler is MPI_ERRORS_ARE_FATAL, MPI_ERRORS_RETURN, or something else. Since the error handler is an opaque object, it is not valid for the user to compare the handles directly. So if the ability to check this is important, I don't see another way to provide the functionality to the user.
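For example, a library could check the currently set handler like this (assuming the proposed MPI_Errhandler_compare() is modeled on MPI_Comm_compare and reports MPI_IDENT when the two handles refer to the same error handler):

#include <mpi.h>

/* Hypothetical helper: does this communicator currently return errors? */
static int errors_are_returned(MPI_Comm comm) {
    MPI_Errhandler errh;
    int result = MPI_UNEQUAL;

    MPI_Comm_get_errhandler(comm, &errh);
    MPI_Errhandler_compare(errh, MPI_ERRORS_RETURN, &result);  /* proposed */
    MPI_Errhandler_free(&errh);

    return (MPI_IDENT == result);
}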
>
> 13) Many of the error classes seem to be a superset of what is needed in this proposal, e.g., MPI_ERR_CANNOT_CONTINUE, MPI_ERR_FN_CANNOT_CONTINUE, etc. I'd move these to text in a separate proposal.
There is a discussion point as to the need for the CANNOT_CONTINUE error classes in the document. They are used in the I/O and One-sided chapters so that is why they are in this document, and not in a separate proposal.
>
> 14) When specifying a rank value in the context of an intercommunicator, that rank always refers to a process in the remote group. Thus, for an intercommunicator, COMM_VALIDATE_CLEAR cannot be used to clear ranks in the local group. Is there any problem there? Maybe not since you can't send pt-2-pt to processes in the local group? What about collectives?
Interesting point. I'll have to think about this one a bit more. We might need the ability to specify local and remote groups for validation. Let me mull it over a bit more and see if I can come up with some motivating examples for further discussion (unless you have some handy).
My inclination is to say that the rank arguments specify ranks in the remote group. Local rank information can be obtained by accessing and querying the local group returned by MPI_Comm_group. Since you cannot use point-to-point communication with the local group, locally clearing a failure in the local group is not useful. For collective operations, MPI_Comm_validate will return a total count of just those failures in the remote group.
All in all, there needs to be some clarification here, and probably an example.
>
> 15) I'd move the timeout discussion to a separate proposal page. It seems to be orthogonal.
Yeah, I might re-extract it, and try to move it forward as a separate proposal again.
>
> 16) For one-sided, why not handle put / gets to recognized failed procs like communication with PROC_NULL?
Since I was unsure if we want to be able to recognize failure in a window, I just left the interactions as if they were always communicating with an unrecognized failure. If we add the ability to collectively validate the window, then a Put/Get to a recognized failure would behave like the target was MPI_PROC_NULL.
As a side note, Put/Get are meant to be non-blocking, so I modeled them, somewhat, after MPI_Isend/MPI_Irecv. Unlike the non-blocking P2P operations, I thought it might be useful if Put/Get immediately return an error when they are communicating with a known failed process, instead of having to wait for the epoch to finish. But because it is a change from the MPI_Isend/MPI_Irecv semantics, maybe we should change it back. Do others have thoughts on this point?
> And since we can't use MPI_COMM_VALIDATE_ALL on a window, how about something like a MPI_WIN_VALIDATE_ALL instead?
There is a note in there about adding a MPI_WIN_VALIDATE_ALL function, and whether the working group would find it useful for an application. It's under the "11.6.3 Window Validation (New Section)" heading.
Thanks for the feedback, keep it coming.
-- Josh
------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey