[Mpi3-ft] Review comments

Tue Feb 8 12:15:27 CST 2011

Hi Josh,
Nice work putting all of this together.  Here are some comments after reviewing the latest stabilization text.
-Adam

1) In 2.4, define the terms "unrecognized failure", "recognized failure" and "failed rank" up front.  To new readers, it's not clear later on what "failed rank" means.

2)  In 2.8, change "the application" to "a process" (semantics) and "terminates" to "terminated" (typo):

"If the application receives notification of a process failure, the application can be assured that the specified process (identified by rank and generation) is terminates and is no longer participating in the job."

One may misinterpret "the application" to mean "all processes in the application", which is incorrect.  Maybe replace this with something like the following:

"When a process receives notification that another process has failed, it may continue under the assumption that the failed process is no longer participating in the job.  Even in cases where the failed process may not have terminated, the MPI implementation will ensure that no data is delivered to or from the failed process."

3) Examples of 3.10
3a) In the first example, there is no status object for MPI_Send.
3b) Some of the ANY_SOURCE examples are non-deterministic, e.g., some of the receive calls could return successfully because they also match the send from rank 3.  In this case, the explicit receive for rank 3 will block and the application will not be notified of the rank 2 failure.
3c) To again emphasize the local nature of process failure, it would be good to note before the examples that a failure denoted in the code represents the time at which the underlying MPI library of the calling process has been notified of the failure of the specified rank.  Or instead of "/*** Rank 2 failed ***/" change the comment to read "/*** MPI library of Rank 0 is notified that Rank 2 failed ***/" (or something like that).

4) Question about whether to leave receive buffer elements for failed processes "undefined" or "not modified".  Allowing them to be undefined permits certain optimizations in MPI; however, it requires the application to keep a list of failed procs.  With "not modified", this behaves more like PROC_NULL does today, and the app can rely on MPI to keep track of the failed processes (by initializing the receive buffer with NULL data before the call).
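To make the "not modified" option concrete, here is a rough sketch of how an app could use it.  Nothing here is a real MPI call: mock_gather_not_modified is a hypothetical stand-in for a gather that skips the slots of failed ranks, and NO_DATA is an assumed sentinel that must not collide with real payload values.

```c
#include <stdbool.h>

/* Hypothetical sentinel chosen by the app; must never be a valid payload. */
#define NO_DATA (-1)

/* Hypothetical stand-in (NOT an MPI call) for a gather under "not modified"
 * semantics: slots belonging to failed ranks are left untouched. */
static void mock_gather_not_modified(const int contrib[], const bool failed[],
                                     int nprocs, int recvbuf[])
{
    for (int r = 0; r < nprocs; r++) {
        if (!failed[r]) {
            recvbuf[r] = contrib[r];
        }
    }
}

/* Pre-fills recvbuf with the sentinel, runs the mock gather, and returns the
 * number of slots that were left untouched (i.e., ranks that contributed
 * nothing).  The app never consults its own list of failed ranks. */
int gather_and_count_missing(const int contrib[], const bool failed[],
                             int nprocs, int recvbuf[])
{
    for (int r = 0; r < nprocs; r++) {
        recvbuf[r] = NO_DATA;
    }

    mock_gather_not_modified(contrib, failed, nprocs, recvbuf);

    int missing = 0;
    for (int r = 0; r < nprocs; r++) {
        if (recvbuf[r] == NO_DATA) {
            missing++;
        }
    }
    return missing;
}
```

Under "undefined" semantics the final scan would be meaningless, since the slots for failed ranks could hold anything, and the app would have to carry the failed[] information itself.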

5) Question about whether to require all processes to continue calling all collectives even when a failure has occurred during an earlier collective.

6) In COMM_VALIDATE, do we want to set outcount to the actual number needed in cases when incount is too small?  Then the app knows to call again with the correct size for incount (unless there is another failure in between, in which case the app can iterate again).  The two variables could even be combined into a single INOUT.
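The calling pattern I have in mind for this looks roughly like the sketch below.  mock_comm_validate is a hypothetical stand-in, not the proposed interface: I am assuming it fills at most incount entries and always reports the total number of known failures in outcount.  The failed ranks it returns are made-up test data.

```c
#include <assert.h>
#include <stdlib.h>

/* Made-up data for the sketch: ranks the "library" currently knows failed. */
static const int known_failed[] = {2, 5, 7};
static const int num_known_failed = 3;

/* Hypothetical stand-in (NOT the proposed COMM_VALIDATE signature).
 * Copies at most incount failed ranks into ranks[] and always sets
 * *outcount to the total number of failed ranks currently known. */
static int mock_comm_validate(int incount, int *outcount, int ranks[])
{
    int n = incount < num_known_failed ? incount : num_known_failed;
    for (int i = 0; i < n; i++) {
        ranks[i] = known_failed[i];
    }
    *outcount = num_known_failed;
    return 0;
}

/* Returns a malloc'd list of failed ranks and stores its length in *count.
 * Starts with a deliberately small buffer and grows it whenever the query
 * reports that more entries are needed.  Returns NULL (with *count == 0)
 * if allocation fails.  The caller frees the result. */
int *query_failed_ranks(int *count)
{
    assert(count != NULL);

    int capacity = 1;
    int *ranks = malloc((size_t)capacity * sizeof *ranks);
    if (ranks == NULL) {
        *count = 0;
        return NULL;
    }

    for (;;) {
        int needed = 0;
        mock_comm_validate(capacity, &needed, ranks);
        if (needed <= capacity) {
            *count = needed;        /* buffer was large enough; done */
            return ranks;
        }
        /* Too small: resize to the reported size and ask again.  A new
         * failure in between simply causes one more iteration. */
        int *bigger = realloc(ranks, (size_t)needed * sizeof *ranks);
        if (bigger == NULL) {
            free(ranks);
            *count = 0;
            return NULL;
        }
        ranks = bigger;
        capacity = needed;
    }
}
```

With a single INOUT count, the needed and capacity variables would collapse into one, at the cost of the caller having to save the old capacity before each call.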

7) In GROUP_VALIDATE, I take it that the process must list the ranks it wants to know about in the RANK_INFO list.  However, do we have the user specify ranks via semi-opaque structures in other MPI functions, or should we just have the user specify ranks in a list of ints and make the RANK_INFO objects pure output?

8) Do we really want / need to provide group query functions, since users cannot clear failures in a group anyway?

9) Do we want to require communicator creation calls to complete successfully everywhere?  Can we instead just treat them as normal collectives by surrounding them with comm_validate_all checks?

10) The library examples check for ERR_FAIL_STOP in irecv and isend calls, but I think earlier we say that errors will not be returned for non-blocking start calls.

11) MPI_ERRHANDLER_FREE -- error if freeing a predefined errhandler.  At first, this seemed to be a good idea to me, but there is some inconsistency here, especially if you provide the user a way to compare errhandlers for equality.  For example, what if the errhandler returned by GET_ERRHANDLER is a predefined errhandler?  According to GET_ERRHANDLER, we always need to free the returned handle, but now we're saying we can't free predefined handlers.  It seems ugly to require the user to check whether the handler is a predefined handler before calling ERRHANDLER_FREE.  A similar problem: do we define what happens to MPI_COMM_FREE(MPI_COMM_WORLD)?

12) MPI_ERRHANDLER_COMPARE -- do we need this?

13) Many of the error classes seem to be a superset of what is needed in this proposal, e.g., MPI_ERR_CANNOT_CONTINUE, MPI_ERR_FN_CANNOT_CONTINUE, etc.  I'd move these to text in a separate proposal.

14) When specifying a rank value in the context of an intercommunicator, that rank always refers to a process in the remote group.  Thus, for an intercommunicator, COMM_VALIDATE_CLEAR cannot be used to clear ranks in the local group.  Is there any problem there?  Maybe not since you can't send pt-2-pt to processes in the local group?  What about collectives?

15) I'd move the timeout discussion to a separate proposal page.  It seems to be orthogonal.

16) For one-sided, why not handle puts / gets to recognized failed procs like communication with PROC_NULL?  And since we can't use MPI_COMM_VALIDATE_ALL on a window, how about something like an MPI_WIN_VALIDATE_ALL instead?