High Level Items:
-----------------
- We should assess whether an Eventually Perfect failure detector can work in place of our current assumption of a Perfect failure detector. http://dx.doi.org/10.1145/226643.226647
- Many folks did not like the notion that each communication handle carries a local and global state with it. This is something the MPI standard has tried to avoid, and we should consider an alternative means to achieve the goals of the validation operations. It was suggested that we have a 'validation state' object that is returned, can be queried, and is maintained separately from the communication handle it references. The open question is how this object can be used when working with point-to-point and collective routines. A concern about the memory consumption of carrying the local and global state on each communication handle was also brought forward. The notion of a separate validation object seemed to help address this issue.
- The utility of the Group validation operations was brought into question a few times, in two different ways. One side argued that if there are validation operations for the other handles then the group operations serve no purpose, so why are they in the proposal. The other side asked why we do not have only the group validation operations and use those to inspect the true state of the processes (since each handle can give the group associated with it). This spoke to the broader question of why it is important to validate the state of processes in each communication object. Library support was cited, but the forum still felt the validation operations seemed cumbersome to program with.
- MPI_Comm_kill was contentious, at least in part because it offers functionality that seems beyond the scope of the proposal. In particular, it is a mechanism for the application to remove a non-fail-stop process failure (e.g., byzantine, late arrival) from the computation. It was suggested that this be moved to a separate proposal focused on just this type of failure scenario. Additionally, the forum was concerned that the specification was not precise enough as to when this operation should return. For example, should the operation return once all processes know of the kill, or once the local process knows of the failure? This also bears on the type of failure detector that we require - an eventually perfect failure detector would allow it to return once the failure is locally known, versus a perfect failure detector which would imply that everyone should know before it returns. It was also mentioned that we might want to consider restricting the operation to intracommunicators only. If we allowed it over intercommunicators, it is possible that a malicious client could connect to a server in an attempt to kill it. A counterpoint cited MPI_Comm_spawn, where the parents may want to kill a created child. This point should be discussed in the context of the new proposal focused just on MPI_Comm_kill. A concern was also raised about whether MPI_Comm_kill requires progress on other processes in order to complete.
- It was brought forward that MPI_ERR_RANK_FAIL_STOP need not be the first error returned by the MPI implementation, but that once it is returned the MPI implementation must provide the specified semantics. It was concerning to the forum that this state between success and fail-stop is explicitly undefined by the standard, which makes it difficult for a user to program against.
After some discussion it was mentioned that it would be useful to have an error code indicating 'this communication failed, but in the future might not fail' and another error code indicating 'this communication failed and will never succeed.' It was acknowledged that this is a process-failure proposal, not a communication-failure proposal, but that the two issues are linked. As such it should be explicitly mentioned in the proposal that this scenario can happen, and that the MPI standard does not specify the intermediate states at this time. We have recently talked about this in the working group, and maybe we should increase its priority. It was suggested that an implementation may be able to provide only the success-to-fail-stop transition if it held the error code until it determined that the process is truly fail-stop - retrying the communication until certain.
- It was suggested that a simpler approach to process failures in communication handles might be worth exploring: something that would disallow (either collective or any) communication on a communicator that includes a failed process. This would force the application to always use a dense communicator that does not contain failed processes, and to raise an error otherwise. It was argued that this may be cumbersome for the programmer, since ranks would change when recreating the dense communicator and this would impact data structures that depend on the rank value. However, this concept might be a simpler, baseline approach to consider. (A minimal sketch of this pattern, built from existing group and communicator calls, appears at the end of these notes.)
- Programmability was brought forward as a point of concern. It was mentioned that more examples and application kernels are needed to help educate users (and developers) about how the fault tolerance features should and should not be used in real applications. The collectives examples in the current Run-Through Stabilization document are extremely helpful, but more examples are needed to help support users experimenting in this new dimension of MPI programming. This also speaks, in part, to establishing best practice, discussed below.
- Finally, the big discussion point at the end was whether application-level fault tolerance in MPI is ready to be standardized. The argument is that the MPI standard should only standardize the 'current best practice' in the field. Since MPI implementations do not provide FT semantics/interfaces, and no applications are using these undefined FT semantics/interfaces, how can the MPI Forum be confident that this proposal represents the 'current best practice' for MPI? FT-MPI was cited as both an early implementation and something that some folks have picked up to experiment with. However, the current Run-Through Stabilization proposal is different from the FT-MPI semantics/interfaces. Additionally, it was not felt that FT-MPI established best practice, but rather provided one data point. We discussed that best practice in MPI is a chicken-and-egg problem, in which the MPI implementors ask the application groups what they would like to have specified, and the application groups ask what MPI can provide. One side must bring to the table a proposal in good faith to start the feedback cycle that would eventually lead to establishing a 'current best practice' for MPI. The Run-Through Stabilization proposal represented this first good-faith effort, and applications are just starting to experiment with prototypes of it now. The problem voiced was how to identify when best practice has been established, coupled with the desire not to wait a long time to standardize such functionality.
It was mentioned that since the HPC community has grown into environments where fault tolerance is a top priority, possibly even higher than scalability or performance, some urgency should be recognized. Further, fault tolerance has been cited as an issue both for large exascale systems and for long-running, mission-critical systems at smaller scales. It was mentioned that it seems as if we need a living document that the MPI Forum blesses, which we can provide to applications to try out and provide feedback on. The blessing of the MPI Forum holds a lot of weight when working with application groups, and helps reinforce that their feedback will be accounted for in the eventually standardized document. The feeling was that the spirit of the Run-Through Stabilization proposal was on target, but some would like to see applications significantly using the proposal before feeling confident about voting on it for inclusion into the MPI standard. It was reinforced that many of the concepts present in the Run-Through Stabilization proposal are well-established 'best practice' in other programming models (e.g., fault-tolerant consensus as a building block). Some of these environments hide these building blocks and practices, while others expose them. It was noted that since the MPI model is different from many of these other models, the best practice should be further reinforced by application use cases in MPI.

Other Assorted Items:
---------------------
- General question about performance in a range of network environments (since we only showed numbers for shared memory, and currently support only TCP).
- 2.2: This section limits the length of MPI identifiers to 30 characters or less. Some of the validate functions are more than 30 characters and should be fixed.
- 8.2: In the table with MPI_ERR_RANK_FAIL_STOP, change 'is failed' to 'has failed'.
- 8.2: Suggestion to change MPI_ERR_RANK_FAILED to MPI_ERR_PROC_FAILED. Which is 'better': to reference the rank in the communication handle, or the general concept of the process?
- 8.3: 'in the associated communicator' does not account for error handlers on windows or file handles, so change to something like 'in the associated communication handle'.
- 8.3: Use 'associated' instead of 'specified'.
- 8.3: There was concern about the implication of this change versus how people conventionally think about this error handler. Users think that MPI_ERRORS_ARE_FATAL means the whole job aborts regardless. The Open MPI and MPICH implementations explicitly rely on the second sentence of that error handler's description to mean that only the subset is aborted, though they typically/always abort the whole job. This clarification is meant to help such users recognize that interpreting MPI_ERRORS_ARE_FATAL as always job-scoped is not portable. It was suggested that MPI_ERRORS_ARE_FATAL be defined as job-scoped, and that another error handler, MPI_ERRORS_ARE_FATAL_TO_GROUP, be created with the standard behavior. A straw vote was taken on whether the proposed clarification was sufficient (i.e., do -not- add a new error handler), resulting in a 15 (yes) / 1 (no) / 0 (abstain) vote.
- 8.3: What happens if an error is returned from a function that does not have an associated communication object (e.g., datatype creation)?
- 8.3.4: MPI_Errhandler_compare(): 'exactly the same' is too ambiguous. Does this mean that the handles are the same, or the contents of the handles?
Suggest having 'IDENT' mean same handle, 'SIMILAR' mean same error handler function but different handle, and 'UNEQUAL' otherwise. Suggestion to make this a separate ticket.
- 8.7: Rank 0 and MPI_Finalize. Lots of concern about including the notion of MPI_Finalize succeeding even if a rank (in this case rank 0) has failed. Suggested changing 'unless it is failed' to something like 'assuming failure free execution' or 'assuming no processes fail'. Alternative suggestion of 'beware that when considering a fault tolerant application, rank 0 might be dead, in which case the result of finalize is defined in 17.6.2.'
- 8.7: Rank 0 and MPI_Finalize. 'Advice to users': fix 'on or after the call the'.
- 8.7: Rank 0 and MPI_Finalize. 'Advice to users': after MPI_Finalize in the first sentence, add something like 'but all processes have not been aborted'.
- 8.7: Rank 0 and MPI_Finalize. 'Advice to users': change the second sentence to something like 'failure detection only happens up to the point of MPI_Finalize' or 'MPI can only detect failure up to the point of MPI_Finalize' or 'no support for fault tolerance after MPI_Finalize'. Add a reference to the FT chapter. Suggested that we move the note to the FT chapter with a backward reference to this section. After the first sentence add something like 'and without aborting all the rest of the processes as described in Section 2.8.'
- Example 8.7: Add a parenthetical about the warning in the advice to users, something like 'assuming the above situation does not happen'.
- 8.7: 'MPI_Abort' cannot be called over a window or file handle. What's up with that?
- 16.3.6: Will need to add f082c and c2f08 commands when the new Fortran 2008 work comes into the standard.
- 17.1: Clarify 'Such applications' in the second paragraph. Should be 'Process fault tolerant applications'.
- 17.1: Clarify that the user only needs to change the default error handler on the communication objects to which they wish to have these fault tolerance options applied, and that the default error handler will automatically result in the termination of the communication group associated with the handle.
- 17.2: Separate the state descriptions from the terminology presentation (e.g., unrecognized/recognized failed process).
- 17.2: Definitions did not seem instructive. Some were too academic (error, fault, failure) while others were too informal (alive process). Suggest redefining them in a way that is easily understandable in the specific context of MPI.
- 17.2: Suggest pulling out 'fault' into its own definition.
- 17.2: In the definition of 'fail-stop', remove 'often due to a component crash'.
- 17.2: In the definition of 'fail-stop', possible rephrasing into 'can no longer be communicated with' or 'no longer participates in MPI operations'.
- 17.2: In the definitions of 'process failure', the word 'stop' seems imprecise. For 'stop' we may want something like 'stops communicating or responding.' We may want to specify for 'transient' that it is a correct process, to differentiate it from byzantine failure.
- 17.2: Suggest moving the definition of 'transient process failure' to the explanation of the failure detector where it is used.
- 17.2: For 'alive process', remove 'normal' or define it as 'not failed'.
- 17.2: For unrecognized/recognized, add a reference to 'with the validate function' and a forward reference.
- 17.2: In the definition of 'collectively inactive', should 'and/or' be just 'and'? Or just 'contains some globally unrecognized failed process'?
- 17.2: For 'collectively active', add the notion that failures need to be recognized using the validate_all command.
Note that communicators are 'active' when created, and thus do not need to be 'activated' before using collectives unless there is a failure.
- 17.2: For all of the state references (e.g., MPI_RANK_STATE_FAILED), add forward references or remove the reference.
- 17.3: The 'strong completeness' sentence confused folks with 'will be able to be known to all processes'. Look this up again and go with the established wording if different. Suggest 'all alive processes will eventually be able to know of any failed processes.' Something a bit more active in the wording was suggested.
- 17.3: 'Rationale': replace 'to deadlock situations' with 'to deadlocks' in the last sentence.
- 17.3: 'Advice to users' seems to be targeted more toward an implementor; maybe look at rewording, since this is a warning to users.
- 17.3: 'Advice to implementors': replace 'where able' with 'if able' or 'if possible' in the last sentence.
- 17.4: Mention that the state of a process is bound (relative) to the object from which it was queried, and that the state is referenced using an MPI_Rank_info object. Add some rationale about how this can help layered libraries.
- 17.4: A small state transition diagram would be helpful.
- 17.4: Add rationale about the NULL state. Some applications will find the MPI_PROC_NULL semantics useful.
- 17.4: It might be useful to distinguish here what the application knows versus what the MPI library knows. For some time the MPI library could be returning errors while the MPI_Rank_info indicates that the rank is OK - the fact that it is failed is hidden until the user calls a validate operation to update the local/global list.
- 17.4.1: Suggest changing MPI_Rank_info into MPI_Rank_state, since there is already an MPI_Info.
- 17.4.1: Typo on page 539 line 43 - "indicates the state of process in the associated..." -> "indicates the state of the process in the associated..."
- 17.5: Clarify that separate validation functions for groups, communicators, windows, and file handles are necessary because querying just for the group provides a separate object related to, but not influencing, the original object. They become disjoint handles.
- 17.5: Concern that mutable state is carried on each communication handle.
- 17.5: General concern about the memory requirements of the state tracking.
- 17.5: Some confusion about how this interface would be used in a threaded environment. Make sure that we clarify that the user is responsible for synchronization. Noted that using MPI_*_validate_all causes changes to queries using MPI_*_validate_get (so a global update affects the local list).
- 17.5: Question about why we do not just have the _get_state_rank function; why do we need the grouped _get_state operation? The reason is for scenarios where you only want to know the failed processes: this provides a single lookup instead of a linear search. This should be noted in a rationale.
- 17.5: Clarify that the local list is local to the process and not just to a thread.
- 17.5: It would be useful to clarify that local list management is useful for point-to-point heavy applications, while global list management is useful for collective heavy applications. Most applications lie in between and should use an appropriate combination. (A hedged sketch of this usage pattern follows this group of items.)
- 17.5: It may be nice to add an advice to users noting that querying for the state of all processes in a large communicator could result in large memory consumption. Users should be aware.
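To make the point-to-point versus collective usage pattern above more concrete, here is a minimal sketch. The names MPI_Comm_validate, MPI_Comm_validate_all, and MPI_ERR_RANK_FAIL_STOP are taken from the proposal, but the prototypes and the placeholder error-class value below are assumptions made only for illustration; the communicator is assumed to have MPI_ERRORS_RETURN set so that errors are returned rather than being fatal.

    #include <mpi.h>

    /* Assumed prototypes from the Run-Through Stabilization proposal;
     * the actual argument lists may differ. */
    int MPI_Comm_validate(MPI_Comm comm, int *num_failed);      /* update the local list */
    int MPI_Comm_validate_all(MPI_Comm comm, int *num_failed);  /* global recognition    */

    #ifndef MPI_ERR_RANK_FAIL_STOP
    #define MPI_ERR_RANK_FAIL_STOP (MPI_ERR_LASTCODE + 1)       /* placeholder value     */
    #endif

    /* Point-to-point heavy phase: recognize a peer failure locally and
     * keep exchanging with the partners that are still alive. */
    void p2p_phase(MPI_Comm comm, int peer, double *buf, int n)
    {
        int rc = MPI_Recv(buf, n, MPI_DOUBLE, peer, 0, comm, MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) {
            int eclass, num_failed;
            MPI_Error_class(rc, &eclass);
            if (eclass == MPI_ERR_RANK_FAIL_STOP) {
                /* Local recognition only; no collective call is needed here.
                 * In a threaded program the user is responsible for
                 * synchronizing access to this per-handle state. */
                MPI_Comm_validate(comm, &num_failed);
            }
        }
    }

    /* Collective heavy phase: collectively recognize failures so that the
     * communicator is collectively active again, then resume collectives.
     * Note that this also updates the local list used by the queries. */
    void collective_phase(MPI_Comm comm, double *buf, int n)
    {
        int num_failed;
        MPI_Comm_validate_all(comm, &num_failed);
        MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE, MPI_SUM, comm);
    }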
- 17.5.1: It was unclear if the group operations were needed at all.
- 17.5.1: Clarify that groups do not have a notion of recognized failed processes: since they are never used for communication, recognized failed processes do not make sense for them. So note that there is no capability to move ranks in a group into the NULL state.
- 17.5.1: In MPI_Group_validate_get_state there is an inaccurate back reference to 17.4.1. Both Sections 17.4.1 and 17.4 should be referenced, since one deals with state and the other with the MPI_Rank_info object.
- 17.5.1: Clarify that accessing the state on a group does not affect the state on the communicator it is associated with, since as soon as the group is asked for, it is a separate object.
- 17.5.2: Add a new MPI_*_validate_set_state_null that just takes a mask. This removes the need to allocate and manage the 'rank_infos' array if the user never intends to access it, and makes it easy to just NULL'ify all of the failed ranks in a communication handle.
- 17.5.2: Make sure we clarify (I think we already do) that MPI_*_ivalidate cannot be canceled.
- 17.5.2: Confirm that we only disallow collectives from overlapping a validate on the same communicator, so collectives on another communicator are unaffected.
- 17.6.2: 'Advice to implementors' for mpiexec: We should explicitly say that a 'good quality implementation' should do the following if able. We could clarify 'exit code' - we meant it in the C sense, but should be clear. Per Example 8.7 it is possible for rank 0 to be dead and for no other ranks to return from MPI_Finalize (rank 0 is the only one that is guaranteed to return, if alive); we should note what the user should expect in this case. There is also a question as to the reason for 'the lowest rank return code' versus 'any return code from an alive process'. There was also the question of whether we should take the return code from the first (instead of the last) MPI_Abort call. It was clarified that choosing the 'last' MPI_Abort call was intended to support applications that recover using MPI_Abort and then at some point later call MPI_Abort to terminate the job. This clarification should go into the discussion.
- 17.7: Paragraph 1, sentence 1 conflicts with the last sentence of paragraph 2. This should be fixed.
- 17.7: It was mentioned that it would be useful to explicitly state that point-to-point operations should not hang in the presence of errors; they will eventually return with either success or some error.
- 17.8: It was noted that in paragraph 2 the first sentence is a little unclear. It would be useful to restate the counter of it first, and keep both sentences. Unfortunately, I cannot pinpoint the sentence or suggestion from my notes.
- 17.8: We should be clearer about the participation of recognized failures in collectives such as gather. Maybe as an advice to implementors.
- 17.8: Clarify that the ordering of reduce operations may change across ranks after collective recognition; for commutative operations, when the tree is rebalanced the ordering changes. This is not guaranteed by the standard, but it is recommended that implementations not change the order between calls. So, with a reference to Section 5.9.1, we should note this in the FT chapter.

Suggestions for items to be considered as separate tickets:
-----------------------------------------------------------
- 8.3.4: MPI_Errhandler_compare()
- 17.6.3: MPI_Comm_kill
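As a follow-up to the 'dense communicator' alternative raised under the High Level Items, the following minimal sketch shows how such a communicator could be rebuilt from the survivors using only existing group and communicator calls. The failed_ranks array is an assumed input from whatever failure detection or validation mechanism is ultimately specified, and MPI_Comm_create is itself collective over the original communicator, which is exactly where fault-tolerance support would be required. It also illustrates the rank renumbering concern raised in that discussion.

    #include <mpi.h>

    /* Build a dense communicator that excludes the given failed ranks.
     * failed_ranks/nfailed are assumed inputs from the (not yet specified)
     * failure detection or validation mechanism. */
    MPI_Comm shrink_comm(MPI_Comm comm, int *failed_ranks, int nfailed)
    {
        MPI_Group old_group, new_group;
        MPI_Comm  new_comm;

        MPI_Comm_group(comm, &old_group);
        /* Remove the failed ranks from the group of the old communicator. */
        MPI_Group_excl(old_group, nfailed, failed_ranks, &new_group);
        /* Derive a communicator over the surviving processes.  Caveat: this
         * call is collective over 'comm', so in a fault-tolerant setting it
         * must itself be able to complete despite the failed members. */
        MPI_Comm_create(comm, new_group, &new_comm);

        MPI_Group_free(&old_group);
        MPI_Group_free(&new_group);
        /* Ranks are renumbered 0..(new size - 1) in new_comm, which is the
         * data-structure impact noted in the discussion. */
        return new_comm;
    }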