Index: ft.tex =================================================================== --- ft.tex (revision 907) +++ ft.tex (working copy) @@ -89,7 +89,7 @@ Strong completeness is defined as: ``Eventually every process that crashes is permanently suspected by every correct process''~\cite{Chandra_ft_1996}. In essence this means that eventually every failed process will be known to all alive processes. -Without strong completeness communication operations with a failed process may not complete with an error, so it is possible that a process communicating with a failed process may wait indefinitely in, e.g., a blocking receive operation. +Without strong completeness communication operations with a failed process may not complete with an error, so it is possible that an alive process communicating with a failed process may wait indefinitely in, for example, a blocking receive operation. Eventual strong accuracy is defined as: ``There is a time after which correct processes are not suspected by any correct process''~\cite{Chandra_ft_1996}. @@ -121,10 +121,10 @@ \section{Querying for Failed Processes} \label{sec:query-fail-proc} -At each process, the MPI implementation keeps track of failed +The MPI implementation keeps track of failed processes. Query functions are provided to allow the user to determine which processes associated with a specific communicator, -file or window have failed. These functions return a group comprising +window or file handle have failed. These functions return a group comprising the failed processes. \subsection{Communicators} @@ -169,7 +169,7 @@ \subsection{Files} The following function returns the group of failed processes -associated with a file. +associated with a file handle. 
\begin{funcdef}{MPI\_FILE\_GET\_GROUP\_FAILED(fh, failed)} \funcarg{\IN}{fh}{file handle (handle)} @@ -188,7 +188,7 @@ \exindex{MPI\_COMM\_GROUP\_FAILED} \exindex{MPI\_GROUP\_TRANSLATE\_RANKS} \exindex{Determine whether a process has failed} -Determine whether process with rank 5 has failed in the communicator. +Determine whether the process with rank 5 has failed in the communicator. %%HEADER %%LANG: C %%FRAGMENT @@ -322,13 +322,7 @@ These semantics allow a process to continue running without being interrupted by the failure of processes with which they may never or rarely communicate. \end{rationale} -\begin{users} -A newly created communicator inherits the error handler that is associated with the parent communicator. -Libraries should take care to set the error handler appropriately for their library directly after communicator creation. -This allows the library to have its own error handler behavior separate from the calling process. -\end{users} - %---------------------------------------- \subsection{Error Codes and Classes} \label{sec:ft-env:error-codes} @@ -541,8 +535,8 @@ \par If one or both of the processes fail during either \mpifunc{MPI\_SENDRECV} or \mpifunc{MPI\_SENDRECV\_REPLACE} the function will return \const{MPI\_ERR\_IN\_STATUS}. -If one process failed then this process will be identified in the \mpiarg{status}. -If both processes fail then only one of the processes will be identified in the \mpiarg{status}. +If one process failed then the rank associated with this process will be identified in the \mpiarg{status}. +If both processes fail then only one of the ranks associated with the processes will be identified in the \mpiarg{status}. The query functions defined in Section~\ref{sec:query-fail-proc} can be used to determine the state of the other process. If an error handler function is registered to the communicator then it will be called only once for the operation regardless of the number of failed processes. 
@@ -618,7 +612,7 @@ \begin{rationale} One option considered was to change the \MPI/ collective semantics to disallow leave early semantics and implement an agreement algorithm at the end of every collective operation. -This would allow all processes to receive consistent return values. +This would allow all processes to receive uniformly consistent return values. Due to the considerable overhead implications of this option, it was decided to allow for the looser consistency model to minimize the performance impact of the fault tolerance code path, and to provide the agreement protocol as a separate operation (e.g., \mpifunc{MPI\_COMM\_VALIDATE} described in Section~\ref{sec:ft-coll:validate}). \end{rationale} @@ -630,7 +624,7 @@ \begin{rationale} -Calling a function to collectively validate a communicator gives the \MPI/ implementation an opportunity to restructure collective communication patterns before the communicator is used by the alive process. +The \mpifunc{MPI\_COMM\_VALIDATE} and \mpifunc{MPI\_ICOMM\_VALIDATE} operations provide the \MPI/ implementation an opportunity to restructure collective communication patterns before the communicator is used by the alive process. Without this requirement the \MPI/ implementation may need to determine which processes in the communicator are alive and which are failed for every collective operation. This results in performance restrictive semantics for every collective call. The collective validate operation allows the \MPI/ library to trust the agreed upon set of communication patterns for the collectives reducing the impact of the fault tolerance logic on failure-free collective performance. @@ -655,6 +649,14 @@ \mpifunc{MPI\_COMM\_VALIDATE} will either provide the same group of failed processes in \mpiarg{failed} to every process or will return an error at every process. 
All collective communication operations initiated before the call to \mpifunc{MPI\_COMM\_VALIDATE} must also complete before it is called, and no collective calls may be initiated until it has completed. +\begin{rationale} +The \mpifunc{MPI\_COMM\_VALIDATE} and \mpifunc{MPI\_ICOMM\_VALIDATE} operations provide the \MPI/ implementation an opportunity to restructure collective communication patterns before the communicator is used by the alive process. +This may allow for improved collective performance after process failure. +It should be noted that such optimizations might change the consistency recommendation for \mpifunc{MPI\_REDUCE} in the advice to implementors in Section~\ref{subsec:coll-reduce}. +It is strongly recommended that the consistency recommendation hold for \mpifunc{MPI\_REDUCE} between consecutive collective activations of a communicator using a collective validation operation (e.g., \mpifunc{MPI\_COMM\_VALIDATE}). +\end{rationale} + + \begin{funcdef}{MPI\_ICOMM\_VALIDATE(comm, failed, req)} \funcarg{\IN}{comm}{communicator (handle)} \funcarg{\OUT}{failed}{group of failed processes (handle)} @@ -1004,6 +1006,12 @@ \par In the presence of process failures, the communicator construction operations must ensure that the communicator is either created successfully at all participating processes; or not created, and all participating processes return some error. +\begin{users} +A newly created communicator inherits the error handler that is associated with the parent communicator. +Libraries should take care to set the error handler appropriately for their library directly after communicator creation. +This allows the library to have its own error handler behavior separate from the calling process. +\end{users} + \begin{implementors} The uniform creation of the communicator handle semantic constraint is similar to the constraint on \mpifunc{MPI\_COMM\_VALIDATE}. 
In fact, an implementation can wrap existing communicator creation functions in a recovery block loop bound by \mpifunc{MPI\_COMM\_VALIDATE} operations to achieve the necessary semantic constraint. @@ -1120,7 +1128,7 @@ Setting the \mpiarg{root} argument in the accept and connect operations to the rank of a failed process will raise an error of the class \const{MPI\_ERR\_RANK}. \par -\mpifunc{MPI\_COMM\_DISCONNECT} will complete normally even in the presence of process failures, regardless of when the process failure occurs or if the process failure is recognized. +\mpifunc{MPI\_COMM\_DISCONNECT} will complete normally even in the presence of process failures. \par In the case of an error returned from \mpifunc{MPI\_COMM\_JOIN}, the state of the associated socket file descriptor (\mpiarg{fd}) is undefined. @@ -1178,7 +1186,7 @@ In the presence of process failures, the \mpifunc{MPI\_WIN\_CREATE} operation must ensure that the window is either created successfully at all participating processes; or not created, and all participating processes return some error. \par -\mpifunc{MPI\_WIN\_FREE} will complete normally even in the presence of process failures, regardless of when the process failure occurs. +\mpifunc{MPI\_WIN\_FREE} will complete normally even in the presence of process failures. \par One-sided communication (e.g., \mpifunc{MPI\_PUT}, \mpifunc{MPI\_GET}) with failed processes will return \const{MPI\_ERR\_PROC\_FAIL\_STOP}. @@ -1194,7 +1202,7 @@ \label{sec:ft-onesided:validate} \begin{rationale} -Since the communicator associated with the window cannot be accessed after window creation and since groups cannot be used for communication it is necessary to defined a validation operation specific to windows in addition to communicators (see Section~\ref{sec:ft-coll}). 
+Since the communicator associated with the window cannot be accessed after window creation and since groups cannot be used for communication it is necessary to define a validation operation specific to windows in addition to communicators (see Section~\ref{sec:ft-coll}). \end{rationale} \begin{funcdef}{MPI\_WIN\_VALIDATE(win, failed)} @@ -1288,7 +1296,7 @@ \begin{users} The state of the external file must be determined by the application (e.g., Did a failed process finish writing/reading/syncing before failing?). The application may be able to use the \mpifunc{MPI\_FILE\_READ\_AT} operation to determine the state of the file. -The collective validate operations (e.g., \mpifunc{MPI\_FILE\_VALIDATE} described in Section~\ref{sec:ft-io:validate}) help to ensure buffers are fully flushed to disk. +The collective validate operations for file handles (e.g., \mpifunc{MPI\_FILE\_VALIDATE} described in Section~\ref{sec:ft-io:validate}) help to ensure buffers are fully flushed to disk. \end{users} %\begin{implementors} @@ -1300,7 +1308,7 @@ \label{sec:ft-io:validate} \begin{rationale} -Since the communicator associated with the file handle cannot be accessed after creation and since groups cannot be used for communication it is necessary to defined a validation operation specific to file handles in addition to communicators (see Section~\ref{sec:ft-coll}). +Since the communicator associated with the file handle cannot be accessed after creation and since groups cannot be used for communication it is necessary to define a validation operation specific to file handles in addition to communicators (see Section~\ref{sec:ft-coll}). \end{rationale} \par @@ -1320,6 +1328,11 @@ \mpifunc{MPI\_FILE\_VALIDATE} will either provide the same group of failed processes in \mpiarg{failed} to every process or will return an error at every process. 
All collective communication operations initiated before the call to \mpifunc{MPI\_FILE\_VALIDATE} must also complete before it is called, and no collective calls may be initiated until it has completed. +\begin{users} +The \mpifunc{MPI\_FILE\_VALIDATE} operation is similar to the ``sync-barrier-sync'' construct from the example in Section~\ref{sec:io-consistency-examples}. +The file is synchronized to disk, and all failures are recognized in the associated group. +\end{users} + \par \begin{funcdef}{MPI\_IFILE\_VALIDATE(fh, failed, req)} \funcarg{\IN}{fh}{file handle (handle)} @@ -1369,9 +1382,9 @@ \par In the presence of process failures, the \mpifunc{MPI\_FILE\_OPEN} operation must ensure that the file handle is either created successfully at all participating processes; or not created, and all participating processes return some error. -\begin{users} -Opening a file with recognized failed processes may be useful for an application to dump state before terminating the application. -\end{users} +%\begin{users} +%Opening a file with recognized failed processes may be useful for an application to dump state before terminating the application. +%\end{users} \begin{implementors} The \mpiarg{info} argument to the \mpifunc{MPI\_FILE\_OPEN} operation may be used to modify the fault tolerance semantics of the operation.