[Mpi3-ft] run through stabilization user-guide

Joshua Hursey jjhursey at open-mpi.org
Tue Feb 8 08:47:02 CST 2011


Thanks for your feedback. I'll fix the first example, thanks for the catch. A bit more discussion on the other points below.

On Feb 8, 2011, at 8:14 AM, Toon Knapen wrote:

> The spec explicitly doesn’t define the meaning of failure because it is a very low-level concept. The MPI implementation is responsible for detecting failures and defining what qualifies as one. The only guarantee that you can rely on is that if the “failure” is bad enough that a given MPI rank can’t communicate with others, then MPI will have to either abort the application completely or eventually report this failure via the FT API.
> I understand it is really hard to define what exactly a failure is. But if the future standard does not describe it, isn't there a risk that apps that rely on the FT API will become hard to port from one implementation to another?
> For instance, suppose I have one process constantly sending messages to the other one. Suppose the second process is dead; the first process might then either detect that the second process is down and inform the app, or it might not tell the app anything and buffer all messages to be sent. But this buffer might grow so big that the first process will run out of memory and die too.
> So I think it would be useful if the MPI library were given some bounds within which it should function, and if it is not able to function within these bounds it should raise an error.
> For instance a simple limit might be that each node should respond to a ping within 1s (or 1ms. or ...). If a node can not be pinged (within this limit), the node is considered to be dead.
> Another bound that might be specified and that would serve my example above is e.g. the memory bounds within which MPI should function. For instance, the app (or user) might decide to allocate 100 MB of (buffer-) memory to MPI. If the app fills this buffer and the MPI-lib is not able to flush the data sufficiently fast to the other nodes, an error will be raised. At that point the app is aware that there is a failure to communicate and can take appropriate action: slow down a bit with the sending or consider the other node to have failed.
> In the above, it is the MPI lib that always detects the error. However if the MPI library does not guarantee what a failure is exactly, I might have to detect failures in the app (to be independent of the free interpretation of failure). In that case I might e.g. need to always do non-blocking sends and recvs (to avoid being blocked until eternity in case of a failure) and see if the messages arrive in a timely manner, if not I consider the other process to be dead. In that case however, I would also need functionality to tell to the MPI-lib that a specific (comm,rank) should be considered MPI_RANK_STATE_NULL.
> I'm sorry if the above has been discussed at length already and the forum already decided not to define 'failure' (I shouldn't have left the MPI-scene for the last three years, what was I thinking ;-). Trying to define 'failure' might be opening Pandora's box but I'm looking at this from the application point-of-view. IMHO FT is either just about 'going down gracefully' or about 'trying to finish the job'.

It is important to clarify what we can and cannot specify in the standard. So I appreciate your help with explaining how you think the language can be more precise or better explained.

For this proposal, we are focused on graceful degradation of the job. Processes will fail(-stop), but the job as a whole is allowed the opportunity to continue operating.

We are defining process failures as fail-stop failures in the standard. So we cover processes that crash, and do not continue to participate in the parallel environment. How the MPI implementation detects process failure is not defined by the standard, except by the properties that the MPI implementation must provide to the application. The MPI implementation will provide a view of the failure detector that is 'perfect' from the perspective of the application (though internally there is a fair amount of flexibility on how to provide this guarantee).

This means that eventually all processes will know of a process failure, and that if the application receives notification of a process failure then that process is guaranteed to have fail-stopped. Once an alive process is notified of a peer process failure, any communication involving the failed process will complete with an error, so there is no worry that you will block indefinitely on communication with a failed peer. The MPI library should also be cleaning up internal buffers, etc. at this time to conserve memory. Communication with non-failed processes will complete as normal. Collectives are disabled until the application re-enables them with the collective MPI_Comm_validate_all().
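To make those guarantees concrete, here is a small plain-C model of one alive process's local view of a communicator. It is only a sketch and does not call MPI: every name in it (model_comm, model_notify_failure, model_p2p, model_collective, model_validate_all) is hypothetical, and model_validate_all is a stand-in for MPI_Comm_validate_all(), whose real signature is not given here. It also assumes that point-to-point operations with a failed peer keep returning an error after validation, and it skips the period before notification during which a real operation would simply stay pending.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NPROCS 4

enum model_status {
    MODEL_OK = 0,
    MODEL_ERR_PEER_FAILED = 1,
    MODEL_ERR_COLLECTIVES_DISABLED = 2
};

/* One alive process's local view of a communicator.
 * Hypothetical model type, not an MPI type. */
struct model_comm {
    bool known_failed[MODEL_NPROCS]; /* peers we have been notified about */
    bool collectives_enabled;
};

void model_comm_init(struct model_comm *c) {
    for (int i = 0; i < MODEL_NPROCS; i++) {
        c->known_failed[i] = false;
    }
    c->collectives_enabled = true;
}

/* The library notifies this process that `peer` has fail-stopped.
 * Collectives are disabled until the application validates. */
void model_notify_failure(struct model_comm *c, int peer) {
    assert(peer >= 0 && peer < MODEL_NPROCS);
    c->known_failed[peer] = true;
    c->collectives_enabled = false;
}

/* Point-to-point with a known-failed peer completes with an error;
 * point-to-point with any other peer completes as normal. */
enum model_status model_p2p(const struct model_comm *c, int peer) {
    assert(peer >= 0 && peer < MODEL_NPROCS);
    return c->known_failed[peer] ? MODEL_ERR_PEER_FAILED : MODEL_OK;
}

/* Collectives fail while they are disabled. */
enum model_status model_collective(const struct model_comm *c) {
    return c->collectives_enabled ? MODEL_OK
                                  : MODEL_ERR_COLLECTIVES_DISABLED;
}

/* Stand-in for the collective validate call: re-enables collectives
 * and returns how many failed peers are now acknowledged. */
int model_validate_all(struct model_comm *c) {
    int nfailed = 0;
    for (int i = 0; i < MODEL_NPROCS; i++) {
        if (c->known_failed[i]) {
            nfailed++;
        }
    }
    c->collectives_enabled = true;
    return nfailed;
}
```

In this model, after model_notify_failure(&c, 2) a model_p2p(&c, 2) returns MODEL_ERR_PEER_FAILED while model_p2p(&c, 1) still returns MODEL_OK, and model_collective(&c) fails until model_validate_all(&c) has been called.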

Now you could block in the following scenario:
 - Processes A, B, D are alive, Process C is failed.
 - Processes are communicating in a ring using point-to-point operations as follows: A->B->C->D->A...
 - Since processes B and D interact directly with C, they will see the failure from their send/recv point-to-point communication. But A may block waiting on either B or D.

It is the responsibility of the application to design around this type of situation to ensure continued progress, turning an otherwise fault-unaware application into a fault-aware one. There are a few ways to do this, but the best solution will always be domain specific.
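The ring scenario can be sketched in plain C to show which processes learn of the failure from their own point-to-point traffic and which do not. This is an illustrative model only, with no MPI calls; the function names are hypothetical, and it assumes process i sends to (i + 1) mod n and receives from (i - 1) mod n.

```c
#include <assert.h>
#include <stdbool.h>

/* True if `me` exchanges messages directly with a failed neighbour,
 * so its own send or receive will complete with an error. */
bool ring_sees_failure_directly(const bool *alive, int n, int me) {
    assert(n > 0 && me >= 0 && me < n);
    int next = (me + 1) % n;     /* process that `me` sends to */
    int prev = (me + n - 1) % n; /* process that `me` receives from */
    return !alive[next] || !alive[prev];
}

/* True if `me` is alive, some process has failed, and `me` has no
 * failed neighbour. Such a process may block waiting on a neighbour
 * that has stopped forwarding, unless the application tells it. */
bool ring_may_block(const bool *alive, int n, int me) {
    assert(n > 0 && me >= 0 && me < n);
    if (!alive[me]) {
        return false;
    }
    bool any_failed = false;
    for (int i = 0; i < n; i++) {
        if (!alive[i]) {
            any_failed = true;
        }
    }
    return any_failed && !ring_sees_failure_directly(alive, n, me);
}
```

With A, B, C, D numbered 0 to 3 and C marked as failed, B and D see the failure directly, while A is the process that may block. One application-level remedy, among several, is for B or D to forward the bad news to A themselves.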

It is possible that the application detects that a peer process is faulty in a different way than fail-stop (e.g., byzantine). For example, a peer process may have incurred a soft-error memory corruption, and is sending invalid data (but valid from the MPI perspective). Another process could check the values and determine that the sender is faulty, at which point it can either:
 - Coordinate with the other alive peers to exclude the faulty process, or
 - Use MPI_Comm_kill() to request that the process be terminated.
MPI_Comm_kill() is not described in the user's guide, but is in the main proposal. It allows one process to kill another without killing itself (which would happen if they used MPI_Abort). Is this a scenario that you were concerned about?

As an aside, providing a memory boundary or specifying a heartbeat timeout value is generally difficult to do in the standard. But I don't think you necessarily need this functionality, since how soon the MPI library notifies the application of a process failure, and how well it manages its buffers internally, are quality-of-implementation issues.

> Note that this view is focused on process failures. When it is applied to things like network partitions (this includes the case you mentioned where one process can’t talk to any other due to a failed network card) then processes on both parts of the partition may be informed that the others have failed. As such, when connectivity is restored, since MPI will be responsible to maintaining self-consistency of its previous notifications, it’ll have to kill processes on one side of the partition to keep consistent with the notifications it gave to the other partition.
> Considering many apps work in master-slave mode, I would like to be able to guarantee that the side on which the master resides is not killed.

You should be able to do this by creating subsets of communicators and setting error handlers appropriately on the application end. Organize 'workers' into volatile groups (error handler = MPI_ERRORS_ARE_FATAL), while the 'manager' process(es) only ever participate in communicators that do not have a fatal error handler.
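As a rough illustration of that layout, the plain-C model below works out which ranks survive a worker failure. It does not call MPI, and all of its names are hypothetical. It rests on one loud assumption: that a fatal error handler on a communicator takes down only the members of that communicator. That is the behaviour this suggestion relies on, not something every MPI implementation guarantees. In a real program the groups would be built with calls such as MPI_Comm_split and MPI_Comm_set_errhandler.

```c
#include <assert.h>
#include <stdbool.h>

enum handler_policy {
    POLICY_FATAL,  /* models MPI_ERRORS_ARE_FATAL on the communicator */
    POLICY_RETURN  /* models a non-fatal handler: errors are returned */
};

/* Hypothetical model of a communicator: a policy plus its members. */
struct group_model {
    enum handler_policy policy;
    const int *members;
    int nmembers;
};

static bool group_contains(const struct group_model *g, int rank) {
    for (int i = 0; i < g->nmembers; i++) {
        if (g->members[i] == rank) {
            return true;
        }
    }
    return false;
}

/* ASSUMPTION: a fatal handler only takes down the members of the
 * communicator on which the failure is reported. Under that
 * assumption, `rank` is lost if it is the failed rank itself or if
 * it shares a fatal-policy group with the failed rank. */
bool rank_survives(const struct group_model *groups, int ngroups,
                   int failed_rank, int rank) {
    assert(ngroups >= 0);
    if (rank == failed_rank) {
        return false;
    }
    for (int i = 0; i < ngroups; i++) {
        const struct group_model *g = &groups[i];
        if (g->policy == POLICY_FATAL &&
            group_contains(g, failed_rank) &&
            group_contains(g, rank)) {
            return false;
        }
    }
    return true;
}
```

For example, with the manager as rank 0, worker groups {1, 2} and {3, 4} under the fatal policy, and a control group {0, 1, 2, 3, 4} under the non-fatal policy, a failure of rank 2 takes rank 1 down with it but leaves the manager and the other worker group running.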

Does that help clarify?

Thanks for the feedback,

> toon

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
