[Mpi3-ft] run through stabilization user-guide
Graham, Richard L.
rlgraham at ornl.gov
Tue Feb 8 09:16:50 CST 2011
Quite a while back we decided that we were not going to handle byzantine failures.
On 2/8/11 9:39 AM, "Darius Buntinas" <buntinas at mcs.anl.gov> wrote:
I believe the standard does define process failures as fail-stop. However, the standard does not describe how to detect failures. Doing this would severely limit implementations.
I suspect that most implementations will rely on external mechanisms to detect process failure, such as IPMI or a RAS system. MPICH2 relies on the Hydra process manager to detect abnormal process termination and node failures. I'm not sure whether fail-stop is exactly the same thing as abnormal process termination, but realistically this is how most vendors will implement it.
Distinguishing between unresponsive processes and crashed processes is impossible. Even if you add heartbeat messages between MPI libraries, they will only tell you that the library hasn't crashed. You could construct a scenario where the application is in an endless loop and never calls MPI_Recv, so the sender runs out of resources even though the library is still alive. I believe a process that has crashed but not terminated would not be considered fail-stop (more likely byzantine), and so is outside the scope of the standard. Users who need to handle these kinds of failures would need to implement their own mechanism to detect hung processes, since they are in the best position to decide when a process is hung.
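As a rough illustration of what such an application-level mechanism might look like, here is a minimal sketch of a progress watchdog. It assumes the application itself sends periodic "I made progress" reports; the names peer_watch, peer_watch_init, peer_watch_progress and peer_watch_is_hung are hypothetical and are not part of MPI or of the FT proposal.

```c
#include <stdbool.h>

/* Hypothetical application-level watchdog. Nothing here is an MPI API;
   the application decides what counts as "progress" and how long is too long. */
typedef struct {
    double last_progress;  /* time (seconds) of the peer's last progress report */
    double timeout;        /* how much silence the application tolerates */
} peer_watch;

static void peer_watch_init(peer_watch *w, double now, double timeout) {
    w->last_progress = now;
    w->timeout = timeout;
}

/* Call whenever an application-level progress report arrives from the peer. */
static void peer_watch_progress(peer_watch *w, double now) {
    w->last_progress = now;
}

/* True if the peer has been silent for longer than the application allows. */
static bool peer_watch_is_hung(const peer_watch *w, double now) {
    return (now - w->last_progress) > w->timeout;
}
```

In an MPI program the `now` argument could come from MPI_Wtime(). The point of the sketch is only that the timeout is an application choice, which is why the application, not the library, is best placed to make the call.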
On Feb 8, 2011, at 5:14 AM, Toon Knapen wrote:
> The spec explicitly doesn't define the meaning of failure because it is a very low-level concept. The MPI implementation is responsible for detecting failures and defining what qualifies as one. The only guarantee that you can rely on is that if the "failure" is bad enough that a given MPI rank can't communicate with others, then MPI will have to either abort the application completely or eventually report this failure via the FT API.
> I understand it is really hard to define exactly what a failure is. But if the future standard does not describe it, isn't there a risk that apps that rely on the FT API will become hard to port from one implementation to another?
> For instance, suppose I have one process constantly sending messages to another. Suppose the second process is dead. The first process might then either detect that the second process is down and inform the app, or say nothing to the app and buffer all messages to be sent. But this buffer might grow so big that the first process runs out of memory and dies too.
> So I think it would be useful for the mpi-library to be given some bounds within which it should function, and if it is not able to function within these bounds it should raise an error.
> For instance, a simple limit might be that each node should respond to a ping within 1s (or 1ms, or ...). If a node cannot be pinged within this limit, the node is considered to be dead.
> Another bound that might be specified and that would serve my example above is e.g. the memory bounds within which MPI should function. For instance, the app (or user) might decide to allocate 100 MB of (buffer-) memory to MPI. If the app fills this buffer and the MPI-lib is not able to flush the data sufficiently fast to the other nodes, an error will be raised. At that point the app is aware that there is a failure to communicate and can take appropriate action: slow down a bit with the sending or consider the other node to have failed.
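To make the memory-bound idea concrete, here is a minimal sketch of the accounting involved, assuming a fixed byte budget granted by the application. The names send_budget, budget_enqueue and budget_delivered and the error code are hypothetical; this is not how any MPI implementation is known to work.

```c
#include <stddef.h>

enum { SEND_OK = 0, SEND_ERR_BUFFER_FULL = 1 };  /* illustrative codes only */

/* Hypothetical bookkeeping for a bounded send buffer. */
typedef struct {
    size_t capacity;  /* bytes the application grants for buffering */
    size_t used;      /* bytes queued but not yet delivered */
} send_budget;

/* Account for a message of nbytes; refuse it instead of growing without bound. */
static int budget_enqueue(send_budget *b, size_t nbytes) {
    if (nbytes > b->capacity - b->used)
        return SEND_ERR_BUFFER_FULL;  /* caller can slow down or declare the peer failed */
    b->used += nbytes;
    return SEND_OK;
}

/* Release budget once nbytes have actually been delivered to the peer. */
static void budget_delivered(send_budget *b, size_t nbytes) {
    b->used -= (nbytes > b->used) ? b->used : nbytes;
}
```

With a budget of 100 bytes, queuing 60 succeeds and a further 50 is rejected until some of the first 60 are delivered. MPI's existing buffered mode (MPI_Buffer_attach with MPI_Bsend) already works with a user-sized buffer, so the open question is mostly how such an error would be tied to failure reporting.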
> In the above, it is always the MPI lib that detects the error. However, if the MPI library does not guarantee what exactly a failure is, I might have to detect failures in the app (to be independent of each implementation's interpretation of failure). In that case I might, for example, need to always use non-blocking sends and recvs (to avoid blocking forever in case of a failure) and check whether the messages arrive in a timely manner; if not, I consider the other process to be dead. In that case, however, I would also need functionality to tell the MPI-lib that a specific (comm,rank) should be considered MPI_RANK_STATE_NULL.
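The non-blocking approach could be sketched as a poll loop with a deadline. This is only an illustration under assumptions: wait_with_deadline, poll_fn and clock_fn are hypothetical names, and the fake_op helpers exist purely so the loop can be tried without MPI. In a real program the poll callback would wrap something like MPI_Test on a pending request and the clock would be MPI_Wtime.

```c
#include <stdbool.h>

typedef bool (*poll_fn)(void *ctx);     /* true once the pending operation completed */
typedef double (*clock_fn)(void *ctx);  /* current time in seconds */

enum { WAIT_DONE = 0, WAIT_TIMED_OUT = 1 };  /* illustrative codes only */

/* Poll a non-blocking operation until it completes or the timeout expires. */
static int wait_with_deadline(poll_fn poll, clock_fn now, void *ctx, double timeout) {
    double deadline = now(ctx) + timeout;
    for (;;) {
        if (poll(ctx)) return WAIT_DONE;
        if (now(ctx) >= deadline) return WAIT_TIMED_OUT;  /* app may treat peer as dead */
    }
}

/* Fake operation for experimenting: time advances by `step` on every clock read,
   and the operation completes once time reaches `completes_at`. */
typedef struct { double t; double step; double completes_at; } fake_op;

static double fake_now(void *ctx) { fake_op *f = ctx; f->t += f->step; return f->t; }
static bool fake_poll(void *ctx) { fake_op *f = ctx; return f->t >= f->completes_at; }
```

On timeout the application still holds a pending request, which is exactly where some way of marking the (comm,rank) pair as failed would be needed.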
> I'm sorry if the above has been discussed at length already and the forum already decided not to define 'failure' (I shouldn't have left the MPI-scene for the last three years, what was I thinking ;-). Trying to define 'failure' might be opening Pandora's box, but I'm looking at this from the application point-of-view. IMHO FT is either just about 'going down gracefully' or about 'trying to finish the job'.
> Note that this view is focused on process failures. When it is applied to things like network partitions (this includes the case you mentioned where one process can't talk to any other due to a failed network card), processes on both sides of the partition may be informed that the others have failed. As such, when connectivity is restored, since MPI will be responsible for maintaining self-consistency of its previous notifications, it'll have to kill the processes on one side of the partition to stay consistent with the notifications it gave to the other side.
> Considering many apps work in master-slave mode, I would like to be able to guarantee that the side on which the master resides is not killed.
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org