[Mpi3-ft] run through stabilization user-guide
Toon Knapen
toon.knapen at gmail.com
Tue Feb 8 07:14:51 CST 2011
>
> The spec explicitly doesn’t define the meaning of failure because it is
> a very low-level concept. The MPI implementation is responsible for
> detecting failures and defining what qualifies as one. The only guarantee
> that you can rely on is that if the “failure” is bad enough that a given MPI
> rank can’t communicate with others, then MPI will have to either abort the
> application completely or eventually report this failure via the FT API.
>
I understand it is really hard to define what exactly a failure is. But if
the future standard does not describe it, isn't there a risk that apps that
rely on the FT API will become hard to port from one implementation to
another?
For instance, suppose I have one process constantly sending messages to
another one. If the second process is dead, the first process might either
detect that the second process is down and inform the app, or it might not
tell the app anything and buffer all messages still to be sent. But this
buffer might grow so big that the first process runs out of memory and dies
too.
So I think it would be useful if the MPI library were given some bounds
within which it should function; if it is not able to function within these
bounds, it should raise an error.
For instance, a simple limit might be that each node should respond to a ping
within 1s (or 1ms, or ...). If a node cannot be pinged within this limit, the
node is considered to be dead.
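Just to make the idea concrete, such a ping can already be approximated at the
application level with a synchronous-mode send and a deadline; the tag, the
timeout and the clean-up below are arbitrary choices of mine, not part of any
proposal:

    /* Sketch only: an application-level "ping" with a user-chosen timeout.
     * Assumes the peer posts a matching receive for PING_TAG; the tag and
     * the timeout are illustrative, not part of any MPI or FT-API spec. */
    #include <mpi.h>

    #define PING_TAG 999

    /* Returns 1 if 'peer' matched our ping within 'timeout' seconds, else 0. */
    int ping_alive(MPI_Comm comm, int peer, double timeout)
    {
        char token = 0;
        MPI_Request req;
        int done = 0;

        /* A synchronous-mode send completes only once the receiver has
         * started a matching receive, so completion implies the peer is
         * alive and making progress. */
        MPI_Issend(&token, 1, MPI_CHAR, peer, PING_TAG, comm, &req);

        double start = MPI_Wtime();
        while (!done && (MPI_Wtime() - start) < timeout)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);

        if (!done) {
            /* Peer missed the deadline: treat it as failed.  The pending
             * request still has to be cleaned up, which is itself
             * problematic when the peer really is dead. */
            MPI_Cancel(&req);
            MPI_Request_free(&req);
            return 0;
        }
        return 1;
    }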
Another bound that might be specified, and that would serve my example above,
is e.g. a memory bound within which MPI should function. For instance, the
app (or user) might decide to allocate 100 MB of (buffer) memory to MPI. If
the app fills this buffer and the MPI lib is not able to flush the data to
the other nodes fast enough, an error is raised. At that point the app knows
that there is a failure to communicate and can take appropriate action: slow
down the sending a bit, or consider the other node to have failed.
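Incidentally, buffered-mode send is probably the closest thing to such a
memory bound in today's MPI: the app hands MPI a fixed buffer and MPI_Bsend
fails once that buffer is exhausted. A rough sketch, assuming the 100 MB
figure from above and error handlers set to return instead of abort:

    /* Sketch only: a user-supplied memory bound via buffered-mode send.
     * The 100 MB mirrors the example in the text; it is not mandated
     * anywhere. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BUF_BYTES (100 * 1024 * 1024)

    void send_with_bound(MPI_Comm comm, int peer, const char *msg, int len)
    {
        static char *mpi_buf = NULL;
        if (mpi_buf == NULL) {
            mpi_buf = malloc(BUF_BYTES);
            MPI_Buffer_attach(mpi_buf, BUF_BYTES);
        }

        /* Ask for error codes instead of the default abort-on-error. */
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

        int rc = MPI_Bsend((void *)msg, len, MPI_CHAR, peer, 0, comm);
        if (rc != MPI_SUCCESS) {
            /* The attached buffer is full: the receiver is not draining
             * messages fast enough (or is dead).  The app can now
             * throttle, retry later, or declare the peer failed. */
            fprintf(stderr, "buffer bound hit, peer %d may have failed\n", peer);
        }
    }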
In the above, it is always the MPI lib that detects the error. However, if
the MPI library does not guarantee what exactly a failure is, I might have to
detect failures in the app (to be independent of each implementation's free
interpretation of failure). In that case I might e.g. need to always do
non-blocking sends and recvs (to avoid being blocked for eternity in case of
a failure) and check whether the messages arrive in a timely manner; if not,
I consider the other process to be dead. In that case, however, I would also
need functionality to tell the MPI lib that a specific (comm, rank) should be
considered MPI_RANK_STATE_NULL.
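A rough sketch of what I mean (MPIX_Comm_mark_failed is purely hypothetical,
just a name for the "set (comm, rank) to MPI_RANK_STATE_NULL" functionality I
would need; no such call exists today):

    /* Sketch only: the application, not the library, decides that a peer
     * has failed, and then needs some way to tell MPI about it. */
    #include <mpi.h>

    int MPIX_Comm_mark_failed(MPI_Comm comm, int rank); /* hypothetical */

    /* Wait at most 'deadline' seconds for an expected reply from 'peer'. */
    int recv_or_give_up(MPI_Comm comm, int peer, void *buf, int count,
                        MPI_Datatype type, int tag, double deadline)
    {
        MPI_Request req;
        int done = 0;

        MPI_Irecv(buf, count, type, peer, tag, comm, &req);

        double start = MPI_Wtime();
        while (!done && (MPI_Wtime() - start) < deadline)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);

        if (!done) {
            /* No reply within the bound: the app, using its own definition
             * of failure, declares the peer dead and informs the library
             * so no further traffic to that rank is expected. */
            MPI_Cancel(&req);
            MPI_Request_free(&req);
            MPIX_Comm_mark_failed(comm, peer);   /* hypothetical call */
            return 1; /* peer considered failed */
        }
        return 0;
    }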
I'm sorry if the above has already been discussed at length and the forum has
already decided not to define 'failure' (I shouldn't have left the MPI scene
for the last three years, what was I thinking ;-). Trying to define 'failure'
might be opening Pandora's box, but I'm looking at this from the application
point of view. IMHO FT is either just about 'going down gracefully' or about
'trying to finish the job'.
>
>
> Note that this view is focused on process failures. When it is applied to
> things like network partitions (this includes the case you mentioned where
> one process can’t talk to any other due to a failed network card) then
> processes on both parts of the partition may be informed that the others
> have failed. As such, when connectivity is restored, since MPI will be
> responsible for maintaining self-consistency of its previous notifications,
> it’ll have to kill processes on one side of the partition to keep consistent
> with the notifications it gave to the other partition.
>
Considering that many apps work in master-slave mode, I would like to be able
to guarantee that the side of the partition on which the master resides is
not killed.
toon