[Mpi3-ft] run through stabilization user-guide
toon.knapen at gmail.com
Wed Feb 9 09:04:42 CST 2011
> It is important to clarify what we can and cannot specify in the standard.
> So I appreciate your help with explaining how you think the language can be
> more precise or better explained.
> For this proposal, we are focused on graceful degradation of the job.
> Processes will fail(-stop), but the job as a whole is allowed the
> opportunity to continue operating.
> We are defining process failures as fail-stop failures in the standard. So
> we cover processes that crash, and do not continue to participate in the
> parallel environment. How the MPI implementation detects process failure is
> not defined by the standard, except by the properties that the MPI
> implementation must provide to the application. The MPI implementation will
> provide a view of the failure detector that is 'perfect' from the
> perspective of the application (though internally there is a fair amount of
> flexibility on how to provide this guarantee).
I understand that processes that are detected (by the MPI lib) to fail will
be stopped. That makes sense. What I do worry about is that a process or
communication with a specific process is going awry.
> It is the responsibility of the application to design around this type of
> situation to ensure continued progress of their application - making a fault
> aware application from an otherwise fault unaware application. There are a
> few ways to do this, but the best solution will always be domain specific.
> It is possible that the application detects that a peer process is faulty
> in a different way than fail-stop (e.g., byzantine). For example, a peer
> process may have incurred a soft-error memory corruption, and is sending
> invalid data (but valid from the MPI perspective). A peer process could be
> checking the values, and determine that the peer is faulty. At which point
> it can either:
> - Coordinate with the other alive peers to exclude the faulty process, or
> - Use MPI_Comm_kill() to request that the process be terminated.
> MPI_Comm_kill() is not described in the user's guide, but is in the main
> proposal. It allows one process to kill another without killing itself
> (which would happen if they used MPI_Abort). Is this a scenario that you
> were concerned about?
Yes, this is important to me. Without a precise definition of what a
'failed' process actually is (and I understand why you avoid defining it),
I'll design the app such that it can itself decide that some process (or
the communication with it) is failing, so I want to be able to take that
decision myself too. Knowing this is possible makes me more comfortable.
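To make the idea concrete, here is a minimal sketch of what such an
application-level check could look like. Everything in it is hypothetical
and lives in the application, not in MPI: the names peer_table,
payload_is_plausible and check_peer, the MAX_PEERS limit, and the assumption
that valid data lies in a known range [lo, hi]. It only records the verdict.
Telling the other alive peers, or calling the proposed MPI_Comm_kill(), is
deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical upper bound on the number of peers the application tracks. */
#define MAX_PEERS 64

/* Application-side bookkeeping: which ranks the app has decided to distrust. */
typedef struct {
    bool excluded[MAX_PEERS];
} peer_table;

/* Domain-specific sanity check on data received from a peer.
 * Assumption: every valid value lies in [lo, hi]. The comparison is
 * written so that NaN fails the check as well. */
bool payload_is_plausible(const double *buf, size_t n, double lo, double hi)
{
    for (size_t i = 0; i < n; ++i) {
        if (!(buf[i] >= lo && buf[i] <= hi))
            return false;
    }
    return true;
}

/* Check one message from `rank` and record the verdict.
 * Returns true if the peer is excluded after this call. The verdict is
 * sticky: a peer that sent implausible data once stays excluded.
 * This is where the app would coordinate with the other alive peers or
 * request termination of the faulty process; neither is shown here. */
bool check_peer(peer_table *t, int rank, const double *buf, size_t n,
                double lo, double hi)
{
    assert(t != NULL && rank >= 0 && rank < MAX_PEERS);
    if (!t->excluded[rank] && !payload_is_plausible(buf, n, lo, hi))
        t->excluded[rank] = true;
    return t->excluded[rank];
}
```

The range test is just a stand-in; a checksum or a physical invariant of
the simulation would serve the same purpose.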
> > Considering many apps work in master-slave mode, I would like to be able
> > to guarantee that the side on which the master resides is not killed.
> You should be able to do this by creating subsets of communicators and
> setting error handlers appropriately on the application end. Organize
> 'workers' into volatile groups (error handler = MPI_ERRORS_ARE_FATAL), while
> the 'manager' process(es) only ever participate with communicators that do
> not have a fatal error handler.
I don't think I understand. The master and slave need to have a common
communicator which will be used to communicate between the two. If
the associated error handler is MPI_ERRORS_RETURN and the communication
between the two is cut, might the MPI lib decide to kill either of the two?
> Does that help clarify?
A lot. Thanks!
I think the way you (and others) clarified this, it would serve the
user-guide too to discuss more in-depth what exactly a failure is, why it is
not defined in detail and the ability for the app to detect errors and kill