[Mpi3-ft] run through stabilization user-guide

Bronevetsky, Greg bronevetsky1 at llnl.gov
Wed Feb 9 09:22:08 CST 2011


>
> Considering many apps work in master-slave mode, I would like to be able to guarantee that the side on which the master resides is not killed.
You should be able to do this by creating subsets of communicators and setting error handlers appropriately on the application end. Organize 'workers' into volatile groups (error handler = MPI_ERRORS_ARE_FATAL), while the 'manager' process(es) only ever participate with communicators that do not have a fatal error handler.
I don't think I understand. The master and slave need a common communicator in order to communicate with each other. If the associated error handler is MPI_ERRORS_RETURN and the communication between the two is cut, might the MPI library decide to kill either of the two?

If the workers use communicators whose error handler is MPI_ERRORS_ARE_FATAL, then a disconnect from the master will cause them to be aborted automatically. Meanwhile, the master will be informed of their "failure" because of the disconnect, and when the connection to the physical nodes that previously hosted the aborted workers is re-established, the master's MPI library will see that the worker tasks are already dead and will have no need to kill the master.
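A minimal sketch of this arrangement (assuming rank 0 acts as the master; the names and layout are illustrative only, and each process attaches its own local error handler to its handle):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Master: failures of peers are reported as return codes
         * rather than aborting this process. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    } else {
        /* Workers: any error on this communicator (including losing
         * contact with the master) aborts the worker.  This is the
         * default handler, set explicitly here for emphasis. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);
    }

    /* ... master/worker communication ... */

    MPI_Finalize();
    return 0;
}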

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Toon Knapen
Sent: Wednesday, February 09, 2011 7:05 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] run through stabilization user-guide


It is important to clarify what we can and cannot specify in the standard, so I appreciate your help in pointing out where you think the language can be made more precise or explained better.

For this proposal, we are focused on graceful degradation of the job. Processes will fail(-stop), but the job as a whole is allowed the opportunity to continue operating.

We are defining process failures as fail-stop failures in the standard. So we cover processes that crash, and do not continue to participate in the parallel environment. How the MPI implementation detects process failure is not defined by the standard, except by the properties that the MPI implementation must provide to the application. The MPI implementation will provide a view of the failure detector that is 'perfect' from the perspective of the application (though internally there is a fair amount of flexibility on how to provide this guarantee).
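As a rough sketch of what this looks like from the application's side (assuming MPI_ERRORS_RETURN has been set on the communicator; the specific error class raised for a failed peer is defined by the proposal and not shown here, and try_recv is a hypothetical helper):

#include <mpi.h>
#include <stdio.h>

/* Receive from a peer and report, rather than abort on, any error
 * the library returns. */
static int try_recv(double *buf, int count, int src, MPI_Comm comm)
{
    int rc = MPI_Recv(buf, count, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);

    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "receive from rank %d failed: %s\n", src, msg);
        /* Application-specific recovery: reassign the work, exclude
         * the failed peer, etc. */
    }
    return rc;
}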

I understand that processes that are detected (by the MPI library) to have failed will be stopped. That makes sense. What I do worry about is a process, or the communication with a specific process, going awry.


It is the responsibility of the application to design around this type of situation to ensure its continued progress, turning an otherwise fault-unaware application into a fault-aware one. There are a few ways to do this, but the best solution will always be domain specific.
OK



It is possible that the application detects that a peer process is faulty in a way other than fail-stop (e.g., Byzantine). For example, a peer process may have incurred a soft-error memory corruption and is sending invalid data (invalid to the application, but valid from the MPI perspective). A peer process could check the values and determine that the sender is faulty. At that point it can either:
 - Coordinate with the other alive peers to exclude the faulty process, or
 - Use MPI_Comm_kill() to request that the process be terminated.
MPI_Comm_kill() is not described in the user's guide, but it is in the main proposal. It allows one process to kill another without killing itself (which is what would happen if it used MPI_Abort). Is this a scenario that you were concerned about?
Yes, this is important to me. In the absence of a precise definition of what a 'failed' process actually is (and I understand why you avoid defining one), I will design the application so that it can itself decide that a process, or the communication to it, is failing, and I therefore want to be able to take that decision myself as well. So this makes me feel more comfortable about it.
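For illustration, that scenario might look roughly like the sketch below. MPI_Comm_kill() is part of the proposal, not of any released MPI, so the signature used here is only an assumption, and validate_payload() is a hypothetical application-level consistency check:

#include <mpi.h>

int MPI_Comm_kill(MPI_Comm comm, int rank);      /* assumed prototype */
int validate_payload(const double *buf, int n);  /* hypothetical check */

void check_peer(MPI_Comm comm, int peer, double *buf, int n)
{
    MPI_Recv(buf, n, MPI_DOUBLE, peer, 0, comm, MPI_STATUS_IGNORE);

    if (!validate_payload(buf, n)) {
        /* The peer is alive from MPI's point of view but is sending
         * corrupted data, so the application declares it failed and asks
         * MPI to terminate it.  Unlike MPI_Abort, the caller keeps running. */
        MPI_Comm_kill(comm, peer);
    }
}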





Does that help clarify?
A lot. Thanks!
Given the way you (and others) clarified this, I think it would also serve the user guide to discuss in more depth what exactly a failure is, why it is not defined in detail, and the ability for the application to detect errors and kill processes.

toon

