[Mpi3-ft] run through stabilization user-guide

Bronevetsky, Greg bronevetsky1 at llnl.gov
Sun Feb 6 16:05:30 CST 2011


Toon, thank you for the feedback! Let me try to answer your questions, though it would also be great to hear from Josh, who's now the guiding spirit of the group.

The spec explicitly doesn't define the meaning of failure because it is a very low-level concept. The MPI implementation is responsible for detecting failures and defining what qualifies as one. The only guarantee that you can rely on is that if the "failure" is bad enough that a given MPI rank can't communicate with others, then MPI will have to either abort the application completely or eventually report this failure via the FT API.

Note that this view is focused on process failures. When it is applied to things like network partitions (including the case you mentioned, where one process can't talk to any other because of a failed network card), processes on both sides of the partition may be informed that the others have failed. As such, when connectivity is restored, MPI will have to kill the processes on one side of the partition: it is responsible for keeping its earlier notifications self-consistent, so it must stay consistent with what it already told the other side.

We haven't talked about the process itself detecting failures independently of the MPI library. Right now the only thing a process can do is call MPI_Abort() if it notices that its core is not operating correctly. However, in the future I can easily see us providing applications with a call through which they can inform MPI about network failures and failures of other processes.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Toon Knapen
Sent: Sunday, February 06, 2011 10:19 AM
To: mpi3-ft at lists.mpi-forum.org
Subject: [Mpi3-ft] run through stabilization user-guide

Hi all,

(I've ended up on this ml because of a recent discussion on the OpenMPI ml referring to this user-guide and because I'm working on a parallel financial app that runs in batch-mode but will also be run in online-mode. The calculations are only loosely coupled, but stability/fault-tolerance is important when running in batch-mode and certainly when running online.)

So I ended up reading the user guide (https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_users_guide) and since it seems intended for app developers like me, I'll provide some feedback from my pov:

In the motivation section, paragraph 4: Hopefully not a silly question, but what is actually considered to be a failure? Looking into the archives of this ml there are many discussions on the topic, but what is actually the current conclusion?

Additionally, is it possible for a process to detect that it has itself failed, e.g. if its network went down and it is thus unable to communicate any further?

Finally, I think there is a typo in the first example in the section Point-to-Point operators. It relies on 'status' for knowing the failed rank while IMO it should be 'peer'.

toon
