[Mpi3-ft] run through stabilization user-guide

Toon Knapen toon.knapen at gmail.com
Sun Feb 6 12:19:13 CST 2011


Hi all,

(I ended up on this mailing list because of a recent discussion on the
OpenMPI mailing list that referred to this user guide, and because I'm
working on a parallel financial app that runs in batch mode but will also
be run in online mode. The calculations are only loosely coupled, but
stability/fault tolerance is important when running in batch mode and
certainly when running online.)

So I ended up reading the user guide (
https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_users_guide)
and since it seems aimed at app developers like me, I'll provide some
feedback from my point of view:

In the motivation section, paragraph 4: hopefully not a silly question, but
what is actually considered to be a failure? Looking through the archives of
this mailing list, there are many discussions on the topic, but what is the
current conclusion?

Additionally, is it possible for a process to detect that it has itself
failed, e.g. if its network went down and it is thus unable to communicate
any further?

Finally, I think there is a typo in the first example in the section
Point-to-Point operators. It relies on 'status' to identify the failed rank,
while IMO it should use 'peer'.

toon