<div>Hi all,</div>
<div> </div>
<div>(I've ended up on this ml because of a recent discussion on the OpenMPI ml referring to this user guide, and because I'm working on a parallel financial app that runs in batch mode but will also be run in online mode. The calculations are only loosely coupled, but stability/fault tolerance is important when running in batch mode and certainly when running online.)</div>
<div> </div>
<div>So I ended up reading the user guide (<a href="https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_users_guide" target="_blank">https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_users_guide</a>) and since it seems aimed at app developers like me, I'll provide some feedback from my pov:</div>
<div> </div>
<div>In the motivation section, paragraph 4: Hopefully not a silly question, but what is actually considered to be a failure? Looking into the archives of this ml, there are many discussions on the topic, but what is the current conclusion?</div>
<div><br></div><div>Additionally, can a process detect that it has itself failed, e.g. if its network went down and it is thus unable to communicate any further?</div><div><br></div><div>Finally, I think there is a typo in the first example in the section Point-to-Point operators. It relies on 'status' to identify the failed rank, while IMO it should be 'peer'.</div>
<div><br></div><div>toon</div><div><br></div><div><br></div>