<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class="">Hi Martin et al.,</div><div class=""><br class=""></div><div class="">To tackle your second point, with reference to the attached Venn diagram (also visible at <a href="https://miro.com/app/board/o9J_lLW6LU4=/" class="">https://miro.com/app/board/o9J_lLW6LU4=/</a>): I think that Sessions v2.0 should include the intersection of Application and System (i.e. AS) but not all of SYS, just as I want it to include the intersection of Application and Resource Manager (i.e. AR) but not all of RM. If the Resource Manager is written as an MPI program (not an APP), then RS becomes like AS and is also included in Sessions v2.0 by virtue of that fact alone.</div><div class=""><br class=""></div><div class="">Fault tolerance (FT) is an amorphous emergent property of a system comprising all these components working together to get useful work done despite problems.
Thus, “include FT” is not well enough defined, IMHO.</div><div class=""><br class=""></div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">Definitions</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">—</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">SYS: something in the system must detect faults and propagate/disseminate that information appropriately.</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">RM: something in the system must allocate resources and recover them when appropriate.</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">APP: the only non-infrastructure component — the thing that attempts to do useful work.</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">AR: the interface between APP and RM — currently whatever job start mechanism exists (job script, mpiexec/mpirun command-line parameters, MPI_INIT) and parts of the dynamic model (MPI_COMM_SPAWN creates new MPI processes, but may not create new OS processes or allocate additional hardware CPUs).</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">AS: the interface between APP and SYS — currently whatever error reporting mechanisms exist, including MPI errors, but no direct interface for failures (except as implied by errors) or faults (except as guessed from errors and implied failures).</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">RS: the interface between RM and SYS — currently whatever reliability mechanisms exist, mostly/entirely outwith the current scope of both MPI (the interface) and MPI implementations (the library code).</div><div style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""><br class=""></div><div class="">Musings</div><div class="">—</div><div class="">MPI: I see the entirety of MPI (interface and library code) as the white rectangle — bits of MPI fit into every 
portion of this diagram.</div><div class=""><span style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">ULFM: </span>I see the shrink procedure as fitting within AS, attempts to replace failed processes (with spares or newly spawned ones) as fitting within ARS, and the application adjusting to cope with different resources as fitting within AR.</div><div class="">ReInit: I see this as extending APP into AR only; all interactions with SYS are done outwith the APP.</div></body></html>