[mpiwg-sessions] Discussion document from Today

Dan Holmes danholmes at chi.scot
Thu Apr 8 07:00:11 CDT 2021


Hi Martin, et al,

To tackle your second point, and referring to the attached Venn diagram (also visible at https://miro.com/app/board/o9J_lLW6LU4=/): I think that Sessions v2.0 should include the intersection of Application and System (i.e. AS) but not all of SYS, just as I want it to include the intersection of Application and Resource Manager (i.e. AR) but not all of RM. If the Resource Manager is written as an MPI program (not an APP), then RS becomes like AS and is also included in Sessions v2.0 by virtue of that fact alone.

Fault tolerance (FT) is an amorphous emergent property of a system comprising all these components working together to get useful work done despite problems. Thus, “include FT” is not well enough defined, IMHO.

Definitions
—
SYS: something in the system must detect faults and propagate/disseminate that information appropriately.
RM: something in the system must allocate resources and recover them when appropriate.
APP: the only non-infrastructure component — the thing that attempts to do useful work.
AR: the interface between APP and RM — currently whatever job start mechanism exists (job script, mpiexec/mpirun command-line parameters, MPI_INIT) and parts of the dynamic model (MPI_COMM_SPAWN creates new MPI processes, but may not create new OS processes or allocate additional hardware CPUs).
AS: the interface between APP and SYS — currently whatever error reporting mechanisms exist, including MPI errors, but no direct interface for failures (except as implied by errors) or faults (except as guessed from errors and implied failures).
RS: the interface between RM and SYS — currently whatever reliability mechanisms exist, mostly/entirely outwith the current scope of both MPI (the interface) and MPI implementations (the library code).
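
As a rough illustration of how these definitions combine with the scoping proposal at the top of this message, here is a minimal Python sketch. It assumes the proposal is read as "every overlap that touches APP is in scope, the rest of RM and SYS is not, and RS joins only when the RM is itself an MPI program", and it treats the three-way overlap ARS as part of both AR and AS, as in an ordinary Venn diagram. The names COMPONENTS, interfaces, sessions_v2_scope and rm_is_mpi_program are invented for this example and are not taken from the diagram or from MPI.

```python
from itertools import combinations

# Component names from the Definitions list, keyed to the initials
# used for the interface names (AR, AS, RS).
COMPONENTS = {"APP": "A", "RM": "R", "SYS": "S"}


def interfaces(components):
    """Return every pairwise and three-way overlap, named by initials."""
    letters = sorted(components.values())
    names = []
    for size in (2, 3):
        for combo in combinations(letters, size):
            names.append("".join(combo))
    return names


def sessions_v2_scope(rm_is_mpi_program=False):
    """Overlaps assumed to be in scope for Sessions v2.0.

    Assumption: an overlap is in scope when it touches APP; RS is
    added only when the resource manager is written as an MPI program.
    """
    scope = {name for name in interfaces(COMPONENTS) if "A" in name}
    if rm_is_mpi_program:
        scope.add("RS")
    return scope


print(interfaces(COMPONENTS))
print(sorted(sessions_v2_scope()))
print(sorted(sessions_v2_scope(rm_is_mpi_program=True)))
```

The interiors of RM and SYS never appear in the result; only the interfaces do, which is the distinction the proposal is drawing.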

Musings
—
MPI: I see the entirety of MPI (interface and library code) as the white rectangle — bits of MPI fit into every portion of this diagram.
ULFM: I see the shrink procedure as fitting within AS, attempts to replace failed processes (with spares or newly spawned ones) fit within ARS, and the application adjusting to cope with different resources fits within AR.
ReInit: I see this as extending APP into AR only — all interactions with SYS are done outwith the APP.
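
One way to compare these placements is to tabulate them and ask which activities involve a given component. The sketch below is only an illustration of the musings above: the dictionary PLACEMENTS and the helper needs_component are hypothetical names, and the region labels are copied from the ULFM and ReInit lines, not from any MPI or ULFM API.

```python
# Region assigned to each activity in the Musings list above.
PLACEMENTS = {
    "ULFM: shrink": "AS",
    "ULFM: replace failed processes": "ARS",
    "ULFM: adapt to different resources": "AR",
    "ReInit": "AR",
}


def needs_component(placements, letter):
    """List the activities whose region involves the given component.

    letter is a component initial: "A" (APP), "R" (RM) or "S" (SYS).
    """
    return sorted(
        name for name, region in placements.items() if letter in region
    )


# Activities that need some APP-to-SYS interaction.
print(needs_component(PLACEMENTS, "S"))
# Activities that need some APP-to-RM interaction.
print(needs_component(PLACEMENTS, "R"))
```

Under this reading, ReInit is absent from the SYS list, which matches the remark that all of its interactions with SYS happen outwith the APP.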




Cheers,
Dan.
—
Dr Daniel Holmes PhD
Executive Director
Chief Technology Officer
CHI Ltd
danholmes at chi.scot



> On 8 Apr 2021, at 09:00, Schreiber, Martin <martin.schreiber at tum.de> wrote:
> 
> Hi Dan and others,
> 
> I'm through with my iteration & I'm up for other changes / suggestions
> / acceptance of my changes, etc.
> 
> I see two major discussion points:
> 
> * The slow/medium/fast changes, where I wonder if this is the right
> direction. I've tried to define this slightly differently. Maybe this
> could also be defined in terms of the success of coercive requests and
> whether this is asynchronous or not. I'm not sure about this.
> 
> * Do we want to include fault tolerance? This would significantly
> complicate things.
> 
> Cheers,
> 
> Martin
> 
> 
> 
> 
> On Tue, 2021-03-30 at 19:12 +0100, Dan Holmes wrote:
>> Hi Martin (et al),
>> 
>> I just went through and added my tuppence throughout.
>> 
>> Cheers,
>> Dan.
>> Dr Daniel Holmes PhD
>> Executive Director
>> Chief Technology Officer
>> CHI Ltd
>> danholmes at chi.scot
>> 
>> 
>> 
>>> On 29 Mar 2021, at 19:49, Schreiber, Martin via mpiwg-sessions
>>> <mpiwg-sessions at lists.mpi-forum.org> wrote:
>>> 
>>> 
>>> Hi all,
>>> 
>>> here's the document from today:
>>> https://drive.google.com/drive/folders/1NLAMZtH5B3bVSnWk-ZxM9FhCJesZ92rL
>>> 
>>> For those not attending the meeting: We plan to extend this document
>>> and iterate over it to generate an API covering all important aspects.
>>> 
>>> You're welcome anytime to add changes / comments (please with tracking
>>> of changes and using your real username). We very much appreciate any
>>> feedback.
>>> 
>>> Seems like it's public/bank holiday next week, so we meet again in 2
>>> weeks.
>>> 
>>> Cheers,
>>> 
>>> Martin
>>> 
>>> 
>>> 
>>> -- 
>>> Dr. Martin Schreiber
>>> +49 (89) 289-17661
>>> 
>>> Technical University of Munich
>>> Informatics 10: Computer Architecture and Parallel Systems
>>> Boltzmannstrasse 3, Room 1.4.38
>>> 85748 Garching, Germany
>>> _______________________________________________
>>> mpiwg-sessions mailing list
>>> mpiwg-sessions at lists.mpi-forum.org
>>> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-sessions
>> 
> 
> -- 
> Dr. Martin Schreiber
> +49 (89) 289-17661
> 
> Technical University of Munich
> Informatics 10: Computer Architecture and Parallel Systems
> Boltzmannstrasse 3, Room 1.4.38
> 85748 Garching, Germany

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Sessions v2.0 design landscape.pdf
Type: application/pdf
Size: 45744 bytes
Desc: not available
URL: <http://lists.mpi-forum.org/pipermail/mpiwg-sessions/attachments/20210408/0a228486/attachment-0001.pdf>