[mpiwg-ft] [Mpi-forum] FTWG Con Call Today

Ignacio Laguna lagunaperalt1 at llnl.gov
Mon Jan 25 11:20:22 CST 2021


Michael:

It's very simple. The high-level idea is to encapsulate the main 
function in a resilient_main function, which we can call again when a 
failure occurs. In our implementation, we use setjmp/longjmp 
semantics. As long as the encapsulation is done properly, branches higher 
up in the stack won't affect it.
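
A minimal sketch of the idea in plain C (not the actual Reinit implementation, and with no MPI calls so it stays self-contained). The names restart_point, report_failure, and run_resilient are hypothetical; only resilient_main comes from the description above. The failure is simulated with a counter. Because longjmp unwinds the stack back to the frame that called setjmp, each retry is again "call 1" of resilient_main and no extra frame is left behind.

```c
#include <setjmp.h>
#include <stdio.h>

/* Hypothetical restart point; a real runtime would own this state. */
static jmp_buf restart_point;
static int failures_left;   /* simulated failures still to inject */

/* Stand-in for whatever the runtime does when it detects a failure:
 * jump back to the restart point, discarding all frames above it. */
static void report_failure(void)
{
    longjmp(restart_point, 1);
}

/* The encapsulated application code. */
static void resilient_main(int attempt)
{
    printf("resilient_main: attempt %d\n", attempt);
    if (failures_left > 0) {
        failures_left--;
        report_failure();   /* does not return */
    }
    /* ... normal work would finish here ... */
}

/* Calls resilient_main, re-entering it after each simulated failure.
 * Returns the number of attempts that were needed. */
int run_resilient(int simulated_failures)
{
    /* volatile: modified between setjmp and longjmp, read afterwards */
    volatile int attempt = 0;

    failures_left = simulated_failures;
    (void)setjmp(restart_point);   /* execution resumes here after longjmp */
    attempt++;
    resilient_main(attempt);
    return attempt;
}
```

With two simulated failures, run_resilient(2) enters resilient_main three times and returns 3; the stack depth at each entry is the same.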

Ignacio

On 1/25/21 9:13 AM, Steyer, Michael wrote:
> Thanks Ignacio, I'd be very interested in learning how that approach works. Especially the "goes always back to the resilient_function call 1" part, without adding another branch on the call stack?
> 
> /Michael
> 
> -----Original Message-----
> From: mpiwg-ft <mpiwg-ft-bounces at lists.mpi-forum.org> On Behalf Of Ignacio Laguna via mpiwg-ft
> Sent: Montag, 25. Januar 2021 17:33
> To: MPI WG Fault Tolerance and Dynamic Process Control working Group <mpiwg-ft at lists.mpi-forum.org>
> Cc: Ignacio Laguna <lagunaperalt1 at llnl.gov>
> Subject: Re: [mpiwg-ft] [Mpi-forum] FTWG Con Call Today
> 
> The model is that the app goes always back to the resilient_function call 1 (we cannot call this function twice or more statically in the program). Perhaps we can discuss that again.
> 
> Ignacio
> 
> 
> On 1/25/21 8:25 AM, Wesley Bland via mpiwg-ft wrote:
>> There was another question that came up in internal conversations
>> around here with Reinit:
>>
>> What's going to happen to the call stack? E.g. MPI_Init -> ... ->
>> resilient_function call 1 -> Failure -> ReInit -> resilient_function
>> call 2 -> End of Work -> back to resilient_function call 1?
>>
>> On Mon, Jan 25, 2021 at 10:01 AM Ignacio Laguna
>> <lagunaperalt1 at llnl.gov <mailto:lagunaperalt1 at llnl.gov>> wrote:
>>
>>      That works for me (I couldn't attend today either).
>>
>>      We are almost done with the new Reinit spec but we have a few topics we
>>      would like to discuss in the group: (1) using several error handlers and
>>      how this is specified in the standard, (2) the state of MPI between a
>>      failure and its recovery (how does ULFM do it? Perhaps Reinit can
>>      re-use the same text?).
>>
>>      Thanks!
>>
>>      Ignacio
>>
>>      On 1/25/21 6:20 AM, Wesley Bland via mpi-forum wrote:
>>       > Hi all,
>>       >
>>       > After talking to Tony, we're going to delay this discussion until the
>>       > next call on Feb 8. Today's call is cancelled.
>>       >
>>       > Thanks,
>>       > Wes
>>       >
>>       > On Mon, Jan 25, 2021 at 8:15 AM work at wesbland.com wrote:
>>       >
>>       >     The Fault Tolerance Working Group’s weekly con call is today at
>>       >     12:00 PM Eastern. Today's agenda:
>>       >
>>       >     * FA-MPI (Tony)
>>       >     * Other updates (All)
>>       >
>>       >     If there's something else that people would like to discuss, please
>>       >     just send an email to the WG so we can get it on the agenda.
>>       >
>>       >     Thanks,
>>       >     Wes
>>       >
>>       >
>>       >     .......................................................................................................................................
>>       >
>>       >     Join from PC, Mac, Linux, iOS or Android:
>>       >     https://tennessee.zoom.us/j/632356722?pwd=lI4_169CGcewIumekTziMw
>>       >
>>       >          Password: mpiforum
>>       >
>>       >     Or iPhone one-tap (US Toll):  +16468769923,632356722#  or
>>       >     +16699006833,632356722#
>>       >
>>       >     Or Telephone:
>>       >
>>       >          Dial:
>>       >          +1 646 876 9923 (US Toll)
>>       >          +1 669 900 6833 (US Toll)
>>       >          Meeting ID: 632 356 722
>>       >          International numbers available: https://zoom.us/u/6uINe
>>       >
>>       >     Or an H.323/SIP room system:
>>       >
>>       >          H.323: 162.255.37.11 (US West) or 162.255.36.11 (US East)
>>       >          Meeting ID: 632 356 722
>>       >          Password: 364216
>>       >
>>       >          SIP: 632356722 at zoomcrc.com
>>       >          Password: 364216
>>       >
>>       >     .......................................................................................................................................
>>       >
>>       >
>>       > _______________________________________________
>>       > mpi-forum mailing list
>>       > mpi-forum at lists.mpi-forum.org <mailto:mpi-forum at lists.mpi-forum.org>
>>       > https://lists.mpi-forum.org/mailman/listinfo/mpi-forum
>>       >
>>
>>
>> _______________________________________________
>> mpiwg-ft mailing list
>> mpiwg-ft at lists.mpi-forum.org
>> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft
>>

