[mpiwg-ft] [EXTERNAL] Re: [Mpi-forum] FTWG Con Call Today
Teranishi, Keita
knteran at sandia.gov
Mon Jan 25 12:16:30 CST 2021
All,
From my experience with Fenix (our ULFM-based recovery model, similar to ReInit), we found that this approach works well in flat-MPI mode.
However, if the program uses an exotic runtime (such as HClib, the on-node task-parallel runtime from Georgia Tech), setjmp/longjmp breaks the program.
This required us to either (1) refactor the runtime to be MPI-aware, or (2) give up the setjmp/longjmp approach for the work presented at the last ExaMPI workshop.
I am wondering how to handle a program that uses OpenMP/CUDA (or the AMD and Intel accelerator runtimes) in the setjmp/longjmp recovery model. It appears that it is safe as long as the runtime is cleaned up before calling longjmp. With OpenMP, the longjmp should happen outside any OpenMP pragma (which is easy to implement, I think). With the GPU/accelerator model, all kernel executions should be completed or cancelled first. My question is: who is responsible for this type of cleanup? Is it the user's responsibility?
Regards,
Keita
On 1/25/21, 9:20 AM, "mpiwg-ft on behalf of Ignacio Laguna via mpiwg-ft" <mpiwg-ft-bounces at lists.mpi-forum.org on behalf of mpiwg-ft at lists.mpi-forum.org> wrote:
Michael:
It's very simple. The high-level idea is to encapsulate the main
function in a resilient_main function, which we can call again when a
failure occurs. In our implementation, we use setjmp/longjmp
semantics. As long as the encapsulation is done properly, branches
higher up in the stack won't affect it.
Ignacio
On 1/25/21 9:13 AM, Steyer, Michael wrote:
> Thanks Ignacio, I'd be very interested in learning how that approach works, especially the "goes always back to the resilient_function call 1" part. How does that work without adding another branch on the call stack?
>
> /Michael
>
> -----Original Message-----
> From: mpiwg-ft <mpiwg-ft-bounces at lists.mpi-forum.org> On Behalf Of Ignacio Laguna via mpiwg-ft
> Sent: Montag, 25. Januar 2021 17:33
> To: MPI WG Fault Tolerance and Dynamic Process Control working Group <mpiwg-ft at lists.mpi-forum.org>
> Cc: Ignacio Laguna <lagunaperalt1 at llnl.gov>
> Subject: Re: [mpiwg-ft] [Mpi-forum] FTWG Con Call Today
>
> The model is that the app always goes back to resilient_function call 1 (we cannot call this function twice or more statically in the program). Perhaps we can discuss that again.
>
> Ignacio
>
>
> On 1/25/21 8:25 AM, Wesley Bland via mpiwg-ft wrote:
>> There was another question that came up in internal conversations
>> around here with Reinit:
>>
>> What's going to happen to the call stack? E.g., MPI_Init -> ... ->
>> resilient_function call 1 -> Failure -> ReInit -> resilient_function
>> call 2 -> End of Work -> back to resilient_function call 1?
>>
>> On Mon, Jan 25, 2021 at 10:01 AM Ignacio Laguna
>> <lagunaperalt1 at llnl.gov> wrote:
>>
>> That works for me (I couldn't attend today either).
>>
>> We are almost done with the new Reinit spec, but we have a few topics we
>> would like to discuss in the group: (1) using several error handlers and
>> how this is specified in the standard, and (2) the state of MPI between a
>> failure and its recovery (how does ULFM do it? Perhaps Reinit can
>> re-use the same text?).
>>
>> Thanks!
>>
>> Ignacio
>>
>> On 1/25/21 6:20 AM, Wesley Bland via mpi-forum wrote:
>> > Hi all,
>> >
>> > After talking to Tony, we're going to delay this discussion until
>> the
>> > next call on Feb 8. Today's call is cancelled.
>> >
>> > Thanks,
>> > Wes
>> >
>> > On Mon, Jan 25, 2021 at 8:15 AM work at wesbland.com wrote:
>> >
>> > The Fault Tolerance Working Group’s weekly con call is today at
>> > 12:00 PM Eastern. Today's agenda:
>> >
>> > * FA-MPI (Tony)
>> > * Other updates (All)
>> >
>> > If there's something else that people would like to discuss, please
>> > just send an email to the WG so we can get it on the agenda.
>> >
>> > Thanks,
>> > Wes
>> >
>> > Join from PC, Mac, Linux, iOS or Android:
>> > https://tennessee.zoom.us/j/632356722?pwd=lI4_169CGcewIumekTziMw
>> > Password: mpiforum
>> >
>> > Or iPhone one-tap (US Toll): +16468769923,632356722# or
>> > +16699006833,632356722#
>> >
>> > Or Telephone:
>> > Dial: +1 646 876 9923 (US Toll) or +1 669 900 6833 (US Toll)
>> > Meeting ID: 632 356 722
>> > International numbers available: https://zoom.us/u/6uINe
>> >
>> > Or an H.323/SIP room system:
>> > H.323: 162.255.37.11 (US West) or 162.255.36.11 (US East)
>> > Meeting ID: 632 356 722
>> > Password: 364216
>> >
>> > SIP: 632356722 at zoomcrc.com
>> > Password: 364216
>> >
>> >
>> > _______________________________________________
>> > mpi-forum mailing list
> > mpi-forum at lists.mpi-forum.org
>> > https://lists.mpi-forum.org/mailman/listinfo/mpi-forum
>> >
>>
>>
>> _______________________________________________
>> mpiwg-ft mailing list
>> mpiwg-ft at lists.mpi-forum.org
>> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft
>>