[mpiwg-ft] [EXTERNAL] Re: [Mpi-forum] FTWG Con Call Today

Ignacio Laguna lagunaperalt1 at llnl.gov
Mon Jan 25 12:31:37 CST 2021


Hi Keita,

Good question. We have two operation modes, asynchronous and 
synchronous. In the asynchronous case, the application automatically 
jumps back to the reinit point. In the synchronous case, the user 
places MPI_test_failure calls to check for failures and jump back. The 
latter mode exists to deal with OpenMP regions, because we don't want 
to jump back in the middle of one.
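
To make the synchronous mode concrete, here is a minimal C sketch. The 
names MPI_test_failure and reinit_point stand in for the proposed 
interface (their exact form, and whether the library or the user 
performs the jump, is still under discussion); the point is only that 
the check sits between OpenMP regions, never inside one:

    #include <setjmp.h>

    /* Placeholders for the proposed Reinit interface -- illustrative only. */
    extern jmp_buf reinit_point;               /* set at the reinit point    */
    extern int MPI_test_failure(int *failed);  /* hypothetical failure check */

    void timestep_loop(double *u, int n, int nsteps)
    {
        for (int step = 0; step < nsteps; ++step) {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                u[i] += 1.0;                   /* on-node work; no jumps inside the region  */

            int failed = 0;
            MPI_test_failure(&failed);         /* check placed after the OpenMP region ends */
            if (failed)
                longjmp(reinit_point, 1);      /* synchronous mode: jump back to reinit     */
        }
    }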

With the GPU/accelerator model, the host code will get an error from a 
kernel launch and will trigger recovery after that. The user is 
responsible for cleaning up state on the GPU, as well as for calling 
any libraries so that they can clean up their own state.
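
For the GPU case, here is a rough sketch using the CUDA runtime 
(launch_my_kernel and trigger_reinit_recovery are hypothetical 
stand-ins, and the cleanup calls are only examples of what cleaning up 
device state could involve):

    #include <cuda_runtime.h>

    extern void launch_my_kernel(float *d_buf, int n);  /* hypothetical kernel wrapper      */
    extern void trigger_reinit_recovery(void);          /* hypothetical hook back to reinit */

    void run_device_step(float *d_buf, int n)
    {
        launch_my_kernel(d_buf, n);

        cudaError_t err = cudaGetLastError();           /* catches launch errors    */
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();              /* catches execution errors */

        if (err != cudaSuccess) {
            cudaFree(d_buf);            /* user cleans up device allocations...            */
            cudaDeviceReset();          /* ...and the device context                       */
            /* GPU libraries (cuBLAS handles, streams, etc.) should be torn down here too */
            trigger_reinit_recovery();  /* then hand control back to the recovery path     */
        }
    }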

I hope that answers the question.

Ignacio

On 1/25/21 10:16 AM, Teranishi, Keita wrote:
> All,
> 
> From my experience with Fenix (our ULFM-based recovery model, similar to ReInit), we found that this approach works in flat-MPI mode.
> However, if the program is using an exotic runtime (such as HCLIB, an on-node task-parallel runtime from GaTech), setjmp/longjmp messes up the program.
> This required us to either (1) refactor the runtime to be MPI-aware, or (2) give up the setjmp/longjmp approach for the work presented at the last ExaMPI workshop.
> 
> I am wondering how to handle a program with OpenMP/CUDA (and the AMD and Intel accelerator runtimes) in the setjmp/longjmp recovery model. It appears that it is safe to clean up the runtime before calling longjmp. With OpenMP, the longjmp should happen outside any OpenMP pragma (it is easy to implement, I think). With the GPU/accelerator model, all kernel executions should be completed or cancelled. My question is: who is responsible for this type of clean-up? Is it the user's responsibility?
> 
> Regards,
> Keita
> 
> On 1/25/21, 9:20 AM, "mpiwg-ft on behalf of Ignacio Laguna via mpiwg-ft" <mpiwg-ft-bounces at lists.mpi-forum.org on behalf of mpiwg-ft at lists.mpi-forum.org> wrote:
> 
>      Michael:
>      
>      It's very simple. The high-level idea is to encapsulate the main
>      function into a resilient_main function, which we can call again when a
>      failure occurs. In our implementation, we use setjmp/longjmp
>      semantics. As long as the encapsulation is done properly, branches higher
>      up in the stack won't affect it.
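>      
>      A minimal sketch of that encapsulation (illustrative only; the exact
>      reinit API is not settled here, and whether the setjmp lives in the
>      library or in user code is an implementation detail):
>      
>          #include <mpi.h>
>          #include <setjmp.h>
>      
>          static jmp_buf reinit_point;
>      
>          /* All application work lives here; it may be entered many times. */
>          static int resilient_main(int argc, char **argv)
>          {
>              /* ... restore from checkpoint, run the computation ... */
>              return 0;
>          }
>      
>          int main(int argc, char **argv)
>          {
>              MPI_Init(&argc, &argv);
>              setjmp(reinit_point);                 /* recovery jumps back here      */
>              int rc = resilient_main(argc, argv);  /* re-entered after each failure */
>              MPI_Finalize();
>              return rc;
>          }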
>      
>      Ignacio
>      
>      On 1/25/21 9:13 AM, Steyer, Michael wrote:
>      > Thanks Ignacio, I'd be very interested in learning how that approach works. Especially the "goes always back to the resilient_function call 1" part, without adding another branch on the call stack?
>      >
>      > /Michael
>      >
>      > -----Original Message-----
>      > From: mpiwg-ft <mpiwg-ft-bounces at lists.mpi-forum.org> On Behalf Of Ignacio Laguna via mpiwg-ft
>      > Sent: Monday, 25 January 2021 17:33
>      > To: MPI WG Fault Tolerance and Dynamic Process Control working Group <mpiwg-ft at lists.mpi-forum.org>
>      > Cc: Ignacio Laguna <lagunaperalt1 at llnl.gov>
>      > Subject: Re: [mpiwg-ft] [Mpi-forum] FTWG Con Call Today
>      >
>      > The model is that the app goes always back to the resilient_function call 1 (we cannot call this function twice or more statically in the program). Perhaps we can discuss that again.
>      >
>      > Ignacio
>      >
>      >
>      > On 1/25/21 8:25 AM, Wesley Bland via mpiwg-ft wrote:
>      >> There was another question that came up in internal conversations
>      >> around here with Reinit:
>      >>
>      >> What's going to happen to the call stack? E.g., MPI_Init -> ... ->
>      >> resilient_function call 1 -> Failure -> ReInit -> resilient_function
>      >> call 2 -> End of Work -> back to resilient_function call 1?
>      >>
>      >> On Mon, Jan 25, 2021 at 10:01 AM Ignacio Laguna
>      >> <lagunaperalt1 at llnl.gov> wrote:
>      >>
>      >>      That works for me (I couldn't attend today either).
>      >>
>      >>      We are almost done with the new Reinit spec but we have a few topics we
>      >>      would like to discuss in the group: (1) using several error handlers
>      >>      and
>      >>      how this is specified in the standard, (2) the state of MPI between a
>      >>      failure and its recovery (how does ULFM do it? Perhaps Reinit can
>      >>      re-use the same text?).
>      >>
>      >>      Thanks!
>      >>
>      >>      Ignacio
>      >>
>      >>      On 1/25/21 6:20 AM, Wesley Bland via mpi-forum wrote:
>      >>       > Hi all,
>      >>       >
>      >>       > After talking to Tony, we're going to delay this discussion until
>      >>      the
>      >>       > next call on Feb 8. Today's call is cancelled.
>      >>       >
>      >>       > Thanks,
>      >>       > Wes
>      >>       >
>      >>       > On Mon, Jan 25, 2021 at 8:15 AM work at wesbland.com
>      >>       > <work at wesbland.com> wrote:
>      >>       >
>      >>       >     The Fault Tolerance Working Group’s weekly con call is today at
>      >>       >     12:00 PM Eastern. Today's agenda:
>      >>       >
>      >>       >     * FA-MPI (Tony)
>      >>       >     * Other updates (All)
>      >>       >
>      >>       >     If there's something else that people would like to discuss, please
>      >>       >     just send an email to the WG so we can get it on the agenda.
>      >>       >
>      >>       >     Thanks,
>      >>       >     Wes
>      >>       >
>      >>       >     Join from PC, Mac, Linux, iOS or Android:
>      >>       >     https://tennessee.zoom.us/j/632356722?pwd=lI4_169CGcewIumekTziMw
>      >>       >          Password: mpiforum
>      >>       >
>      >>       >     Or iPhone one-tap (US Toll): +16468769923,632356722# or
>      >>       >     +16699006833,632356722#
>      >>       >
>      >>       >     Or Telephone:
>      >>       >          Dial:
>      >>       >          +1 646 876 9923 (US Toll)
>      >>       >          +1 669 900 6833 (US Toll)
>      >>       >          Meeting ID: 632 356 722
>      >>       >          International numbers available: https://zoom.us/u/6uINe
>      >>       >
>      >>       >     Or an H.323/SIP room system:
>      >>       >          H.323: 162.255.37.11 (US West) or 162.255.36.11 (US East)
>      >>       >          Meeting ID: 632 356 722
>      >>       >          Password: 364216
>      >>       >
>      >>       >          SIP: 632356722 at zoomcrc.com
>      >>       >          Password: 364216
>      >>       >
>      >>
>      >>
>      >
>      
> 

