[mpiwg-ft] [EXTERNAL] Re: [Mpi-forum] FTWG Con Call Today

Teranishi, Keita knteran at sandia.gov
Mon Jan 25 12:16:30 CST 2021


All,

From my experience with Fenix (our ULFM-based recovery model, similar to ReInit), this approach works in flat-MPI mode.
However, if the program uses an exotic runtime (such as HCLIB, an on-node task-parallel runtime from GaTech), setjmp/longjmp corrupts the program state.
This forced us to either (1) refactor the runtime to be MPI-aware, or (2) give up the setjmp/longjmp approach for the work presented at the last ExaMPI workshop.

I am wondering how to handle a program that uses OpenMP/CUDA (or the AMD and Intel accelerator runtimes) in the setjmp/longjmp recovery model. It appears that it is safe to clean up the runtime before calling longjmp. With OpenMP, the longjmp should happen outside any OpenMP pragma (which is easy to implement, I think). With the GPU/accelerator model, all kernel executions should be completed or cancelled. My question is: who is responsible for this type of clean-up? Is it the user's responsibility?
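
To make the question concrete, here is a rough sketch of the cleanup I have in mind (the hook name recover_from_failure and its invocation from the MPI error handler are illustrative only, not an existing Fenix or Reinit interface):

    #include <setjmp.h>
    #include <cuda_runtime.h>       /* or the HIP / Level Zero equivalent */

    static jmp_buf recovery_point;  /* setjmp() on this is done around the call to resilient_main() */

    /* Hypothetical hook invoked from the MPI error handler once a failure
     * is detected. The open question: is this cleanup the user's job, the
     * accelerator/OpenMP runtime's, or the MPI library's? */
    void recover_from_failure(void)
    {
        /* GPU: in-flight kernels must complete (or be cancelled) before unwinding */
        cudaDeviceSynchronize();

        /* OpenMP: we must already be outside any parallel region here;
         * longjmp out of an active OpenMP construct is undefined behavior */

        longjmp(recovery_point, 1); /* unwind back to the setjmp point */
    }

Where exactly that synchronization goes, and who is required to issue it, is the part I would like us to pin down.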

Regards,
Keita

On 1/25/21, 9:20 AM, "mpiwg-ft on behalf of Ignacio Laguna via mpiwg-ft" <mpiwg-ft-bounces at lists.mpi-forum.org on behalf of mpiwg-ft at lists.mpi-forum.org> wrote:

    Michael:
    
    It's very simple. The high-level idea is to encapsulate the main
    function into a resilient_main function, which we can call again when a
    failure occurs. In our implementation, we use setjmp/longjmp
    semantics. As long as the encapsulation is done properly, branches higher
    up in the stack won't affect it.
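
    For reference, a minimal sketch of that structure (the names resilient_main
    and restart_point are illustrative, not the actual Reinit API):

        #include <setjmp.h>
        #include <mpi.h>

        static jmp_buf restart_point;

        /* all application work is encapsulated here */
        static int resilient_main(int argc, char **argv)
        {
            /* ... compute ... */
            return 0;
        }

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            /* an error handler installed on MPI_COMM_WORLD would call
             * longjmp(restart_point, 1) when a process failure is detected */
            setjmp(restart_point);               /* re-entry point after a failure */
            int rc = resilient_main(argc, argv); /* called again after every recovery */
            MPI_Finalize();
            return rc;
        }

    Because the longjmp discards every frame above main, whatever branches the
    application took inside resilient_main before the failure do not matter.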
    
    Ignacio
    
    On 1/25/21 9:13 AM, Steyer, Michael wrote:
    > Thanks Ignacio, I'd be very interested in learning how that approach works, especially the "always goes back to resilient_function call 1" part: how is that done without adding another branch on the call stack?
    > 
    > /Michael
    > 
    > -----Original Message-----
    > From: mpiwg-ft <mpiwg-ft-bounces at lists.mpi-forum.org> On Behalf Of Ignacio Laguna via mpiwg-ft
    > Sent: Monday, January 25, 2021 17:33
    > To: MPI WG Fault Tolerance and Dynamic Process Control working Group <mpiwg-ft at lists.mpi-forum.org>
    > Cc: Ignacio Laguna <lagunaperalt1 at llnl.gov>
    > Subject: Re: [mpiwg-ft] [Mpi-forum] FTWG Con Call Today
    > 
    > The model is that the app always goes back to resilient_function call 1 (the function cannot be called from more than one static call site in the program). Perhaps we can discuss that again.
    > 
    > Ignacio
    > 
    > 
    > On 1/25/21 8:25 AM, Wesley Bland via mpiwg-ft wrote:
    >> There was another question that came up in internal conversations
    >> around here with Reinit:
    >>
    >> What's going to happen to the call stack? E.g. MPI_Init -> ... ->
    >> resilient_function call 1 -> Failure -> ReInit -> resilient_function
    >> call 2 -> End of Work -> back to resilient_function call 1?
    >>
    >> On Mon, Jan 25, 2021 at 10:01 AM Ignacio Laguna
    >> <lagunaperalt1 at llnl.gov> wrote:
    >>
    >>      That works for me (I couldn't attend today either).
    >>
    >>      We are almost done with the new Reinit spec, but we have a few topics
    >>      we would like to discuss in the group: (1) using several error handlers
    >>      and how this is specified in the standard, (2) the state of MPI between
    >>      a failure and its recovery (how does ULFM do it? Perhaps Reinit can
    >>      re-use the same text?).
    >>
    >>      Thanks!
    >>
    >>      Ignacio
    >>
    >>      On 1/25/21 6:20 AM, Wesley Bland via mpi-forum wrote:
    >>       > Hi all,
    >>       >
    >>       > After talking to Tony, we're going to delay this discussion until
    >>      the
    >>       > next call on Feb 8. Today's call is cancelled.
    >>       >
    >>       > Thanks,
    >>       > Wes
    >>       >
    >>       > On Mon, Jan 25, 2021 at 8:15 AM work at wesbland.com
    >>       > <work at wesbland.com> wrote:
    >>       >
    >>       >     The Fault Tolerance Working Group’s weekly con call is today at
    >>       >     12:00 PM Eastern. Today's agenda:
    >>       >
    >>       >     * FA-MPI (Tony)
    >>       >     * Other updates (All)
    >>       >
    >>       >     If there's something else that people would like to discuss, please
    >>       >     just send an email to the WG so we can get it on the agenda.
    >>       >
    >>       >     Thanks,
    >>       >     Wes
    >>       >
    >>       >     Join from PC, Mac, Linux, iOS or Android:
    >>       > https://tennessee.zoom.us/j/632356722?pwd=lI4_169CGcewIumekTziMw
    >>       >          Password: mpiforum
    >>       >
    >>       >     Or iPhone one-tap (US Toll):  +16468769923,632356722#  or
    >>       >     +16699006833,632356722#
    >>       >
    >>       >     Or Telephone:
    >>       >          Dial:
    >>       >          +1 646 876 9923 (US Toll)
    >>       >          +1 669 900 6833 (US Toll)
    >>       >          Meeting ID: 632 356 722
    >>       >          International numbers available: https://zoom.us/u/6uINe
    >>       >
    >>       >     Or an H.323/SIP room system:
    >>       >          H.323: 162.255.37.11 (US West) or 162.255.36.11 (US East)
    >>       >          Meeting ID: 632 356 722
    >>       >          Password: 364216
    >>       >          SIP: 632356722 at zoomcrc.com
    >>       >          Password: 364216
    >>       >
    >>       >
    >>       > _______________________________________________
    >>       > mpi-forum mailing list
    >>       > mpi-forum at lists.mpi-forum.org
    >>       > https://lists.mpi-forum.org/mailman/listinfo/mpi-forum
    >>       >
    >>
    >>
    >> _______________________________________________
    >> mpiwg-ft mailing list
    >> mpiwg-ft at lists.mpi-forum.org
    >> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft
    >>
    > _______________________________________________
    > mpiwg-ft mailing list
    > mpiwg-ft at lists.mpi-forum.org
    > https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft
    > Intel Deutschland GmbH
    > Registered Address: Am Campeon 10-12, 85579 Neubiberg, Germany
    > Tel: +49 89 99 8853-0, www.intel.de
    > Managing Directors: Christin Eisenschmid, Gary Kershaw
    > Chairperson of the Supervisory Board: Nicole Lau
    > Registered Office: Munich
    > Commercial Register: Amtsgericht Muenchen HRB 186928
    > 
    _______________________________________________
    mpiwg-ft mailing list
    mpiwg-ft at lists.mpi-forum.org
    https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft
    


