[mpiwg-ft] [EXTERNAL] Re: FTWG Call Today

Teranishi, Keita knteran at sandia.gov
Wed Mar 15 13:22:23 CDT 2017


Ignacio,

Does your technique create a replacement of main() (say main_reinit()) that makes a setjmp() call inside? It’s interesting. Many scientific libraries call MPI_Init() inside their initialization functions (such as PETSc_initialize() and BLACS_Init()). I am not 100% sure how PETSc_initialize() can return to the replacement of main(). Could you clarify the behavior of these functions that call MPI_Init()?

BTW, in Fenix (including the SC14 version), Fenix_init() is a macro that expands to three function calls, so the user cannot call it outside main() ☹.
Fenix_preinit();
setjmp();
Fenix_postinit();

For this reason, when using PETSc with Fenix, I have to expose Fenix_init() to main(); I cannot put it inside petsc_initialize(). In the end, I wrote petsc_reinitialize() to modify the contents created by petsc_initialize(). If your approach works, I can put Fenix_init() and petsc_reinitialize_fenix() inside petsc_initialize(), making the code much cleaner.

main()
{
       petsc_initialize();  <= this calls MPI_Init()
       Fenix_init();
       petsc_reinitialize_fenix();
       :
       :
       :
}
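
A minimal sketch of why the setjmp() call is expanded by a macro instead of being wrapped in a function. All names here (DEMO_INIT, demo_preinit, demo_postinit, simulate_failure, run_app) are hypothetical stand-ins and not the real Fenix API; the only point is that the macro places setjmp() in the caller's frame, which is still live when a deeper function later calls longjmp().

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical stand-ins for illustration; NOT the real Fenix API. */
static jmp_buf recovery_point;
static int recoveries;

static void demo_preinit(void) { /* e.g. install error handlers */ }

static void demo_postinit(int resumed)
{
    if (resumed)
        recoveries++;            /* count how often we came back via longjmp */
}

/* Expands in the caller, so setjmp() records the caller's frame.
 * If this were a function, its frame would be gone once it returned,
 * and a later longjmp() to recovery_point would be undefined behavior. */
#define DEMO_INIT()                              \
    do {                                         \
        demo_preinit();                          \
        if (setjmp(recovery_point) != 0)         \
            demo_postinit(1);                    \
        else                                     \
            demo_postinit(0);                    \
    } while (0)

/* Stands in for a failure detected somewhere deeper in the call chain. */
static void simulate_failure(void)
{
    longjmp(recovery_point, 1);
}

/* Plays the role of main(): the frame that must stay live. */
int run_app(int failures_to_inject)
{
    /* volatile: modified between setjmp() and longjmp() */
    volatile int injected = 0;

    assert(failures_to_inject >= 0);
    recoveries = 0;

    DEMO_INIT();                 /* execution resumes here after each jump */

    if (injected < failures_to_inject) {
        injected++;
        simulate_failure();      /* jumps back into run_app's frame */
    }
    return recoveries;
}
```

Calling run_app(2) from main() injects two simulated failures and returns the number of times execution resumed at the recovery point.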

Thanks,
Keita



On 3/15/17, 10:46 AM, "mpiwg-ft-bounces at lists.mpi-forum.org on behalf of Ignacio Laguna" <mpiwg-ft-bounces at lists.mpi-forum.org on behalf of lagunaperalt1 at llnl.gov> wrote:

    Hey Aurelien,
    
    Thanks! I understand the concern.
    
    For global-restart models like Reinit (and, I believe, for the SC14 
    version of Fenix) this problem is solved by passing a reinit function 
    pointer to MPI, which it then calls after initialization (this function 
    is a replacement of main, and has the code that main originally 
    contained). Since this reinit function is kept on the stack (it never 
    returns), we can always long jump there.
    
    I think the main problem is that we cannot long jump from a signal 
    handler; more specifically, doing so is undefined behavior according 
    to the C language. We would need to find another mechanism for long 
    jumping after a signal handler is called as a result of a failure 
    notification.
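
One possible mechanism, sketched here only as an assumption and not as something the thread settled on, is to defer the jump: the handler records the failure in a volatile sig_atomic_t flag, and ordinary code calls longjmp() later at a polling point. The names on_failure_signal, check_for_failure and run_deferred_jump_demo are hypothetical, SIGTERM stands in for whatever notification a real runtime would use, and the approach only works if the application reaches a polling point reasonably often.

```c
#include <assert.h>
#include <setjmp.h>
#include <signal.h>

static volatile sig_atomic_t failure_pending = 0;
static jmp_buf restart_point;

/* The handler only sets a flag; it does not call longjmp(). */
static void on_failure_signal(int signo)
{
    (void)signo;
    failure_pending = 1;
}

/* Hypothetical polling hook, called from ordinary (non-handler) code. */
static void check_for_failure(void)
{
    if (failure_pending) {
        failure_pending = 0;
        longjmp(restart_point, 1);   /* jump from normal context */
    }
}

int run_deferred_jump_demo(void)
{
    /* volatile: modified between setjmp() and longjmp() */
    volatile int restarts = 0;

    if (setjmp(restart_point) != 0)
        restarts++;

    /* (Re)install on every pass, since signal() may reset the handler. */
    if (signal(SIGTERM, on_failure_signal) == SIG_ERR)
        return -1;

    if (restarts < 2) {
        raise(SIGTERM);              /* simulated failure notification */
        assert(failure_pending == 1);
        check_for_failure();         /* safe point: performs the jump */
    }

    signal(SIGTERM, SIG_DFL);        /* restore default handling */
    return restarts;
}
```

The trade-off of this design is latency: recovery starts only when the next polling point is reached, not at the moment the notification arrives.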
    
    Ignacio
    
    
    On 3/15/17 8:41 AM, Aurelien Bouteiller wrote:
    >
    > Hey Ignacio,
    >
    > Murali wanted to touch base with you on that exact issue. The bottom line
    > is that the function that called setjmp must still be on the stack when
    > the longjmp happens, which means that you can jump only to a function in
    > which you are nested. In many cases that means you can’t “hide” setjmp
    > points in the library, as they have to be called in the application
    > function context (so that they remain in your frame).
    >
    > Best,
    > Aurelien
    >
    >> On Mar 14, 2017, at 18:15, Ignacio Laguna <lagunaperalt1 at llnl.gov> wrote:
    >>
    >> Thanks for sharing the minutes.
    >>
    >> In the "scoped reinit-like approaches", there is the point of "still
    >> subject to the longjmp complication". Can folks comment on what the
    >> issue is with respect to setjmp/longjmp in global-restart approaches,
    >> such as Reinit and/or Fenix?
    >>
    >> Thanks!
    >>
    >> Ignacio
    >>
    >>
    >> On 3/14/17 1:49 PM, Aurelien Bouteiller wrote:
    >>> Minutes for the call have been posted here:
    >>> https://github.com/mpiwg-ft/ft-issues/wiki/2017-03-14
    >>>
    >>>> On Mar 14, 2017, at 15:00, Aurelien Bouteiller <bouteill at icl.utk.edu> wrote:
    >>>>
    >>>> Hi there,
    >>>>
    >>>> Aurelien Bouteiller is inviting you to a scheduled Zoom meeting.
    >>>>
    >>>> Topic: MPI FT WG
    >>>> Time: Mar 14, 2017 3:00 PM Eastern Time (US and Canada)
    >>>>
    >>>> Join from PC, Mac, Linux, iOS or
    >>>> Android: https://tennessee.zoom.us/j/607816420?pwd=MuG6Nboy9%2Fo%3D
    >>>>    Password: beef
    >>>>
    >>>> Or iPhone one-tap (US Toll):  +14086380968,607816420# or
    >>>> +16465588656,607816420#
    >>>>
    >>>> Or Telephone:
    >>>>    Dial: +1 408 638 0968 (US Toll) or +1 646 558 8656 (US Toll)
    >>>>    Meeting ID: 607 816 420
    >>>>    International numbers
    >>>> available: https://tennessee.zoom.us/zoomconference?m=fUOjmMyJwtMIeEsk8yo8CgLo3JR6yrTM
    >>>>
    >>>> Or an H.323/SIP room system:
    >>>>    H.323: 162.255.37.11 (US West) or 162.255.36.11 (US East)
    >>>>    Meeting ID: 607 816 420
    >>>>    Password: 463530
    >>>>
    >>>>    SIP: 607816420 at zoomcrc.com
    >>>>    Password: 463530
    >>>>
    >>>>
    >>>>
    >>>>> On Mar 14, 2017, at 10:54, Aurelien Bouteiller <bouteill at icl.utk.edu> wrote:
    >>>>>
    >>>>> Hi all,
    >>>>>
    >>>>> We have the FTWG call scheduled for today. I’d like to debrief the
    >>>>> latest MPI forum activities, and continue the discussion on
    >>>>> converging localized and globalized recovery.
    >>>>>
    >>>>> I attach here the slide I used during the WG time.
    >>>>> <20170228-mpiforum-errwg.pptx>
    >>>>>
    >>>>> We may also want to decide the time for our future meeting based on
    >>>>> the doodle poll initiated by Wesley a while back.
    >>>>> http://doodle.com/poll/s5uvmpux4nc6ki4y#table
    >>>>>
    >>>>> ===
    >>>>> Looking back at the notes from our last call in December, I believe
    >>>>> the TODO items are for Aurelien, Ignacio, and myself to flesh out the
    >>>>> three FT recovery proposals and then see how they would interact with
    >>>>> each other.
    >>>>>
    >>>>> * I believe Aurelien had some ideas about how to overcome some of the
    >>>>> problems raised at the last meeting. Aurelien, if you could put
    >>>>> together a slide or two that we could use for the discussion, that
    >>>>> would probably be helpful.
    >>>>> * I'm not sure of the status of Ignacio putting together some slides
    >>>>> for the reinit proposal. If I remember the meeting long ago in San
    >>>>> Jose, we just looked at a header. It might be nice to have something
    >>>>> a little more high level to point to.
    >>>>> * I still need to make the slides for the auto recovery strategy that
    >>>>> Martin proposed.
    >>>>>
    >>>>> Once that's done, we can see where these things interact and how
    >>>>> difficult it would be to support them together.
    >>>>>
    >>>>> Thoughts?
    >>>>> Wesley
    >>>>> _______________________________________________
    >>>>> mpiwg-ft mailing list
    >>>>> mpiwg-ft at lists.mpi-forum.org
    >>>>> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft
    >>>>
    >>>
    >>>
    >>>
    >>>
    >


