[mpiwg-ft] [EXTERNAL] Re: FTWG Call Today

Teranishi, Keita knteran at sandia.gov
Wed Mar 15 15:15:13 CDT 2017


FT-WG,

I am thinking about presenting my experience porting scientific libraries with the Fenix and Reinit approaches at the Forum in September.  I could go to the June Forum, but I cannot attend the first 6/14 due to another commitment until 6/12.

Thanks,
Keita

On 3/15/17, 1:00 PM, "mpiwg-ft-bounces at lists.mpi-forum.org on behalf of Teranishi, Keita" <mpiwg-ft-bounces at lists.mpi-forum.org on behalf of knteran at sandia.gov> wrote:

    Ignacio,
    
    I see your point!  Yes, this is a viable approach: MPI_Reinit can provide a bulk transaction mechanism by taking a function pointer to main_resilient.  Todd’s example program is very clear.  As I did with Fenix, the “reinitialization” of a scientific library needs to be written separately.  The Fenix API provides callbacks (taking function pointers) to keep this clean, and I think the same can be done in the MPI_Reinit API, too.
      
    I agree with your other concern about signal handling.  It should be a topic for the next meeting.
    
    Thanks,
    Keita
    
    On 3/15/17, 11:58 AM, "Ignacio Laguna" <lagunaperalt1 at llnl.gov> wrote:
    
        Hi Keita,
        
        Yes and no :-) Sorry I was unclear in my explanation.
        
        There is the main function, and there is a main_resilient version, which is
        the one that contains most of the computation code. A pointer to this
        function is passed to a new MPI function, MPI_Reinit (so MPI_Init keeps
        its original semantics).
        
        Yes, some libraries call MPI_Init internally. I think that is not a 
        problem as long as the main_resilient does not contain calls to library 
        functions that initialize MPI. For example, main_resilient should not 
        contain PETSc_initialize() or BLACS_Init().
        
        Take a look at the C interface that Todd Gamblin wrote -- look at the 
        example.c:
        
        https://github.com/tgamblin/mpi-resilience
        
        Ignacio
        
        
        On 3/15/17 11:22 AM, Teranishi, Keita wrote:
        > Ignacio,
        >
        > Does your technique create a replacement for main() (say main_reinit()) that makes a setjmp() call inside?  It’s interesting.  Many scientific libraries call MPI_Init() inside their initialization functions (such as PETSc_initialize() and BLACS_Init()).  I am not 100% sure how PETSc_initialize() can return to the replacement for main().  Could you clarify the behavior of these functions that call MPI_Init()?
        >
        > BTW, Fenix_init() (including the SC14 version) is a macro that expands to three function calls, so the user cannot call it outside main() ☹.
        > Fenix_preinit();
        > setjmp();
        > Fenix_postinit();
        >
        > For this reason, when using PETSc with Fenix, I have to expose Fenix_init() to main(); I cannot put it inside petsc_initialize().  In the end, I wrote petsc_reinitialize() to modify the contents created by petsc_initialize().  If your approach works, I can put Fenix_init() and petsc_reinitialize_fenix() inside petsc_initialize(), making the code much cleaner.
        >
        > Main()
        > {
        >        petsc_initialize();  <= this is calling MPI_Init();
        >        Fenix_init();
        >        petsc_reinitialize_fenix();
        >        :
        >        :
        >        :
        > }
        >
        > Thanks,
        > Keita
        >
        >
        >
        > On 3/15/17, 10:46 AM, "mpiwg-ft-bounces at lists.mpi-forum.org on behalf of Ignacio Laguna" <mpiwg-ft-bounces at lists.mpi-forum.org on behalf of lagunaperalt1 at llnl.gov> wrote:
        >
        >     Hey Aurelien,
        >
        >     Thanks! I understand the concern.
        >
        >     For global-restart models like Reinit (and, I believe, for the SC14
        >     version of Fenix) this problem is solved by passing a reinit function
        >     pointer to MPI, which it then calls after initialization (this function
        >     is a replacement for main, and has the code that main originally
        >     contained). Since this reinit function stays on the stack (it never
        >     returns), we can always long jump there.
        >
        >     I think the main problem is that we cannot long jump from a signal
        >     handler, or more specifically it is undefined according to the C
        >     language. We would need to find another mechanism for long jumping after
        >     a signal handler is called as a result of a failure notification.
        >
        >     Ignacio
        >
        >
        >     On 3/15/17 8:41 AM, Aurelien Bouteiller wrote:
        >     >
        >     > Hey Ignacio,
        >     >
        >     > Murali wanted to touch base with you on that exact issue. The bottom line is
        >     > that the function that called setjmp must still be active when longjmp is
        >     > called, which means that you can only jump to a function you are nested in.
        >     > In many cases that means you can’t “hide” setjmp points in the
        >     > library, as they have to be called in the application function context
        >     > (so that they remain in your frame).
        >     >
        >     > Best,
        >     > Aurelien
        >     >
        >     >> On Mar 14, 2017, at 18:15, Ignacio Laguna <lagunaperalt1 at llnl.gov> wrote:
        >     >>
        >     >> Thanks for sharing the minutes.
        >     >>
        >     >> In the "scoped reinit-like approaches", there is the point of "still
        >     >> subject to the longjmp complication". Can folks comment on what the
        >     >> issue is with respect to setjmp/longjmp in global-restart approaches,
        >     >> such as Reinit and/or Fenix?
        >     >>
        >     >> Thanks!
        >     >>
        >     >> Ignacio
        >     >>
        >     >>
        >     >> On 3/14/17 1:49 PM, Aurelien Bouteiller wrote:
        >     >>> Minutes for the call have been posted here:
        >     >>> https://github.com/mpiwg-ft/ft-issues/wiki/2017-03-14
        >     >>>
        >     >>>> On Mar 14, 2017, at 15:00, Aurelien Bouteiller <bouteill at icl.utk.edu> wrote:
        >     >>>>
        >     >>>> Hi there,
        >     >>>>
        >     >>>> Aurelien Bouteiller is inviting you to a scheduled Zoom meeting.
        >     >>>>
        >     >>>> Topic: MPI FT WG
        >     >>>> Time: Mar 14, 2017 3:00 PM Eastern Time (US and Canada)
        >     >>>>
        >     >>>> Join from PC, Mac, Linux, iOS or
        >     >>>> Android: https://tennessee.zoom.us/j/607816420?pwd=MuG6Nboy9%2Fo%3D
        >     >>>>    Password: beef
        >     >>>>
        >     >>>> Or iPhone one-tap (US Toll):  +14086380968,607816420# or
        >     >>>> +16465588656,607816420#
        >     >>>>
        >     >>>> Or Telephone:
        >     >>>>    Dial: +1 408 638 0968 (US Toll) or +1 646 558 8656 (US Toll)
        >     >>>>    Meeting ID: 607 816 420
        >     >>>>    International numbers
        >     >>>> available: https://tennessee.zoom.us/zoomconference?m=fUOjmMyJwtMIeEsk8yo8CgLo3JR6yrTM
        >     >>>>
        >     >>>> Or an H.323/SIP room system:
        >     >>>>    H.323: 162.255.37.11 (US West) or 162.255.36.11 (US East)
        >     >>>>    Meeting ID: 607 816 420
        >     >>>>    Password: 463530
        >     >>>>
        >     >>>>    SIP: 607816420 at zoomcrc.com
        >     >>>>    Password: 463530
        >     >>>>
        >     >>>>
        >     >>>>
        >     >>>>> On Mar 14, 2017, at 10:54, Aurelien Bouteiller <bouteill at icl.utk.edu> wrote:
        >     >>>>>
        >     >>>>> Hi all,
        >     >>>>>
        >     >>>>> We have the FTWG call scheduled for today. I’d like to debrief the
        >     >>>>> latest MPI forum activities, and continue the discussion on
        >     >>>>> converging localized and globalized recovery.
        >     >>>>>
        >     >>>>> I attach here the slide I used during the WG time.
        >     >>>>> <20170228-mpiforum-errwg.pptx>
        >     >>>>>
        >     >>>>> We may also want to decide the time for our future meeting based on
        >     >>>>> the doodle poll initiated by Wesley a while back.
        >     >>>>> http://doodle.com/poll/s5uvmpux4nc6ki4y#table
        >     >>>>>
        >     >>>>> ===
        >     >>>>> Looking back at the notes from our last call in December, I believe
        >     >>>>> the TODO items are for Aurelien, Ignacio, and myself to flesh out the
        >     >>>>> three FT recovery proposals and then see how they would interact with
        >     >>>>> each other.
        >     >>>>>
        >     >>>>> * I believe Aurelien had some ideas about how to overcome some of the
        >     >>>>> problems raised at the last meeting. Aurelien, if you could put
        >     >>>>> together a slide or two that we could use for the discussion, that
        >     >>>>> would probably be helpful.
        >     >>>>> * I'm not sure of the status of Ignacio putting together some slides
        >     >>>>> for the reinit proposal. If I remember the meeting long ago in San
        >     >>>>> Jose, we just looked at a header. It might be nice to have something
        >     >>>>> a little more high level to point to.
        >     >>>>> * I still need to make the slides for the auto recovery strategy that
        >     >>>>> Martin proposed.
        >     >>>>>
        >     >>>>> Once that's done, we can see where these things interact and how
        >     >>>>> difficult it would be to support them together.
        >     >>>>>
        >     >>>>> Thoughts?
        >     >>>>> Wesley
        >     >>>>> _______________________________________________
        >     >>>>> mpiwg-ft mailing list
        >     >>>>> mpiwg-ft at lists.mpi-forum.org
        >     >>>>> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft
        >     >>>>
        >     >>>
        >     >>>
        >     >>>
        >     >>>
        >     >
        >
        
    


