[mpiwg-ft] [EXTERNAL] Re: FTWG Con Call Today

Teranishi, Keita knteran at sandia.gov
Wed Dec 21 01:27:19 CST 2016


Yes, ReInit and Fenix-1.0 have the same recovery model. They use longjmp
for global rollback and fix the MPI communicator at the end of the "Init"
call.  I am very happy to perform the feasibility studies of these three
(plus one) models.  I think that it will be great if we can explore the
feasibility through some empirical (prototyping) studies.
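
For illustration, the longjmp-based global rollback mentioned above can be sketched in plain C without MPI. This is only a minimal sketch under stated assumptions: every name in it (init_point, rollback_to_init, compute, run_with_rollback) is hypothetical and is not the Fenix or ReInit API, the failure is simulated rather than detected, and the place where communicator repair would happen is just a comment.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdio.h>

/* All names below are hypothetical; this is not the Fenix or ReInit API. */
static jmp_buf init_point;      /* context saved inside the "init" step */
static int failures_injected;   /* how many simulated failures have fired */

/* Stand-in for the error-handler path: abandon the current call stack
 * and resume at the saved init point. */
static void rollback_to_init(void)
{
    longjmp(init_point, 1);
}

/* Application work loop. A process failure is simulated once, at step 3. */
static int compute(int nsteps)
{
    int done = 0;
    for (int step = 0; step < nsteps; ++step) {
        if (step == 3 && failures_injected == 0) {
            failures_injected = 1;
            rollback_to_init();   /* never returns */
        }
        ++done;
    }
    return done;
}

/* Returns the number of rollbacks taken before the work completed. */
int run_with_rollback(int nsteps)
{
    /* volatile: the value must survive the longjmp back into this frame */
    volatile int rollbacks = 0;

    assert(nsteps >= 0);
    failures_injected = 0;

    if (setjmp(init_point) != 0) {
        /* Reached via longjmp. In a real library, communicator repair and
         * application recovery callbacks would run here (assumption). */
        rollbacks = rollbacks + 1;
    }

    int done = compute(nsteps);
    printf("completed %d steps after %d rollback(s)\n", done, rollbacks);
    return rollbacks;
}
```

Calling run_with_rollback(5) hits the simulated failure once and restarts the work from the init point; run_with_rollback(2) never reaches the failing step and takes no rollback.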

As for the 4th (ReInit/Fenix-1.0) model, we should have a clear definition
of MPI communicator recovery, including subcommunicators.  In order to
utilize the next generation checkpoint library (ECP's multi-level
checkpointing project) or accommodate application-specific recovery schemes,
MPI_Comm should provide some information about its past (history of failures
or changes in rank, comm_size, etc.) as well as its current state.  I am
hoping that our experience with Fenix will help to design a new spec.
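
To make the "past as well as current state" idea concrete, here is a minimal sketch in C of the kind of record a communicator could expose. Everything in it is an assumption for illustration: comm_state, comm_history, comm_history_record and comm_history_rank_changed are hypothetical names, nothing like them exists in MPI, ULFM or Fenix, and a real design would have to cover subcommunicators as well.

```c
#include <assert.h>
#include <stddef.h>

#define COMM_HISTORY_MAX 16   /* arbitrary cap chosen for this sketch */

/* Hypothetical snapshot of a communicator after one recovery. */
typedef struct {
    int rank;   /* this process's rank after the recovery */
    int size;   /* communicator size after the recovery */
    int lost;   /* processes lost relative to the previous snapshot */
} comm_state;

/* Hypothetical log; states[count - 1] is the current state. */
typedef struct {
    comm_state states[COMM_HISTORY_MAX];
    size_t count;
} comm_history;

/* Append a new snapshot. Returns 0 on success, -1 if the log is full. */
int comm_history_record(comm_history *h, int rank, int size)
{
    assert(h != NULL);
    if (h->count >= COMM_HISTORY_MAX)
        return -1;

    /* The first snapshot has nothing to compare against, so lost = 0. */
    int prev_size = h->count ? h->states[h->count - 1].size : size;
    comm_state s = { rank, size, prev_size > size ? prev_size - size : 0 };
    h->states[h->count++] = s;
    return 0;
}

/* Nonzero if this process's rank changed in the most recent recovery. */
int comm_history_rank_changed(const comm_history *h)
{
    assert(h != NULL);
    if (h->count < 2)
        return 0;
    return h->states[h->count - 1].rank != h->states[h->count - 2].rank;
}
```

A checkpoint library could consult such a log after recovery to decide, for example, whether the data it holds still belongs to the same rank.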

Keita Teranishi
Principal Member of Technical Staff
Scalable Modeling and Analysis Systems
Sandia National Laboratories
Livermore, CA 94551
+1 (925) 294-3738

On 12/20/16, 4:55 PM, "Ignacio Laguna" <lagunaperalt1 at llnl.gov> wrote:

>Hi Keita,
>I think we all agree that there is no silver bullet solution for the FT
>problem and that each recovery model (whether it's ULFM, Reinit, Fenix,
>or ULFM+autorecovery) works for some codes but doesn't work for others,
>and that one of the solutions to cover all applications is to allow
>multiple recovery models.
>In the last telecon we discussed two ways to do that: (a) all models are
>compatible with each other; (b) they are not compatible, thus the
>application has to select the model to be used (which implies libraries
>used by the application have to support that model as well). The ideal
>case is (a), but we are not sure if it's possible, thus we are going to
>discuss each model in detail to explore that possibility. I believe case
>(b) is always a possibility, in which case you can still run Fenix on
>top of ULFM in that situation.
>BTW, correct me if I'm wrong, but Reinit and Fenix share (at a
>high-level) the same idea of global backward recovery with longjumps to
>reinject execution; thus we should call the 4th option perhaps
>On 12/20/16 3:06 PM, Teranishi, Keita wrote:
>> All,
>> Throughout the discussion, I am a bit worried about making MPI bigger
>> than a message passing interface because I wish MPI to serve as a good
>> abstraction of a user-friendly transport layer.  Fenix is intended to
>> leverage the minimalist approach of MPI-FT (ULFM today) to cover most of
>> the online recovery models for parallel programs using MPI.  The current
>> version is designed to support the SPMD (Communicating Sequential
>> Processes) model, but we wish to support other models including
>> Master-Worker, Distributed Asynchronous Many Task (AMT) and
>> Message-Logging.
>> ·ULFM: We have requested non-blocking communicator recovery as well as
>> non-blocking comm_dup and comm_split, etc.   ULFM already provides a good
>> mechanism to serve master-worker type recovery like UQ, model reduction
>> and a certain family of eigenvalue solvers.  I wish to have finer
>> control over revocation because it is possible to keep certain
>> connections among surviving processes (for master-worker or
>> task-parallel computing), but it might be too difficult.
>> ·ULFM + Auto recovery: I need clarification from Wesley (as my knowledge
>> is most likely wrong... but let me continue based on my assumption).
>> Fenix assumes that failure happens at a single or a small number of
>> processes.  In this model, auto-recovery could serve as un-coordinated
>> recovery because no comm_shrink call is used to fix the communicator.
>> This could help message replay in the uncoordinated recovery model.  For
>> example, recovery is never manifested as "Failure" to the surviving
>> ranks, making particular message passing calls very slow.   For SPMD
>> model, adaptation is so challenging as the user needs to write how to
>> recover the lost state of failed processes.  However, I can see a great
>> benefit for implementing resilient task parallel programming model.
>> ·Communicator with hole: Master-Worker type applications will benefit
>> from this when making collectives to gather the data available.
>> ·MPI_ReInit:  MPI_ReInit is very close to the current Fenix model.  We
>> have written the API specification (see attached) to support the same
>> type of online recovery (global rollback upon process failure).  The
>> code is implemented using MPI-ULFM, and we have seen some issues with
>> MPI-ULFM that make multiple communicator recovery convoluted.  We used
>> PMPI to hide all the details of error handling, garbage collection and
>> communicator recovery. The rollback (to Fenix_Init) is performed through
>> longjmp.  Nice features of Fenix are (1) an idea of *resilient
>> communicator* that allows the users to specify which communicator needs
>> to be automatically fixed and (2) *callback functions* to assist
>> application-specific recovery followed by communicator recovery.  We
>> originally did not intend Fenix to be part of the MPI standard because we
>> want the role of MPI confined within "Message Passing" and do not want
>> to delay the MPI standardization discussions.    My understanding with
>> MPI_ReInit is standardizing online-rollback recovery and keeping
>> PMPI/QMPI layer clean through a tight binding with the layers invisible
>> to typical MPI users (or tool developers) --- Ignacio, please correct me
>> if I am wrong.  My biggest concern of MPI_ReInit is that defining
>> rollback model by Message Passing Library may violate the original
>> design philosophy of MPI (again this is the reason why we did not
>> propose Fenix as MPI standard).  Another concern is that it might be
>> difficult to keep other recovery options open, but it gets much more
>> flexible with a few knobs in the APIs.  I think the latter is easy to
>> fix with some switches in APIs.  I think we can figure out the options
>> as we discuss further.
>> Thanks,
>> Keita
>> *From: *"Bland, Wesley" <wesley.bland at intel.com>
>> *Date: *Tuesday, December 20, 2016 at 1:48 PM
>> *To: *MPI WG Fault Tolerance and Dynamic Process Control working Group
>> <mpiwg-ft at lists.mpi-forum.org>, "Teranishi, Keita" <knteran at sandia.gov>
>> *Subject: *Re: [mpiwg-ft] [EXTERNAL] Re: FTWG Con Call Today
>> Probably here since we don't have an issue for this discussion. If you
>> want to open issues in our working group's repository
>> (github.com/mpiwg-ft/ft-issues), that's probably fine.
>> On December 20, 2016 at 3:47:25 PM, Teranishi, Keita (knteran at sandia.gov
>> <mailto:knteran at sandia.gov>) wrote:
>>     Wesley,
>>     Should I do here or github issues?
>>     Thanks,
>>     Keita
>>     *From: *"Bland, Wesley" <wesley.bland at intel.com>
>>     *Date: *Tuesday, December 20, 2016 at 1:43 PM
>>     *To: *MPI WG Fault Tolerance and Dynamic Process Control working
>>     Group <mpiwg-ft at lists.mpi-forum.org>, "Teranishi, Keita"
>>     <knteran at sandia.gov>
>>     *Subject: *Re: [mpiwg-ft] [EXTERNAL] Re: FTWG Con Call Today
>>     You don't have to wait. :) If you have comments/concerns, you can
>>     raise them here too.
>>     On December 20, 2016 at 3:38:47 PM, Teranishi, Keita
>>     (knteran at sandia.gov <mailto:knteran at sandia.gov>) wrote:
>>         All,
>>         Sorry, I could not make it today.  I will definitely join the
>>         meeting next time to make comments/suggestions on the three
>>         items (ULFM, ULFM+Auto, and ReInit) from Fenix perspective.
>>         Thanks,
>>         Keita
>>         *From: *<mpiwg-ft-bounces at lists.mpi-forum.org> on behalf of
>>         "Bland, Wesley" <wesley.bland at intel.com>
>>         *Reply-To: *MPI WG Fault Tolerance and Dynamic Process Control
>>         working Group <mpiwg-ft at lists.mpi-forum.org>
>>         *Date: *Tuesday, December 20, 2016 at 1:29 PM
>>         *To: *FTWG <mpiwg-ft at lists.mpi-forum.org>
>>         *Subject: *[EXTERNAL] Re: [mpiwg-ft] FTWG Con Call Today
>>         The notes from today's call are posted on the wiki:
>>         https://github.com/mpiwg-ft/ft-issues/wiki/2016-12-20
>>         Those who have specific items, please make progress on those
>>         between now and our next meeting. We will be cancelling the Jan
>>         3 call due to the holiday. The next call will be on Jan 17.
>>         Thanks,
>>         Wesley
>>         On December 20, 2016 at 8:15:06 AM, Bland, Wesley
>>         (wesley.bland at intel.com <mailto:wesley.bland at intel.com>) wrote:
>>             The Fault Tolerance Working Group's biweekly con call is
>>             today at 3:00 PM Eastern. Today's agenda:
>>             * Recap of face to face meeting
>>             * Go over existing tickets
>>             * Discuss concerns with ULFM and path forward
>>             Thanks,
>>             Wesley
>>             Join online meeting
>>             <https://meet.intel.com/wesley.bland/GHHKQ79Y>
>>             https://meet.intel.com/wesley.bland/GHHKQ79Y
>>             Join by Phone
>>             +1(916)356-2663 (or your local bridge access #) Choose bridge 5.
>>             Find a local number <https://dial.intel.com>
>>             Conference ID: 757343533
>>         _______________________________________________
>>         mpiwg-ft mailing list
>>         mpiwg-ft at lists.mpi-forum.org
>>         https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft
