[mpiwg-ft] [EXTERNAL] Re: FTWG Con Call Today
Teranishi, Keita
knteran at sandia.gov
Wed Dec 21 01:27:19 CST 2016
Ignacio,
Yes, ReInit and Fenix-1.0 have the same recovery model. They use longjmp
for global rollback and repair the MPI communicators at the end of the
"Init" call. I am very happy to perform feasibility studies of these three
(plus one) models. I think it would be great if we could explore the
feasibility through some empirical (prototyping) studies.
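
To make the model concrete, here is a minimal sketch of the control flow
(the names are placeholders, not the actual Fenix or ReInit API): an error
handler longjmps back to a rollback point established right after init,
once the library has repaired the communicator.

    #include <mpi.h>
    #include <setjmp.h>
    #include <stdio.h>

    /* Sketch only: illustrates the global-rollback idea; these are NOT
     * the real Fenix/ReInit entry points. */
    static jmp_buf rollback_point;

    /* Attached as the communicator's error handler.  A real library
     * would first repair the communicator (e.g. with the ULFM
     * shrink/respawn/merge sequence) before jumping back. */
    static void rollback_on_failure(MPI_Comm *comm, int *errcode, ...)
    {
        (void)comm; (void)errcode;
        longjmp(rollback_point, 1);
    }

    int main(int argc, char **argv)
    {
        MPI_Comm comm;
        MPI_Errhandler eh;

        MPI_Init(&argc, &argv);
        MPI_Comm_dup(MPI_COMM_WORLD, &comm);
        MPI_Comm_create_errhandler(rollback_on_failure, &eh);
        MPI_Comm_set_errhandler(comm, eh);

        if (setjmp(rollback_point) != 0) {
            /* Reached after a failure: survivors (and replacements)
             * resume here, reattach to the repaired communicator, and
             * reload their last checkpoint. */
            fprintf(stderr, "rolled back to the init point\n");
        }

        /* ... checkpoint periodically, compute on 'comm' ... */

        MPI_Finalize();
        return 0;
    }
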
As for the 4th (ReInit/Fenix-1.0) model, we should have a clear definition
of MPI communicator recovery, including subcommunicators. In order to
utilize the next-generation checkpoint library (ECP's multi-level
checkpointing project) or accommodate application-specific recovery
schemes, MPI_Comm should expose some information about its past (failure
history, changes in rank, comm size, etc.) as well as its current state. I
am hoping that our experience with Fenix will help in designing a new spec.
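
For example, one purely hypothetical way to expose such history would be
through communicator attributes. The keyval below does not exist in MPI or
ULFM today; it only illustrates the kind of query a multi-level checkpoint
library would want to make:

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical keyval standing in for "how often has this
     * communicator been repaired"; declared extern so the sketch
     * compiles, but it is not a real MPI symbol. */
    extern int MPIX_COMM_NUM_REPAIRS;

    /* Pick a recovery level based on the communicator's failure
     * history. */
    void choose_recovery_level(MPI_Comm comm)
    {
        int *repairs, flag;

        MPI_Comm_get_attr(comm, MPIX_COMM_NUM_REPAIRS, &repairs, &flag);
        if (flag && *repairs > 0)
            printf("comm repaired %d time(s): reload from partner copy\n",
                   *repairs);
        else
            printf("no recorded failures: continue from in-memory state\n");
    }
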
Thanks,
---------------------------------------------------------------------------
Keita Teranishi
Principal Member of Technical Staff
Scalable Modeling and Analysis Systems
Sandia National Laboratories
Livermore, CA 94551
+1 (925) 294-3738
On 12/20/16, 4:55 PM, "Ignacio Laguna" <lagunaperalt1 at llnl.gov> wrote:
>Hi Keita,
>
>I think we all agree that there is no silver bullet solution for the FT
>problem and that each recovery model (whether it's ULFM, Reinit, Fenix,
>or ULFM+autorecovery) works for some codes but doesn't work for others,
>and that one of the solutions to cover all applications is to allow
>multiple recovery models.
>
>In the last telecon we discussed two ways to do that: (a) all models are
>compatible with each other; (b) they are not compatible, thus the
>application has to select the model to be used (which implies libraries
>used by the application have to support that model as well). The ideal
>case is (a), but we are not sure if it's possible, thus we are going to
>discuss each model in detail to explore that possibility. I believe case
>(b) is always a possibility, in which case you can still run Fenix on
>top of ULFM.
>
>BTW, correct me if I'm wrong, but Reinit and Fenix share (at a
>high level) the same idea of global backward recovery with longjmps to
>reinject execution; thus we should perhaps call the 4th option
>Reinit/Fenix.
>
>Ignacio
>
>
>On 12/20/16 3:06 PM, Teranishi, Keita wrote:
>> All,
>>
>> Throughout the discussion, I have been a bit worried about making MPI
>> bigger than a message passing interface, because I wish MPI to serve as
>> a good abstraction of a user-friendly transport layer. Fenix is intended
>> to leverage the minimalist approach of MPI-FT (ULFM today) to cover most
>> online recovery models for parallel programs using MPI. The current
>> version is designed to support the SPMD (Communicating Sequential
>> Processes) model, but we wish to support other models as well, including
>> Master-Worker, Distributed Asynchronous Many-Task (AMT), and
>> Message-Logging.
>>
>> ·ULFM: We have requested non-blocking communicator recovery as well as
>> non-blocking comm_dup, comm_split, etc. ULFM already provides a good
>> mechanism for master-worker-type recovery, as in UQ, model reduction,
>> and a certain family of eigenvalue solvers. I wish to have finer
>> control over revocation, because it should be possible to keep certain
>> connections among surviving processes (for master-worker or
>> task-parallel computing), but it might be too difficult.
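>>
>> As a rough sketch of what I mean (using the MPIX_ names from the current
>> ULFM prototype; the helpers such as requeue_tasks_of are hypothetical
>> application code, and error checking is elided):
>>
>>     #include <mpi.h>
>>     #include <mpi-ext.h>   /* ULFM prototype extensions (MPIX_*) */
>>
>>     #define TAG_RESULT 1
>>
>>     /* Application-level bookkeeping (hypothetical helpers). */
>>     extern int  work_remaining(void);
>>     extern void requeue_tasks_of(MPI_Group failed);
>>     extern void record_result(int worker, double value);
>>
>>     /* Master loop: on a worker failure, acknowledge it, re-queue its
>>      * work, and keep receiving from the survivors -- no comm_shrink
>>      * is needed. */
>>     void master_loop(MPI_Comm comm)
>>     {
>>         MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
>>         while (work_remaining()) {
>>             double result;
>>             MPI_Status st;
>>             int rc = MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
>>                               TAG_RESULT, comm, &st);
>>             int eclass;
>>             MPI_Error_class(rc, &eclass);
>>             if (eclass == MPIX_ERR_PROC_FAILED) {
>>                 MPI_Group failed;
>>                 MPIX_Comm_failure_ack(comm);
>>                 MPIX_Comm_failure_get_acked(comm, &failed);
>>                 requeue_tasks_of(failed);    /* survivors take over */
>>                 MPI_Group_free(&failed);
>>                 continue;
>>             }
>>             record_result(st.MPI_SOURCE, result);
>>         }
>>     }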
>>
>> ·ULFM + Auto recovery: I need clarification from Wesley (my
>> understanding is most likely wrong... but let me continue based on my
>> assumption). Fenix assumes that failures hit a single process or a
>> small number of processes. In this model, auto-recovery could serve as
>> uncoordinated recovery, because no comm_shrink call is used to fix the
>> communicator. This could help with message replay in an uncoordinated
>> recovery model. For example, recovery is never manifested as a
>> "Failure" to the surviving ranks; particular message passing calls just
>> become very slow. For the SPMD model, adaptation is challenging because
>> the user needs to write how to recover the lost state of the failed
>> processes. However, I can see a great benefit for implementing a
>> resilient task-parallel programming model.
>>
>> ·Communicator with holes: Master-Worker-type applications will benefit
>> from this when they use collectives to gather the data that is still
>> available.
>>
>> ·MPI_ReInit: MPI_ReInit is very close to the current Fenix model. We
>> have written the API specification (see attached) to support the same
>> type of online recovery (global rollback upon process failure). The
>> code is implemented using MPI-ULFM, and we have seen some issues with
>> MPI-ULFM that make recovering multiple communicators convoluted. We used
>> PMPI to hide all the details of error handling, garbage collection, and
>> communicator recovery. The rollback (to Fenix_Init) is performed through
>> longjmp. Nice features of Fenix are (1) the idea of a *resilient
>> communicator*, which lets users specify which communicators need to be
>> fixed automatically, and (2) *callback functions*, invoked after
>> communicator recovery, to assist application-specific recovery. We did
>> not originally intend Fenix to be part of the MPI standard, because we
>> want the role of MPI confined to "Message Passing" and do not want to
>> delay the MPI standardization discussions. My understanding is that
>> MPI_ReInit standardizes online rollback recovery and keeps the PMPI/QMPI
>> layer clean through a tight binding with layers invisible to typical MPI
>> users (or tool developers) --- Ignacio, please correct me if I am wrong.
>> My biggest concern about MPI_ReInit is that defining a rollback model in
>> a message passing library may violate the original design philosophy of
>> MPI (again, this is the reason why we did not propose Fenix for the MPI
>> standard). Another concern is that it might be difficult to keep other
>> recovery options open, but I think that is easy to fix with a few
>> switches (knobs) in the APIs. I think we can figure out the options as
>> we discuss further.
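>>
>> To show the shape of that usage model, here is a rough sketch (the
>> identifiers are placeholders, loosely modeled on Fenix-1.0; please see
>> the attached spec for the real names and signatures):
>>
>>     #include <mpi.h>
>>
>>     /* Hypothetical library entry points for this sketch. */
>>     extern void ftx_init(int *argc, char ***argv, MPI_Comm *world);
>>     extern void ftx_register_comm(MPI_Comm comm);  /* auto-repair this */
>>     extern void ftx_register_callback(void (*cb)(MPI_Comm, void *),
>>                                       void *arg);
>>
>>     /* Invoked by the library after it has repaired the registered
>>      * communicators; the application restores its own state here. */
>>     static void restore_app_state(MPI_Comm repaired, void *arg)
>>     {
>>         (void)repaired; (void)arg;
>>         /* reload checkpointed data, rebuild derived state, ... */
>>     }
>>
>>     int main(int argc, char **argv)
>>     {
>>         MPI_Comm world, row;
>>         int rank;
>>
>>         /* Assumed to call MPI_Init, set the rollback point, and repair
>>          * 'world' after every failure; execution re-enters here (via
>>          * longjmp) after each recovery. */
>>         ftx_init(&argc, &argv, &world);
>>
>>         MPI_Comm_rank(world, &rank);
>>         MPI_Comm_split(world, rank % 2, rank, &row);
>>         ftx_register_comm(row);               /* resilient communicator */
>>         ftx_register_callback(restore_app_state, NULL);
>>
>>         /* ... checkpoint + compute ... */
>>         MPI_Finalize();
>>         return 0;
>>     }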
>>
>> Thanks,
>>
>> Keita
>>
>> *From: *"Bland, Wesley" <wesley.bland at intel.com>
>> *Date: *Tuesday, December 20, 2016 at 1:48 PM
>> *To: *MPI WG Fault Tolerance and Dynamic Process Control working Group
>> <mpiwg-ft at lists.mpi-forum.org>, "Teranishi, Keita" <knteran at sandia.gov>
>> *Subject: *Re: [mpiwg-ft] [EXTERNAL] Re: FTWG Con Call Today
>>
>> Probably here since we don't have an issue for this discussion. If you
>> want to open issues in our working group's repository
>> (github.com/mpiwg-ft/ft-issues), that's probably fine.
>>
>> On December 20, 2016 at 3:47:25 PM, Teranishi, Keita (knteran at sandia.gov
>> <mailto:knteran at sandia.gov>) wrote:
>>
>> Wesley,
>>
>> Should I do it here or in GitHub issues?
>>
>> Thanks,
>>
>> Keita
>>
>> *From: *"Bland, Wesley" <wesley.bland at intel.com>
>> *Date: *Tuesday, December 20, 2016 at 1:43 PM
>> *To: *MPI WG Fault Tolerance and Dynamic Process Control working
>> Group <mpiwg-ft at lists.mpi-forum.org>, "Teranishi, Keita"
>> <knteran at sandia.gov>
>> *Subject: *Re: [mpiwg-ft] [EXTERNAL] Re: FTWG Con Call Today
>>
>> You don't have to wait. :) If you have comments/concerns, you can
>> raise them here too.
>>
>> On December 20, 2016 at 3:38:47 PM, Teranishi, Keita
>> (knteran at sandia.gov <mailto:knteran at sandia.gov>) wrote:
>>
>> All,
>>
>> Sorry, I could not make it today. I will definitely join the
>> meeting next time to make comments/suggestions on the three
>> items (ULFM, ULFM+Auto, and ReInit) from Fenix perspective.
>>
>> Thanks,
>>
>> Keita
>>
>> *From: *<mpiwg-ft-bounces at lists.mpi-forum.org> on behalf of
>> "Bland, Wesley" <wesley.bland at intel.com>
>> *Reply-To: *MPI WG Fault Tolerance and Dynamic Process Control
>> working Group <mpiwg-ft at lists.mpi-forum.org>
>> *Date: *Tuesday, December 20, 2016 at 1:29 PM
>> *To: *FTWG <mpiwg-ft at lists.mpi-forum.org>
>> *Subject: *[EXTERNAL] Re: [mpiwg-ft] FTWG Con Call Today
>>
>> The notes from today's call are posted on the wiki:
>>
>> https://github.com/mpiwg-ft/ft-issues/wiki/2016-12-20
>>
>> Those who have specific items, please make progress on those
>> between now and our next meeting. We will be cancelling the Jan
>> 3 call due to the holiday. The next call will be on Jan 17.
>>
>> Thanks,
>>
>> Wesley
>>
>> On December 20, 2016 at 8:15:06 AM, Bland, Wesley
>> (wesley.bland at intel.com <mailto:wesley.bland at intel.com>) wrote:
>>
>> The Fault Tolerance Working Group's biweekly con call is
>> today at 3:00 PM Eastern. Today's agenda:
>>
>> * Recap of face to face meeting
>>
>> * Go over existing tickets
>>
>> * Discuss concerns with ULFM and path forward
>>
>> Thanks,
>>
>> Wesley
>>
>>
>>
>> Join online meeting
>> <https://meet.intel.com/wesley.bland/GHHKQ79Y>
>>
>> https://meet.intel.com/wesley.bland/GHHKQ79Y
>>
>> Join by Phone
>>
>> +1(916)356-2663 (or your local bridge access #) Choose
>>bridge 5.
>>
>> Find a local number <https://dial.intel.com>
>>
>> Conference ID: 757343533
>>
>>
>> _______________________________________________
>> mpiwg-ft mailing list
>> mpiwg-ft at lists.mpi-forum.org
>> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-ft
>>