[mpiwg-ft] [EXTERNAL] Re: FTWG Con Call Today

Ignacio Laguna lagunaperalt1 at llnl.gov
Tue Dec 20 18:55:21 CST 2016


Hi Keita,

I think we all agree that there is no silver-bullet solution to the FT 
problem, that each recovery model (whether it's ULFM, Reinit, Fenix, 
or ULFM+autorecovery) works for some codes but not for others, and 
that one way to cover all applications is to allow multiple recovery 
models.

In the last telecon we discussed two ways to do that: (a) all models 
are compatible with each other; (b) they are not compatible, so the 
application has to select the model to be used (which implies that 
libraries used by the application have to support that model as well). 
The ideal case is (a), but we are not sure whether it's possible, so 
we are going to discuss each model in detail to explore that 
possibility. Case (b) is always a fallback; in that situation you can 
still run Fenix on top of ULFM.

BTW, correct me if I'm wrong, but Reinit and Fenix share (at a high 
level) the same idea of global backward recovery with a longjmp to 
reinject execution; thus perhaps we should call the fourth option 
Reinit/Fenix.
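
For readers less familiar with the mechanism, here is a minimal, 
self-contained sketch of that shared idea (the names and the simulated 
failure are hypothetical illustrations, not the Reinit or Fenix API):

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf restart_point;        /* the global rollback point */
    static int simulated_failures = 1;   /* pretend one failure occurs */

    /* Stand-in for the runtime noticing a process failure. */
    static void fault_detected(void)
    {
        if (simulated_failures-- > 0)
            longjmp(restart_point, 1);   /* reinject execution at the top */
    }

    int main(void)
    {
        if (setjmp(restart_point))       /* 0 on first pass, 1 on rollback */
            printf("rolled back: reload checkpointed state here\n");
        printf("entering compute phase\n");
        fault_detected();                /* first pass triggers the rollback */
        printf("compute phase completed\n");
        return 0;
    }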

Ignacio


On 12/20/16 3:06 PM, Teranishi, Keita wrote:
> All,
>
> Throughout the discussion, I am a bit worried about making MPI bigger
> than a message passing interface, because I would like MPI to remain a
> good abstraction of a user-friendly transport layer.  Fenix is intended
> to leverage the minimalist approach of MPI-FT (ULFM today) to cover
> most online recovery models for parallel programs using MPI.  The
> current version is designed to support the SPMD (single program,
> multiple data) model, but we wish to support other models, including
> Master-Worker, Distributed Asynchronous Many-Task (AMT), and
> Message-Logging.
>
> ·ULFM: We have requested non-blocking communicator recovery, as well
> as non-blocking comm_dup, comm_split, etc.  ULFM already provides a
> good mechanism for master-worker-style recovery, as in UQ, model
> reduction, and certain families of eigenvalue solvers.  I would like
> finer control over revocation, because it is possible to keep certain
> connections among surviving processes alive (for master-worker or
> task-parallel computing), but that might be too difficult.
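>
> As a concrete reference for the discussion, a minimal sketch of the
> shrink-based repair this relies on, using the MPIX_ calls from the
> ULFM prototype (error checking omitted; illustrative, not a complete
> recipe):
>
>     #include <mpi.h>
>     #include <mpi-ext.h>   /* ULFM prototype: MPIX_Comm_revoke/shrink */
>
>     /* Replace a damaged communicator with one containing only the
>        surviving ranks; every survivor calls this after a failure. */
>     static void repair(MPI_Comm *comm)
>     {
>         MPI_Comm survivors;
>         MPIX_Comm_revoke(*comm);              /* interrupt pending ops */
>         MPIX_Comm_shrink(*comm, &survivors);  /* live ranks only */
>         MPI_Comm_free(comm);
>         *comm = survivors;   /* master then reassigns the lost tasks */
>     }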
>
> ·ULFM + Auto recovery: I need clarification from Wesley (my
> understanding is most likely wrong… but let me continue based on my
> assumptions).  Fenix assumes that a failure hits a single process or a
> small number of processes.  In this model, auto-recovery could serve
> as uncoordinated recovery, because no comm_shrink call is used to fix
> the communicator.  This could help the message replay of an
> uncoordinated recovery model: recovery is never manifested as a
> failure to the surviving ranks; it merely makes particular message
> passing calls very slow.  For the SPMD model, adaptation is
> challenging because the user needs to write the code that recovers the
> lost state of the failed processes.  However, I can see a great
> benefit for implementing a resilient task-parallel programming model.
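>
> As background for the auto-recovery idea, the prerequisite for any
> such transparent repair is that failures reach the library as error
> codes instead of aborting the job.  A minimal sketch of that plumbing,
> assuming the ULFM prototype's MPIX_ERR_PROC_FAILED error class
> (illustrative only):
>
>     #include <mpi.h>
>     #include <mpi-ext.h>   /* MPIX_ERR_PROC_FAILED (ULFM prototype) */
>     #include <stdio.h>
>
>     /* Error handler a recovery library could install; a real one
>        would repair the communicator before returning to the caller. */
>     static void failure_handler(MPI_Comm *comm, int *err, ...)
>     {
>         int eclass;
>         (void)comm;                      /* unused in this sketch */
>         MPI_Error_class(*err, &eclass);
>         if (eclass == MPIX_ERR_PROC_FAILED)
>             fprintf(stderr, "peer failed; repairing behind the scenes\n");
>     }
>
>     void install_failure_handler(MPI_Comm comm)
>     {
>         MPI_Errhandler eh;
>         MPI_Comm_create_errhandler(failure_handler, &eh);
>         MPI_Comm_set_errhandler(comm, eh);  /* replace MPI_ERRORS_ARE_FATAL */
>         MPI_Errhandler_free(&eh);
>     }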
>
> ·Communicator with hole: Master-worker-type applications will benefit
> from this when using collectives to gather whatever data is available.
>
> ·MPI_ReInit: MPI_ReInit is very close to the current Fenix model.  We
> have written the API specification (see attached) to support the same
> type of online recovery (global rollback upon process failure).  The
> code is implemented using MPI-ULFM, and we have seen some issues with
> MPI-ULFM that make recovering multiple communicators convoluted.  We
> used PMPI to hide all the details of error handling, garbage
> collection, and communicator recovery; the rollback (to Fenix_Init) is
> performed through longjmp.  Nice features of Fenix are (1) the idea of
> a *resilient communicator*, which lets users specify which
> communicators need to be fixed automatically, and (2) *callback
> functions* that assist application-specific recovery once communicator
> recovery is done.  We originally did not intend Fenix to be part of
> the MPI standard, because we want the role of MPI confined to “Message
> Passing” and do not want to delay the MPI standardization discussions.
>
> My understanding is that MPI_ReInit standardizes online rollback
> recovery and keeps the PMPI/QMPI layer clean through a tight binding
> with layers invisible to typical MPI users (or tool developers) ---
> Ignacio, please correct me if I am wrong.  My biggest concern with
> MPI_ReInit is that defining a rollback model in a message passing
> library may violate the original design philosophy of MPI (again, this
> is why we did not propose Fenix for the MPI standard).  Another
> concern is that it might be difficult to keep other recovery options
> open, but I think that is easy to address with a few knobs or switches
> in the APIs.  I think we can figure out the options as we discuss
> further.
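>
> To make features (1) and (2) concrete, here is a hypothetical sketch
> of how they fit together underneath (register_recovery_callback,
> resilient_comm, and on_failure are invented for illustration; they
> are not the Fenix or ReInit API):
>
>     #include <mpi.h>
>     #include <setjmp.h>
>
>     static jmp_buf rollback;              /* rollback point, set at init */
>     static void (*recover_cb)(MPI_Comm);  /* user recovery callback */
>     static MPI_Comm resilient_comm;       /* communicator to auto-repair */
>
>     /* Hypothetical registration call (illustrative only). */
>     void register_recovery_callback(void (*cb)(MPI_Comm)) { recover_cb = cb; }
>
>     /* What the PMPI layer conceptually does on a detected failure:
>        repair the registered communicator (e.g., via ULFM shrink),
>        run the application's callback, then roll execution back. */
>     void on_failure(void)
>     {
>         /* ... repair resilient_comm here ... */
>         if (recover_cb) recover_cb(resilient_comm);
>         longjmp(rollback, 1);             /* reinject at the init point */
>     }
>
>     static void restore_state(MPI_Comm repaired)
>     {
>         (void)repaired;  /* reload app data, e.g. in-memory checkpoint */
>     }
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         MPI_Comm_dup(MPI_COMM_WORLD, &resilient_comm);
>         register_recovery_callback(restore_state);
>         if (setjmp(rollback)) { /* recovered; callback already ran */ }
>         /* ... compute on resilient_comm; failures funnel to on_failure() ... */
>         MPI_Finalize();
>         return 0;
>     }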
>
> Thanks,
>
> Keita
>
> *From: *"Bland, Wesley" <wesley.bland at intel.com>
> *Date: *Tuesday, December 20, 2016 at 1:48 PM
> *To: *MPI WG Fault Tolerance and Dynamic Process Control working Group
> <mpiwg-ft at lists.mpi-forum.org>, "Teranishi, Keita" <knteran at sandia.gov>
> *Subject: *Re: [mpiwg-ft] [EXTERNAL] Re: FTWG Con Call Today
>
> Probably here since we don't have an issue for this discussion. If you
> want to open issues in our working group's repository
> (github.com/mpiwg-ft/ft-issues), that's probably fine.
>
> On December 20, 2016 at 3:47:25 PM, Teranishi, Keita (knteran at sandia.gov
> <mailto:knteran at sandia.gov>) wrote:
>
>     Wesley,
>
>     Should I do it here or in GitHub issues?
>
>     Thanks,
>
>     Keita
>
>     *From: *"Bland, Wesley" <wesley.bland at intel.com>
>     *Date: *Tuesday, December 20, 2016 at 1:43 PM
>     *To: *MPI WG Fault Tolerance and Dynamic Process Control working
>     Group <mpiwg-ft at lists.mpi-forum.org>, "Teranishi, Keita"
>     <knteran at sandia.gov>
>     *Subject: *Re: [mpiwg-ft] [EXTERNAL] Re: FTWG Con Call Today
>
>     You don't have to wait. :) If you have comments/concerns, you can
>     raise them here too.
>
>     On December 20, 2016 at 3:38:47 PM, Teranishi, Keita
>     (knteran at sandia.gov <mailto:knteran at sandia.gov>) wrote:
>
>         All,
>
>         Sorry, I could not make it today.  I will definitely join the
>         meeting next time to make comments/suggestions on the three
>         items (ULFM, ULFM+Auto, and ReInit) from the Fenix perspective.
>
>         Thanks,
>
>         Keita
>
>         *From: *<mpiwg-ft-bounces at lists.mpi-forum.org> on behalf of
>         "Bland, Wesley" <wesley.bland at intel.com>
>         *Reply-To: *MPI WG Fault Tolerance and Dynamic Process Control
>         working Group <mpiwg-ft at lists.mpi-forum.org>
>         *Date: *Tuesday, December 20, 2016 at 1:29 PM
>         *To: *FTWG <mpiwg-ft at lists.mpi-forum.org>
>         *Subject: *[EXTERNAL] Re: [mpiwg-ft] FTWG Con Call Today
>
>         The notes from today's call are posted on the wiki:
>
>         https://github.com/mpiwg-ft/ft-issues/wiki/2016-12-20
>
>         If you have specific items, please make progress on them
>         between now and our next meeting. We will be cancelling the
>         Jan 3 call due to the holiday. The next call will be on Jan 17.
>
>         Thanks,
>
>         Wesley
>
>         On December 20, 2016 at 8:15:06 AM, Bland, Wesley
>         (wesley.bland at intel.com <mailto:wesley.bland at intel.com>) wrote:
>
>             The Fault Tolerance Working Group’s biweekly con call is
>             today at 3:00 PM Eastern. Today's agenda:
>
>             * Recap of face to face meeting
>
>             * Go over existing tickets
>
>             * Discuss concerns with ULFM and path forward
>
>             Thanks,
>
>             Wesley
>
>
>             Join online meeting: https://meet.intel.com/wesley.bland/GHHKQ79Y
>
>             Join by Phone
>
>             +1(916)356-2663 (or your local bridge access #) Choose bridge 5.
>
>             Find a local number <https://dial.intel.com>
>
>             Conference ID: 757343533
>
>

