[Mpi3-ft] Fault recovery of multiple communication libraries

Krishnamoorthy, Sriram sriram at pnl.gov
Thu Feb 12 20:30:12 CST 2009


Thomas,

Consider an application that uses both MPI and another communication
library, say ARMCI. The application requires both to be fault-resilient.
If a failure identified by one library can be reported to the other in
some fashion, both can stay in sync with respect to their view of the
processes and which faults have been observed. This does not have to be
collective; each process may have a different view.
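
To make the idea concrete, here is a minimal sketch of such a
notification path, assuming a per-process registry that both libraries
share. Everything in it (FailureRegistry, register, report, the library
names used as keys) is hypothetical and is not part of MPI or ARMCI. It
only illustrates that the exchange can be local to a process, with no
collective step.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

# Hypothetical callback type: (failed rank, name of the reporting library).
Listener = Callable[[int, str], None]


class FailureRegistry:
    """Per-process record of failed ranks, shared by the libraries in this process."""

    def __init__(self) -> None:
        self._failed: Set[int] = set()
        self._listeners: Dict[str, Listener] = {}

    def register(self, library: str, callback: Listener) -> None:
        """Ask to be told about failures that other libraries observe."""
        self._listeners[library] = callback

    def report(self, reporter: str, rank: int) -> bool:
        """Record a failure seen by `reporter` and notify every other library.

        Returns True if the failure was new to this process, False otherwise.
        """
        if rank in self._failed:
            return False
        self._failed.add(rank)
        for name, callback in self._listeners.items():
            if name != reporter:
                callback(rank, reporter)
        return True

    def failed(self) -> FrozenSet[int]:
        """This process's current view of failed ranks (not a global view)."""
        return frozenset(self._failed)


# Usage: ARMCI detects that rank 3 failed; only the MPI listener is called.
seen: List[Tuple[str, int, str]] = []
registry = FailureRegistry()
registry.register("MPI", lambda rank, source: seen.append(("MPI", rank, source)))
registry.register("ARMCI", lambda rank, source: seen.append(("ARMCI", rank, source)))
registry.report("ARMCI", 3)
```

Each process holds its own registry, so two processes may disagree for
a while about which ranks have failed, which matches the non-collective
view described above.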

If the communication libraries are not in sync, MPI can handle failures
whenever an MPI communication is involved. However, the other
communication library might need a mechanism to determine (possibly
without co-operation from other processes) whether the process-view of
the system has changed -- whether a new communicator should be used, a
failed process has been replaced by another one, etc.

For example, with the FT-MPI-like model where a failed process is
replaced by another, ARMCI would at least need the hostname of the new
process to repair its state. 
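
As a rough illustration of the local check I have in mind, the sketch
below assumes a process view that carries an epoch counter and a log of
replacements. The names (ProcessView, replace, changes_since) and the
hostnames are made up for the example and do not correspond to any
existing MPI or ARMCI interface. The second library remembers the last
epoch it saw and asks, without involving other processes, what has
changed since then.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ProcessView:
    """Hypothetical local view of the job: rank -> hostname, plus a change epoch."""

    hosts: Dict[int, str]
    epoch: int = 0
    # Each entry is (epoch at which the change happened, rank, new hostname).
    log: List[Tuple[int, int, str]] = field(default_factory=list)

    def replace(self, rank: int, new_host: str) -> None:
        """Record that a failed rank was replaced by a process on `new_host`."""
        self.epoch += 1
        self.hosts[rank] = new_host
        self.log.append((self.epoch, rank, new_host))

    def changes_since(self, seen_epoch: int) -> List[Tuple[int, str]]:
        """Return (rank, hostname) for every replacement after `seen_epoch`."""
        return [(rank, host) for (e, rank, host) in self.log if e > seen_epoch]


# Usage: the other library caches the epoch, then later polls for changes.
view = ProcessView(hosts={0: "node-a", 1: "node-b"})
armci_epoch = view.epoch          # last epoch the other library has seen
view.replace(1, "node-c")         # rank 1 failed and was replaced
pending = view.changes_since(armci_epoch)
```

Here pending holds the rank and new hostname that the second library
would need in order to repair its own state.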

Sriram.K

 
------------------------------

Message: 2
Date: Thu, 12 Feb 2009 16:07:41 -0500
From: Thomas Herault <herault.thomas at gmail.com>
Subject: Re: [Mpi3-ft] Fault recovery of multiple communication
	libraries
To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working
	Group"	<mpi3-ft at lists.mpi-forum.org>
Message-ID: <6E20C9CC-7CA1-42CA-AB18-55075ACBBED1 at gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed; delsp=yes

Hello,

We have not considered mixed communication models yet, and I am not sure
we need to do so.
Let's call A the process that fails, and B the set of processes that
should be notified. Consider the following two cases:
  - A also uses MPI to communicate with processes from B. Then, when a
process from B tries to communicate with A, it will be notified by MPI
of the error.
  - A does not use MPI to communicate with processes from B. Then, the
application does not need help from MPI to deal with the failure: MPI is
not broken, thus does not have to be mended.

Thomas


On 12 Feb 09 at 14:46, Krishnamoorthy, Sriram wrote:

> I would like to understand how MPI can be notified of a failure.
> Consider another communication library (ARMCI/GASnet/...) identifying
> a failure through its own mechanisms. In the current model, how can it
> notify MPI to verify/reconfigure/recover from the error, for example
> by performing an MPI communication to the failed process?
>
> Conversely, can a communication library register to be notified of an 
> error that MPI identifies and recovers from, so that the library
> can take appropriate action?
>
> Sriram.K
>
>



