[Mpi3-ft] Fault recovery of multiple communication libraries

Thu Feb 12 22:28:09 CST 2009

Sriram,

As I understand it, our effort aims to keep the MPI library usable
after failures and to make fault-tolerant applications possible on top
of the MPI standard. When mixing multiple communication models, the
application will have to do the extra work needed to keep the system
view "in sync", just as we must do this work inside the MPI library to
keep MPI processes in sync with each other.

I agree with you that the hardest part, if you are dealing with
multiple libraries that tolerate failures, is to coordinate them: the
application must decide which one is responsible for restarting
processes, and then do all the necessary work to re-initialize or
mend the other libraries with the new process.

In your example, the MPI implementation would restart the failed
processes, mend the communicators, obtain all the information needed
to mend the ARMCI library (including new hostnames), and so on. This
is theoretically possible in our proposal. Or do you see something
missing?
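
To make the division of work concrete, here is a minimal sketch in plain C
of the application-level coordination described above. Everything in it is
hypothetical: the types (ft_library_t, ft_coordinator_t, replacement_t),
the hooks (restart, mend), and the stub functions are illustration only and
are not part of MPI, ARMCI, or the current proposal. One library is chosen
as the restarter; whatever it reports about the replacement process (here,
a rank and a hostname) is then handed to every other library so it can
repair its own state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LIBS 8
#define HOST_LEN 64

/* What the restarting library reports about a replacement process. */
typedef struct {
    int  rank;                /* rank of the process that was replaced */
    char hostname[HOST_LEN];  /* host on which the replacement runs */
} replacement_t;

/* Hypothetical per-library hooks, supplied by the application. */
typedef struct {
    const char *name;
    int (*restart)(int failed_rank, replacement_t *out); /* 0 on success */
    int (*mend)(const replacement_t *repl);              /* 0 on success */
} ft_library_t;

typedef struct {
    ft_library_t libs[MAX_LIBS];
    size_t       count;
    size_t       restarter;   /* index of the library that restarts processes */
} ft_coordinator_t;

void coordinator_init(ft_coordinator_t *c)
{
    assert(c != NULL);
    c->count = 0;
    c->restarter = 0;
}

/* Register a library; returns 0 on success, -1 if the table is full. */
int coordinator_add(ft_coordinator_t *c, ft_library_t lib, int is_restarter)
{
    assert(c != NULL);
    if (c->count == MAX_LIBS)
        return -1;
    if (is_restarter)
        c->restarter = c->count;
    c->libs[c->count++] = lib;
    return 0;
}

/* Ask the restarter to replace failed_rank, then let every other library
 * mend itself. Returns the number of libraries mended, or -1 on error. */
int coordinator_recover(ft_coordinator_t *c, int failed_rank)
{
    replacement_t repl;
    int mended = 0;

    assert(c != NULL);
    if (c->count == 0)
        return -1;
    if (c->libs[c->restarter].restart == NULL ||
        c->libs[c->restarter].restart(failed_rank, &repl) != 0)
        return -1;
    for (size_t i = 0; i < c->count; i++) {
        if (i == c->restarter || c->libs[i].mend == NULL)
            continue;         /* restarter already repaired itself */
        if (c->libs[i].mend(&repl) != 0)
            return -1;
        mended++;
    }
    return mended;
}

/* Stubs standing in for the real libraries (placeholder behaviour only). */
static char armci_seen_host[HOST_LEN];

static int stub_mpi_restart(int failed_rank, replacement_t *out)
{
    out->rank = failed_rank;
    strncpy(out->hostname, "replacement-host", HOST_LEN - 1);
    out->hostname[HOST_LEN - 1] = '\0';
    return 0;
}

static int stub_armci_mend(const replacement_t *repl)
{
    strncpy(armci_seen_host, repl->hostname, HOST_LEN - 1);
    armci_seen_host[HOST_LEN - 1] = '\0';
    return 0;
}

/* MPI restarts the process; ARMCI is mended with the new hostname. */
int example_recovery(int failed_rank)
{
    ft_coordinator_t c;
    ft_library_t mpi   = { "MPI",   stub_mpi_restart, NULL };
    ft_library_t armci = { "ARMCI", NULL, stub_armci_mend };

    coordinator_init(&c);
    coordinator_add(&c, mpi, 1);
    coordinator_add(&c, armci, 0);
    return coordinator_recover(&c, failed_rank);
}
```

Swapping the is_restarter flag between the two registrations gives the
opposite arrangement, in which the other library restarts the process and
MPI is the one being mended. The choice stays with the application.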

The converse is also possible in our proposal: the ARMCI library could
detect the failure, some other mechanism could be used to launch the
missing processes, and the new processes would call MPI_Init and use an
MPI_Comm_connect / MPI_Comm_accept scheme to connect the two worlds and
let the application behave as it did before the failure.
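
As a rough illustration of the connect/accept scheme mentioned above, here
is a sketch using the existing MPI-2 dynamic process calls (MPI_Open_port,
MPI_Comm_accept, MPI_Comm_connect, MPI_Intercomm_merge). It rests on two
assumptions that are not established facts: the MPI library of the
surviving processes is still usable after the failure (which is what the
proposal is meant to guarantee), and the application passes the port name
to the replacement process out of band, here simply as a command-line
argument. A real recovery would accept on whatever communicator the
survivors hold after the failure rather than on MPI_COMM_WORLD.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm inter, merged;
    int size;

    MPI_Init(&argc, &argv);

    if (argc > 1) {
        /* Replacement process: argv[1] is the port name, obtained out of
         * band (assumption: the application transports it). */
        MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        /* high = 1: replacement ranks are ordered after the survivors. */
        MPI_Intercomm_merge(inter, 1, &merged);
    } else {
        /* Surviving processes: rank 0 opens a port and publishes it. */
        char port[MPI_MAX_PORT_NAME] = "";
        int rank;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Open_port(MPI_INFO_NULL, port);
            printf("port: %s\n", port);
            fflush(stdout);
        }
        /* Collective over the survivors; port is significant at root only. */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        MPI_Intercomm_merge(inter, 0, &merged);
        if (rank == 0)
            MPI_Close_port(port);
    }

    /* merged now spans the survivors and the replacement process. */
    MPI_Comm_size(merged, &size);
    printf("merged communicator has %d processes\n", size);

    MPI_Comm_free(&merged);
    MPI_Comm_free(&inter);
    MPI_Finalize();
    return 0;
}
```

To try it, start the survivors without arguments, then start the
replacement as a separate job with the printed port name as its argument
(quoted, since port names may contain characters the shell interprets).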

The point for me here is that this is a decision for the fault-tolerant
application. Thus, MPI-3.0 should provide what the application needs to
do that, but not do it for the application. Do you think there is a
missing API or functionality in the current proposal that prevents this
kind of scenario?

Best,
Thomas

On 12 Feb 09, at 21:30, Krishnamoorthy, Sriram wrote:

> Thomas,
>
> Consider an application that uses both MPI and another communication
> library, say ARMCI. The application requires both to be fault-resilient.
> If a failure identified by one library can be reported to the other in
> some fashion, both can stay in sync with respect to their view of the
> processes and which faults have been observed. This does not have to be
> collective: each process may have a different view.
>
> If the communication libraries are not in sync, MPI can handle failures
> whenever an MPI communication is involved. However, the other
> communication library might need a mechanism to determine (possibly
> without co-operation from other processes) whether the process-view of
> the system has changed -- whether a new communicator should be used, a
> failed process has been replaced by another one, etc.
>
> For example, with the FT-MPI-like model where a failed process is
> replaced by another, ARMCI would at least need the hostname of the new
> process to repair its state.
>
> Sriram.K
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 12 Feb 2009 16:07:41 -0500
> From: Thomas Herault <herault.thomas at gmail.com>
> Subject: Re: [Mpi3-ft] Fault recovery of multiple communication
> 	libraries
> To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working
> 	Group"	<mpi3-ft at lists.mpi-forum.org>
> Message-ID: <6E20C9CC-7CA1-42CA-AB18-55075ACBBED1 at gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed; delsp=yes
>
> Hello,
>
> We have not considered mixed communication models yet, and I am not sure
> we need to do so.
> Let's call A the process that fails, and B the set of processes that
> should be notified. Consider the two following cases:
>  - A also uses MPI to communicate with processes from B. Then, when a
> process from B tries to communicate with A, it will be notified by MPI
> of the error.
>  - A does not use MPI to communicate with processes from B. Then, the
> application does not need help from MPI to deal with the failure: MPI is
> not broken, and thus does not have to be mended.
>
> Thomas
>
>
> On 12 Feb 09, at 14:46, Krishnamoorthy, Sriram wrote:
>
>> I would like to understand how MPI can be notified of a failure.
>> Consider another communication library (ARMCI/GASnet/...) identifying
>> a failure through its own mechanisms. In the current model, how can it
>> notify MPI to verify/reconfigure/recover from the error, for example
>> by performing an MPI communication to the failed process?
>>
>> Conversely, can a communication library register to be notified of an
>> error that MPI identifies and recovers from, so that the library
>> can take appropriate action?
>>
>> Sriram.K
>>
>>
>