[Mpi3-ft] run through stabilization user-guide
bronevetsky1 at llnl.gov
Wed Feb 9 14:14:10 CST 2011
Josh, I'm rusty on the semantics here. Isn't it possible for the workers to choose MPI_ERRORS_ARE_FATAL and for the master to choose MPI_ERRORS_RETURN?
Lawrence Livermore National Lab
bronevetsky at llnl.gov
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Toon Knapen
Sent: Wednesday, February 09, 2011 11:42 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] run through stabilization user-guide
On Wed, Feb 9, 2011 at 4:22 PM, Bronevetsky, Greg <bronevetsky1 at llnl.gov> wrote:
If the workers use communicators whose error handler is MPI_ERRORS_ARE_FATAL, they will be aborted automatically if they are disconnected from the master. Meanwhile, the master will be informed of their "failure" because of the disconnect. When the connection to the physical nodes that previously hosted the aborted workers is re-established, the master's MPI library will see that the worker tasks are dead and will not need to kill the master.
From the user guide, I did not understand that there is this kind of 'interoperability' between the different error handlers. For instance, the user guide says 'The application must opt-in to the fault tolerance semantics by replacing the default error handler'.