[Mpi3-ft] Point2Point issue scenario with synchronous notification based on calling communicator only.
Erez Haba
erezh at MICROSOFT.com
Tue Feb 10 11:17:23 CST 2009
Don't do a collective repair at rank 1. Do a non-collective repair (rank 1 does not require the participation of rank 2 to recover rank 0).
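A minimal sketch of rank 1's error path under the two approaches. collective_repair() and local_repair() are hypothetical placeholders for the repair operations under discussion (no such calls exist in MPI); comm2 stands for MPI_COMM_2 from the scenario quoted below.

    #include <mpi.h>

    /* Hypothetical placeholders -- not MPI functions. */
    void collective_repair(MPI_Comm comm);  /* blocks until every survivor of comm calls it */
    void local_repair(MPI_Comm comm);       /* involves only the calling process */

    /* Rank 1: MPI_Recv from rank 0 on MPI_COMM_WORLD has just reported a failure. */
    void rank1_after_failed_recv(MPI_Comm comm2, int *buf)
    {
        /* Collective repair: rank 1 would block here waiting for rank 2,
         * which is already blocked in MPI_Recv on comm2 -- the deadlock. */
        /* collective_repair(MPI_COMM_WORLD); */

        /* Non-collective repair: rank 1 recovers from rank 0's failure on
         * its own, then enters library B as planned. */
        local_repair(MPI_COMM_WORLD);
        MPI_Send(buf, 1, MPI_INT, /* rank 2's rank in comm2 = */ 1, 0, comm2);
    }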
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Thomas Herault
Sent: Monday, February 09, 2009 4:47 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: [Mpi3-ft] Point2Point issue scenario with synchronous notification based on calling communicator only.
Hi list,
With the help of others, here is an adaptation of the "counter-example", based on p2p communications only (a code sketch follows the description below).
MPI_COMM_WORLD is the communicator used in library A
MPI_COMM_2 is the communicator used in library B
rank 0: belongs to MPI_COMM_WORLD only
-> in library A: MPI_Send(MPI_COMM_WORLD, dst=1);
-> crashes
rank 1: belongs to MPI_COMM_WORLD and MPI_COMM_2
-> in library A: MPI_Recv(MPI_COMM_WORLD, src=0)
-> detects the failure
-> calls the error manager: collective repair
-> would have entered library B and called: MPI_Send(MPI_COMM_2, dst=2);
rank 2: belongs to MPI_COMM_WORLD and MPI_COMM_2
-> does nothing in library A and proceeds directly to library B.
-> in library B: MPI_Recv(MPI_COMM_2, src=1);
-> will never succeed
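A compilable C sketch of the scenario above, meant to be run with exactly three processes. collective_repair() is a placeholder, not an MPI call: as written it does nothing, so the code only mirrors the call structure; with a real collective repair of MPI_COMM_WORLD, rank 1 would block in it waiting for rank 2, while rank 2 blocks in the MPI_Recv on the second communicator.

    #include <mpi.h>
    #include <stdlib.h>

    /* Placeholder for the proposed collective repair of a communicator
     * (not an MPI function). In the proposal it would block until every
     * surviving member of comm -- here, rank 2 as well -- calls it. */
    static void collective_repair(MPI_Comm comm) { (void)comm; }

    int main(int argc, char **argv)
    {
        int rank, buf = 0;
        MPI_Comm comm2;                       /* plays the role of MPI_COMM_2 */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* MPI_COMM_2 contains ranks 1 and 2 only; rank 0 gets MPI_COMM_NULL. */
        MPI_Comm_split(MPI_COMM_WORLD, rank == 0 ? MPI_UNDEFINED : 0, rank, &comm2);

        if (rank == 0) {
            /* Library A */
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            abort();                                      /* rank 0 crashes */
        } else if (rank == 1) {
            /* Library A: this receive reports the failure of rank 0. */
            MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
            if (MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE) != MPI_SUCCESS)
                collective_repair(MPI_COMM_WORLD);        /* needs rank 2 too */
            /* Library B: not reached while the repair waits for rank 2. */
            MPI_Send(&buf, 1, MPI_INT, 1 /* = rank 2 in comm2 */, 0, comm2);
        } else { /* rank == 2 */
            /* Library A: nothing to do; goes straight to library B. */
            MPI_Recv(&buf, 1, MPI_INT, 0 /* = rank 1 in comm2 */, 0, comm2,
                     MPI_STATUS_IGNORE);                  /* never completes */
        }

        MPI_Finalize();
        return 0;
    }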
I understand from the discussion we had that a solution would be to validate COMM_WORLD for process 2 before entering library B. I agree with that, but would like you to consider what it effectively means: we ask users to call an n^2 communication operation before every call into any library (and possibly again on return) if they want to use collective repairs. I would advocate studying a less performance-killing approach, in which errors on any communicator would be notified in any MPI call.
Best,
Thomas