[Mpi3-ft] Point2Point issue scenario with synchronous notification based on calling communicator only.

Erez Haba erezh at MICROSOFT.com
Tue Feb 10 11:17:23 CST 2009


Don't do a collective repair in rank 1. Do a non-collective repair instead (rank 1 does not need the participation of rank 2 to recover from the failure of rank 0).
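
To make the distinction concrete, here is a minimal sketch of rank 1's error path under that suggestion. The repair routines are hypothetical placeholders (no such calls exist in MPI today), and the sketch assumes the proposal's semantics, i.e. MPI_ERRORS_RETURN is set on MPI_COMM_WORLD and the receive returns an error once the failure of rank 0 is detected:

    #include <mpi.h>

    /* Hypothetical repair routines -- not real MPI functions. */
    void repair_local(MPI_Comm comm);       /* fixes only the caller's view; needs no peers   */
    void repair_collective(MPI_Comm comm);  /* would need every surviving rank to participate */

    /* Rank 1's path through library A, assuming MPI_ERRORS_RETURN on MPI_COMM_WORLD. */
    void rank1_library_A(void)
    {
        int buf;
        if (MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE) != MPI_SUCCESS) {
            /* repair_collective(MPI_COMM_WORLD) would block here waiting for
             * rank 2, which is already inside library B on MPI_COMM_2.
             * A local repair needs no peers, so rank 1 can continue into
             * library B and post its MPI_Send(MPI_COMM_2, dst=2). */
            repair_local(MPI_COMM_WORLD);
        }
    }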

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Thomas Herault
Sent: Monday, February 09, 2009 4:47 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: [Mpi3-ft] Point2Point issue scenario with synchronous notification based on calling communicator only.

Hi list,

With the help of others, here is an adaptation of the "counter-example", this time based on point-to-point (p2p) communications only.

MPI_COMM_WORLD is the communicator used in library A
MPI_COMM_2 is the communicator used in library B

rank 0: belongs to MPI_COMM_WORLD only
  -> in library A: MPI_Send(MPI_COMM_WORLD, dst=1);
   -> crashes

rank 1: belongs to MPI_COMM_WORLD and MPI_COMM_2
  -> in library A: MPI_Recv(MPI_COMM_WORLD, src=0)
   -> detects the failure
   -> calls the error manager: collective repair
  -> would have entered library B and called: MPI_Send(MPI_COMM_2, dst=2);

rank 2: belongs to MPI_COMM_WORLD and MPI_COMM_2
  -> does nothing in library A; proceeds directly to library B.
  -> in library B: MPI_Recv(MPI_COMM_2, src=1);
    -> will never succeed
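
For concreteness, below is a minimal C sketch of the three call sequences (my rendering, not code from either library). Rank 0's crash is simulated with abort(), the collective repair is modelled by an MPI_Barrier over MPI_COMM_WORLD (i.e. an operation that needs every surviving member to enter it), and MPI_COMM_2 is built with MPI_Comm_split, so world ranks 1 and 2 become ranks 0 and 1 inside it:

    #include <mpi.h>
    #include <stdlib.h>

    /* Stand-in for the collective repair: like the real thing, it requires
     * every surviving member of comm to participate. */
    static void error_manager_collective_repair(MPI_Comm comm)
    {
        MPI_Barrier(comm);
    }

    int main(int argc, char **argv)
    {
        int rank, buf = 0;
        MPI_Comm comm2;   /* plays the role of MPI_COMM_2 = {world ranks 1, 2} */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_split(MPI_COMM_WORLD, rank == 0 ? MPI_UNDEFINED : 0, rank, &comm2);

        if (rank == 0) {              /* library A */
            abort();                  /* crashes; the send below is never posted */
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {       /* library A, then library B */
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* failure detected -> collective repair; blocks waiting for rank 2 */
            error_manager_collective_repair(MPI_COMM_WORLD);
            /* library B (never reached): send to world rank 2 = rank 1 of comm2 */
            MPI_Send(&buf, 1, MPI_INT, 1, 0, comm2);
        } else {                      /* rank 2: straight into library B */
            /* waits for world rank 1 = rank 0 of comm2; never completes,
             * because rank 1 is stuck in the repair of MPI_COMM_WORLD */
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, comm2, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }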

I understand from the discussion we had that a solution would be to validate MPI_COMM_WORLD for process 2 before entering library B. I agree that this works, but please consider what it implies: users would have to call an O(n^2) communication operation before every call to every function of every library (and possibly again when those calls return) if they want to use collective repairs. I would advocate studying a less performance-hostile approach, in which errors on any communicator are notified in any MPI call.
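
Here is what "validate MPI_COMM_WORLD before entering library B" would look like for process 2, sketched with a hypothetical validation routine (validate_comm is not a real MPI call; its O(n^2) cost is the assumption discussed above):

    #include <mpi.h>

    void validate_comm(MPI_Comm comm);   /* hypothetical collective validation, ~O(n^2) messages */
    void library_B(MPI_Comm comm);       /* stands in for any entry point of library B           */

    /* Every transition into a library that uses a different communicator would
     * have to be bracketed like this, so that pending failures are noticed
     * before the library blocks on the other communicator. */
    void enter_library_B(MPI_Comm world, MPI_Comm comm_2)
    {
        validate_comm(world);
        validate_comm(comm_2);
        library_B(comm_2);
    }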

Bests,
Thomas