[Mpi3-ft] MPI_Comm_validate_all vs MPI_Comm_reenable_collectives vs ...

Josh Hursey jjhursey at open-mpi.org
Thu Sep 1 08:05:25 CDT 2011

While preparing the EuroMPI poster last night, I noticed that the
naming of the agreement protocol collective is still somewhat
problematic. I think this is what Darius mentioned in a previous
email, but I did not get it until I was converting some of the

The semantics of the agreement protocol collective are pretty solid:
  Uniform agreement of the set of failures in a communicator, and
re-enable collective operations excluding the agreed upon set of

However, the new function name (MPI_Comm_reenable_collectives) only
addresses the latter part of the semantics. Commonly people will be
using this routine to answer questions like:
 "Are there any failures in this communicator?"
 "Are there any new failures in this communicator?"
 "Is Rank X failed in this communicator?"

Those questions focus more on the first part of the semantics. So it
seems to me that we need to adjust the name of this function to
reflect the full semantics of the operation.
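
To make the three questions above concrete, here is a minimal sketch of
how each one reduces to a query on the agreed-upon failed set. It does not
use MPI at all: failed_set_t and the three helper functions are
hypothetical names invented for illustration, with a plain array of ranks
standing in for the group that the agreement call would return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the agreed-upon failed group: a plain array
 * of failed ranks. Not part of MPI or of the proposal. */
typedef struct {
    const int *ranks;  /* failed ranks, in any order */
    size_t count;      /* number of entries in ranks */
} failed_set_t;

/* "Is Rank X failed in this communicator?" */
static bool is_rank_failed(failed_set_t set, int rank)
{
    assert(rank >= 0);
    for (size_t i = 0; i < set.count; i++) {
        if (set.ranks[i] == rank) {
            return true;
        }
    }
    return false;
}

/* "Are there any failures in this communicator?" */
static bool any_failures(failed_set_t set)
{
    return set.count > 0;
}

/* "Are there any new failures in this communicator?"
 * True if curr contains at least one rank that prev does not. */
static bool any_new_failures(failed_set_t prev, failed_set_t curr)
{
    for (size_t i = 0; i < curr.count; i++) {
        if (!is_rank_failed(prev, curr.ranks[i])) {
            return true;
        }
    }
    return false;
}
```

All three helpers read only the failed set, which is the "agreement" half
of the semantics; none of them depends on collectives being re-enabled.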

----- A rough example -----
MPI_Comm_reenable_collectives(comm, &group[0]);
do {
  .. Do work ..
  MPI_Comm_reenable_collectives(comm, &group[1]);
  if( group[0] != group[1] ) { /* New failures detected */
    group[0] = group[1];
    continue; /* retry */
  } else {
    break; /* no new failures - recovery block OK */
  }
} while( 1 );
----- A rough example -----

I don't have a good suggestion at this point, but it is something we
should address. Darius previously suggested the following:

I would like to reserve 'Repair' and 'Restore' for the process
recovery proposal. I am leaning towards MPI_Comm_validate.

Validate implies that we are making a declaration of the accuracy or
soundness of something (in this case the communicator and failed set).
So that seems to give us 'agreement' and to some degree implies 'ok
for collectives'.

What do folks think?

-- Josh

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory

