[Mpi3-ft] MPI_Comm_validate_all vs MPI_Comm_reenable_collectives vs ...

Howard Pritchard howardp at cray.com
Thu Sep 1 10:11:54 CDT 2011


Hi Josh,

I vote for MPI_COMM_VALIDATE.

Howard

Josh Hursey wrote:
> While preparing the EuroMPI poster last night, I noticed that the
> naming of the agreement protocol collective is still somewhat
> problematic. I think this is what Darius mentioned in a previous
> email, but I did not get it until I was converting some of the
> examples.
> 
> The semantics of the agreement protocol collective are pretty solid:
>   Uniform agreement on the set of failures in a communicator, and
> re-enabling collective operations excluding the agreed-upon set of
> failures.
> 
> However the new function name (MPI_Comm_reenable_collectives) only
> addresses the latter part of the semantics. Commonly people will be
> using this routine to answer questions like:
>  "Are there any failures in this communicator?"
>  "Are there any new failures in this communicator?"
>  "Is Rank X failed in this communicator?"
> 
> Those questions focus more on the first part of the semantics. So it
> seems to me that we need to adjust the name of this function to
> reflect the full semantics of the operation.
> 
> ----- A rough example -----
> MPI_Comm_reenable_collectives(comm, &group[0]);
> do {
>   .. Do work ..
>   MPI_Comm_reenable_collectives(comm, &group[1]);
>   if( group[0] != group[1] ) { /* New failures detected */
>     group[0] = group[1];
>     continue; /* retry */
>   } else {
>     break; /* no new failures - recovery block OK */
>   }
> } while( 1 );
> ----- A rough example -----
> 
> I don't have a good suggestion at this point, but it is something we
> should address. Darius previously suggested the following:
>  MPI_COMM_VALIDATE
>  MPI_COMM_REPAIR
>  MPI_COMM_RESTORE
>  MPI_COMM_RECONCILE
>  MPI_COMM_CORRELATE
>  MPI_COMM_NORMALIZE
> 
> I would like to reserve 'Repair' and 'Restore' for the process
> recovery proposal. I am leaning towards MPI_Comm_validate.
> 
> Validate implies that we are making a declaration of the accuracy or
> soundness of something (in this case the communicator and failed set).
> So that seems to give us 'agreement' and to some degree implies 'ok
> for collectives'.
> 
> What do folks think?
> 
> -- Josh
> 
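
For anyone who wants to try the retry pattern in the rough example above
without an FT-enabled MPI, here is a minimal sketch in plain C. It makes
no MPI calls. Everything in it is an assumption made for illustration:
sim_comm, sim_validate, run_recovery_block and rank_failed are
hypothetical names, the failed set is modeled as a bitmask rather than
an MPI_Group, and the "agreement" is just a scripted sequence of
snapshots. It only shows how one call can answer both "are there new
failures?" and "is rank X failed?".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a communicator: the agreed failed set that
 * each validation call will report, scripted in advance.
 * Bit r set in a snapshot means rank r is in the failed set. */
typedef struct {
    const unsigned *snapshots; /* failed-set bitmask returned per call */
    size_t count;              /* number of scripted snapshots (> 0) */
    size_t next;               /* index of the next snapshot to return */
} sim_comm;

/* Hypothetical stand-in for the agreement collective under discussion.
 * Returns the next scripted failed set; once the script is exhausted it
 * keeps returning the last snapshot. */
static unsigned sim_validate(sim_comm *c)
{
    assert(c->count > 0);
    size_t i = (c->next < c->count) ? c->next : c->count - 1;
    c->next++;
    return c->snapshots[i];
}

/* Retry the recovery block until the failed set is the same before and
 * after the work. Returns the number of attempts and stores the final
 * agreed failed set in *final_failed. */
static int run_recovery_block(sim_comm *c, unsigned *final_failed)
{
    unsigned before = sim_validate(c);
    int attempts = 0;
    for (;;) {
        attempts++;
        /* ... do work ... */
        unsigned after = sim_validate(c);
        if (after == before) {  /* no new failures: recovery block OK */
            *final_failed = after;
            return attempts;
        }
        before = after;         /* new failures detected: retry */
    }
}

/* "Is rank X failed?" in this model. Assumes 0 <= rank < 32. */
static int rank_failed(unsigned failed, int rank)
{
    return (int)((failed >> rank) & 1u);
}
```

With a script of { 0x0, 0x4, 0x4 }, rank 2 fails during the first
attempt, so the block runs twice and finishes with rank 2 in the failed
set. In a real implementation the comparison would be on groups (for
example via MPI_Group_compare) rather than on integers.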


-- 
Howard Pritchard
Software Engineering
Cray, Inc.


