[Mpi3-ft] Using MPI_Comm_shrink and MPI_Comm_agreement in the same application

George Bosilca bosilca at eecs.utk.edu
Mon Apr 9 11:36:07 CDT 2012


Dave,

MPI_Comm_agree is meant to be used in case of failure. Its cost is significant, large enough that it should not be forced on users in __any__ case.

Below you will find the FT version of your example. We started from the non-fault-tolerant version and added only what was required to make it fault tolerant.

  george.



    success = false;
    do {
        MPI_Comm_size(comm, &size); 
        MPI_Comm_rank(comm, &rank);
        root = (0 == rank);
        do {
            if (root) read_some_data_from_a_file(buffer); 

            rc = MPI_Bcast(buffer, .... ,root, comm);
            if( MPI_SUCCESS != rc ) {  /* check only for FT type of errors */
                MPI_Comm_revoke(comm);
                break;
            }

            done = do_computation(buffer, size); 

            rc = MPI_Allreduce( &done, &success, ... MPI_LAND, comm );  /* logical AND of all "done" flags */
            if( MPI_SUCCESS != rc ) {  /* check only for FT type of errors */
                success = false;  /* not defined if MPI_Allreduce failed */
                MPI_Comm_revoke(comm);
                break;
            }
        } while(false == success);
        MPI_Comm_agree( comm, &success );
        if( false == success ) {
            MPI_Comm_revoke(comm);
            MPI_Comm_shrink(comm, &newcomm);
            MPI_Comm_free(&comm);  /* MPI_Comm_free takes a pointer to the handle */
            comm = newcomm;
        }
    } while (false == success);
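
To make the role of the last two steps concrete, here is a toy model in plain C. It does not call MPI at all: toy_comm, toy_agree and toy_shrink are hypothetical names invented for this sketch. The only assumptions are that the agreement returns the logical AND of the flags contributed by the surviving processes, and that the shrink yields a communicator containing just the survivors.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_RANKS 8

/* Toy stand-in for a communicator (hypothetical, not an MPI type):
 * it only records how many ranks exist and which are still alive. */
typedef struct {
    int  size;
    bool alive[TOY_MAX_RANKS];
} toy_comm;

/* Model of the agreement step: every survivor contributes a flag and
 * all of them obtain the logical AND of those flags. Flags belonging
 * to dead ranks are ignored. */
bool toy_agree(const toy_comm *c, const bool *flags)
{
    assert(c->size <= TOY_MAX_RANKS);
    bool agreed = true;
    for (int i = 0; i < c->size; i++) {
        if (c->alive[i] && !flags[i])
            agreed = false;
    }
    return agreed;
}

/* Model of the shrink step: build a new communicator that contains
 * only the survivors, renumbered densely starting at 0. */
toy_comm toy_shrink(const toy_comm *c)
{
    assert(c->size <= TOY_MAX_RANKS);
    toy_comm out = { 0, { false } };
    for (int i = 0; i < c->size; i++) {
        if (c->alive[i])
            out.alive[out.size++] = true;
    }
    return out;
}
```

In this model, if a single surviving rank contributes false because it saw an error locally, toy_agree returns false for everyone, so all survivors take the shrink branch together. That uniform decision is the property the outer loop in the example relies on.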






