[Mpi3-ft] Using MPI_Comm_shrink and MPI_Comm_agreement in the same application

Josh Hursey jjhursey at open-mpi.org
Mon Apr 9 09:55:18 CDT 2012


That is one way to do it. The core problem is twofold:
 1) collective operations do not return uniformly in the presence of
process failure, and
 2) shrink() and agreement() cannot be invalidated, and must return uniformly.

So you can either avoid branching on the return code from a collective,
or branch on the return code and then code around the potential
bifurcation in the program execution flow when an error occurs.
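
For concreteness, the sketches below use the interfaces from the current
proposal (MPI_Comm_invalidate, MPI_Comm_shrink, MPI_Comm_agreement) and
assume a setup roughly like the following. This frame is only illustrative;
the buffer size and the helper routines are placeholders, not anything
defined by the proposal:
-------------------------
/* Assumed frame for the sketches below:
 *  - 'comm' is a dup of MPI_COMM_WORLD with MPI_ERRORS_RETURN set, so
 *    collectives return an error code instead of aborting,
 *  - 'root' is the rank that reads the file; the 'if (root)' test in
 *    the sketches is shorthand for 'if (rank == root)',
 *  - read_some_data_from_a_file() and do_computation() are application
 *    routines; do_computation() returns nonzero once the work is done. */
MPI_Comm comm, newcomm;
int rank, size, err, done = 0;
const int root = 0;
char buffer[4096];          /* size is arbitrary for the example */

MPI_Init(&argc, &argv);
MPI_Comm_dup(MPI_COMM_WORLD, &comm);
MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
MPI_Comm_rank(comm, &rank);
-------------------------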

An alternative to your solution would be:
-------------------------
do {
  if (root) read_some_data_from_a_file(buffer);

  err = MPI_Bcast(buffer, ...., root, comm);
  if (err) {
    /* Display a message, but take no collective action */
  }
  /* Agree on whether the Bcast succeeded everywhere */
  done = (MPI_SUCCESS == err);
  MPI_Comm_agreement(comm, &done);
  if( !done ) {
    MPI_Comm_invalidate(comm);
    MPI_Comm_shrink(comm, &newcomm);
    MPI_Comm_free(&comm);
    comm = newcomm;
    continue; /* try Bcast again */
  }

  MPI_Comm_size(comm, &size);
  done = do_computation(buffer, size);
  /* Agree for sure on whether we are done with the computation */
  MPI_Comm_agreement(comm, &done);
  if( !done ) {
    MPI_Comm_invalidate(comm);
    MPI_Comm_shrink(comm, &newcomm);
    MPI_Comm_free(&comm);
    comm = newcomm;
    continue; /* try Bcast again */
  }
} while (!done);
MPI_Finalize();
----------------------
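
Note that this pattern relies on the agreement combining the local 'done'
flags with a logical AND across the participants, so no process can
conclude 'done' unless everyone contributed true. In a failure-free run you
can think of it as roughly equivalent to the reduction below (a sketch
only, not the proposal's definition; the real agreement additionally keeps
its collective behavior and uniform return even when comm is invalidated):
-------------------------
/* Failure-free analogue of the agreement step used above: every
 * process contributes its local flag and all get back the AND. */
static int agreement_like(MPI_Comm comm, int *flag)
{
    return MPI_Allreduce(MPI_IN_PLACE, flag, 1, MPI_INT, MPI_LAND, comm);
}
-------------------------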

or, to reduce the number of agreements, don't stop to agree immediately
after an error in the Bcast; fold the error into the single agreement at
the end of the iteration.
-------------------------
do {
  if (root) read_some_data_from_a_file(buffer);

  err = MPI_Bcast(buffer, ...., root, comm);
  if( err ) {
    done = 0; /* the Bcast failed locally, so we are not done */
  } else {
    /* Local success, so do my work until I finish it or hit a
     * process failure error */
    MPI_Comm_size(comm, &size);
    done = do_computation(buffer, size);
  }

  /* Alternatively, if do_computation() communicates over 'comm', then a
   * process that hit an error can call MPI_Comm_invalidate() here to
   * unblock peers stuck in pt2pt operations. But if do_computation() is
   * purely local, then we know every process will eventually emerge and
   * find its way to the agreement. (A sketch of such a do_computation()
   * follows this block.) */
  MPI_Comm_agreement(comm, &done);
  if( !done ) {
    /* Either the Bcast failed or do_computation() did not finish;
     * either way, repair the communicator and go back to try again */
    MPI_Comm_invalidate(comm);
    MPI_Comm_shrink(comm, &newcomm);
    MPI_Comm_free(&comm);
    comm = newcomm;
    continue; /* try Bcast again */
  }
} while (!done);
MPI_Finalize();
----------------------
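
If do_computation() does communicate over 'comm', one way to make it fit
this pattern is to fold any MPI error it sees into its return value, so a
failure simply shows up as "not done" at the agreement. A rough sketch (the
neighbor exchange and the two-argument signature are just illustrative
assumptions matching the calls above; 'comm' is assumed visible at file
scope):
-------------------------
/* Illustrative do_computation(): some local work plus a neighbor
 * exchange over 'comm'.  Any error from the exchange (including
 * MPI_ERR_PROC_FAILED or MPI_ERR_INVALIDATED) is folded into the
 * return value; the caller's agreement then decides uniformly. */
int do_computation(char *buffer, int size)
{
    int rank, left, right, err;
    char halo = 0;

    MPI_Comm_rank(comm, &rank);
    right = (rank + 1) % size;
    left  = (rank + size - 1) % size;

    /* ... purely local work on 'buffer' would go here ... */

    err = MPI_Sendrecv(buffer, 1, MPI_CHAR, right, 0,
                       &halo,  1, MPI_CHAR, left,  0,
                       comm, MPI_STATUS_IGNORE);

    return (MPI_SUCCESS == err);  /* done only if the exchange succeeded */
}
-------------------------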

Both your solution and the variations above are correct. It is just a
matter of which way works best with your code.

This does raise the subtle question of what happens if you try to do this
from an error handler callback function: is it ever safe to call 'shrink'
or 'agreement' inside the callback?

-- Josh

On Fri, Apr 6, 2012 at 12:32 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> Ignore the last email; there were bugs, and the Bcast should be in a while loop.
>
> -d
>
>       do {
>               int val;
>
>               if (root) read_some_data_from_a_file(buffer);
>
>               do {
>                       err = MPI_Bcast(buffer, .... ,root, comm);
>                       if (err) {
>                               val = FALSE;
>                               MPI_Comm_agreement(comm, &val);
>                               MPI_Comm_invalidate(comm);
>                               MPI_Comm_shrink(comm, &newcomm);
>                               MPI_Comm_free(&comm);
>                               comm = newcomm;
>                       } else {
>                               val = TRUE;
>                               MPI_Comm_agreement(comm, &val);
>                               if (!val) {
>                                       MPI_Comm_shrink(comm, &newcomm);
>                                       MPI_Comm_free(&comm);
>                                       comm = newcomm;
>                               }
>                       }
>               } while (!val);
>
>               MPI_Comm_size(comm, &size);
>
>               done = do_computation(buffer, size);
>
>               /* Let's agree for sure if we are done with the computation */
>
>               MPI_Comm_agreement(comm, &done);
>
>       } while (!done);
>
>       MPI_Finalize();
>
> On Apr 6, 2012, at 11:28 AM, Darius Buntinas wrote:
>
>>
>> Wouldn't you need to do something like this?
>>
>> -d
>>
>>        do {
>>
>>                if (root) read_some_data_from_a_file(buffer);
>>
>>                err = MPI_Bcast(buffer, .... ,root, comm);
>>                if (err) {
>>                        int val = FALSE;
>>                        MPI_Comm_invalidate(comm);
>>                        MPI_Comm_shrink(comm, &newcomm);
>>                        MPI_Comm_free(&comm);
>>                        comm = newcomm;
>>                        MPI_Comm_agreement(comm, &val);
>>                } else {
>>                        int val = TRUE;
>>                        MPI_Comm_agreement(comm, &val);
>>                        if (!val) {
>>                        MPI_Comm_shrink(comm, &newcomm);
>>                        MPI_Comm_free(&comm);
>>                        comm = newcomm;
>>                }
>>                MPI_Comm_size(comm, size);
>>
>>                done = do_computation(buffer, size);
>>
>>                /* Let's agree for sure if we are done with the computation */
>>
>>                MPI_Comm_agreement(comm, &done);
>>
>>        while (!done);
>>
>>        MPI_Finalize();
>>
>> On Apr 5, 2012, at 4:50 PM, David Solt wrote:
>>
>>> I have another question about MPI_Comm_agreement:
>>>
>>> If I want to do this:
>>>
>>>        ....
>>>        do {
>>>
>>>                if (root) read_some_data_from_a_file(buffer);
>>>
>>>                err = MPI_Bcast(buffer, .... ,root, comm);
>>>                if (err) {
>>>                        MPI_Comm_invalidate(comm);
>>>                        MPI_Comm_shrink(comm, &newcomm);
>>>                        MPI_Comm_free(&comm);
>>>                        comm = newcomm;
>>>                }
>>>
>>>                MPI_Comm_size(comm, size);
>>>
>>>                done = do_computation(buffer, size);
>>>
>>>                /* Let's agree for sure if we are done with the computation */
>>>
>>>                MPI_Comm_agreement(comm, &done);
>>>
>>>        } while (!done);
>>>
>>>        MPI_Finalize();
>>>
>>> This code can deadlock because some ranks may enter MPI_Comm_agreement while others detect an error in MPI_Bcast and call MPI_Comm_invalidate followed by MPI_Comm_shrink (assume that do_computation is really, really fast).   The call to MPI_Comm_invalidate will not allow the processes that have already entered MPI_Comm_agreement to leave that call (P543L45: "Advice to users. MPI_COMM_AGREEMENT maintains its collective behavior even
>>> if the comm is invalidated. (End of advice to users.)" ) and MPI_Comm_agreement cannot return an error due to the call to MPI_Comm_invalidate (P545L38:  "This function must not return an error due to process failure (error classes MPI_ERR_PROC_FAILED and MPI_ERR_INVALIDATED)...") .
>>>
>>> This would not work:
>>>        ....
>>>        do {
>>>
>>>                if (root) read_some_data_from_a_file(buffer);
>>>
>>>                err = MPI_Bcast(buffer, .... ,root, comm);
>>>                if (err) {
>>>                        MPI_Comm_invalidate(comm);
>>>                        MPI_Comm_shrink(comm, &newcomm);
>>>                        MPI_Comm_free(&comm);
>>>                        comm = newcomm;
>>>                }
>>>
>>>                MPI_Comm_size(comm, size);
>>>
>>>                done = do_computation(buffer, size);
>>>
>>>                /* Let's agree for sure if we are done with the computation */
>>>
>>>                MPI_Barrier(comm);   // don't check the error code, this is just to "catch" invalidate messages
>>>                MPI_Comm_agreement(comm, &done);
>>>
>>>        } while (!done);
>>>
>>>        MPI_Finalize();
>>>
>>> because a rank may enter the barrier, get knocked out by the call to invalidate and then go on to call MPI_Comm_agreement anyway.  So we can try the following:
>>>
>>>        do {
>>>
>>>                if (root) read_some_data_from_a_file(buffer);
>>>
>>>                err = MPI_Bcast(buffer, .... ,root, comm);
>>>                if (err) {
>>>                        MPI_Comm_invalidate(comm);
>>>                        MPI_Comm_shrink(comm, &newcomm);
>>>                        MPI_Comm_free(&comm);
>>>                        comm = newcomm;
>>>                }
>>>
>>>                MPI_Comm_size(comm, size);
>>>
>>>                done = do_computation(buffer, size);
>>>
>>>                /* Let's agree for sure if we are done with the computation */
>>>
>>>                err  = MPI_Barrier(comm);
>>>                if (err) {
>>>                        MPI_Comm_invalidate(comm);
>>>                        MPI_Comm_shrink(comm, &newcomm);
>>>                        MPI_Comm_free(&comm);
>>>                        comm = newcomm;
>>>                }
>>>                MPI_Comm_agreement(comm, &done);
>>>
>>>        } while (!done);
>>>
>>>        MPI_Finalize();
>>>
>>> But now we have done nothing more than move the problem down a few lines. Some ranks may pass the MPI_Barrier successfully and go on to MPI_Comm_agreement while others attempt to invalidate/shrink. Is there a solution to this problem? How can one safely use MPI_Comm_agreement and MPI_Comm_shrink in the same application?
>>>
>>> Thanks,
>>> Dave
>>
>
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



