[Mpi3-ft] MPI_ANY_SOURCE

Thu Oct 6 14:17:16 CDT 2011

The blocking MPI_Recv(ANY_SOURCE) example uses readers/writers locks:
------------------------------
while(!done) {
    reader_lock();
    if (my_cnt != recognize_cnt) {
        /* New failures were detected */
        /* check failed_group and decide if ok to continue */
        if (ok_to_continue(failed_group) == FALSE) {
            reader_unlock();
            break;
        }
        my_cnt = recognize_cnt;
    }
    err = MPI_Recv(..., MPI_ANY_SOURCE, ..., comm, ...);
    if (err == MPI_ERR_PROC_FAIL_STOP) {
        /* Failure case */
        reader_unlock();
        writer_lock();
        if (my_cnt != recognize_cnt) {
            /* another thread has already re-enabled wildcards */
            writer_unlock();
            continue;
        }
        MPI_Comm_reenable_any_source(comm, &failed_group);
        ++recognize_cnt;
        writer_unlock();
        continue;
    }
    reader_unlock();
    /* Process the received message */
}
------------------------------

I think we can do the same with the nonblocking version.

So in the example:
Thread 0           Thread 1
--------------------------------------
                  Irecv(AS, TagB)[B]
Irecv(AS,TagA)[A]
/******** Process N fails ***********/
Wait()
->Warn
->Reenable()
->Wait()
                  Irecv(AS, TagC)[C]
                  Waitall()
                  --> ???
--------------------------------------
When process N fails, both [A] and [B] are marked to 'warn'. The call
to Reenable() from Thread 0 only affects -newly- posted receives
following it. So [B] would still produce a warning at the waitall()
even though ANY_SOURCE is enabled for the communicator. [C] would
block as normal since it was posted after the Reenable. The user should
use a synchronization mechanism like readers/writers locks to decide
whether it is ok to post [C].
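
To make those semantics concrete, here is a toy model in plain C (no MPI
calls). Everything in it is hypothetical, including the model_* names.
It also assumes that a warning is delivered once per request and then
cleared, which is my reading of the Wait()->Warn->Reenable()->Wait()
sequence for [A] rather than anything we have written down:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_REQS 8

/* Toy model only; none of these names exist in MPI. */
typedef enum { MODEL_WOULD_BLOCK, MODEL_WARN } model_result;

typedef struct { bool warn; } model_req;

typedef struct {
    bool any_source_enabled;
    model_req *pending[MODEL_MAX_REQS];
    size_t npending;
} model_comm;

void model_comm_init(model_comm *comm)
{
    comm->any_source_enabled = true;
    comm->npending = 0;
}

/* Post a wildcard receive; it starts out 'to warn' only if wildcards
 * are currently disabled on the communicator. */
void model_irecv_any(model_comm *comm, model_req *req)
{
    assert(comm->npending < MODEL_MAX_REQS);
    req->warn = !comm->any_source_enabled;
    comm->pending[comm->npending++] = req;
}

/* A process failure marks every pending wildcard receive 'to warn'. */
void model_process_failed(model_comm *comm)
{
    comm->any_source_enabled = false;
    for (size_t i = 0; i < comm->npending; ++i)
        comm->pending[i]->warn = true;
}

/* Reenable affects only receives posted after it; pending marks stay. */
void model_reenable(model_comm *comm)
{
    comm->any_source_enabled = true;
}

/* Deliver the warning once (assumption), then behave like a normal
 * wait that would block until a message arrives. */
model_result model_wait(model_req *req)
{
    if (req->warn) {
        req->warn = false;
        return MODEL_WARN;
    }
    return MODEL_WOULD_BLOCK;
}
```

Replaying the two-thread timeline against this model gives a warning for
[A], then a normal wait for [A] after the Reenable, a warning for [B],
and a normal wait for [C].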

So a modified version of the example might look like:
------------------------------
while(!done) {
    reader_lock();
    if (my_cnt != recognize_cnt) {
        /* New failures were detected */
        /* check failed_group and decide if ok to continue */
        if (ok_to_continue(failed_group) == FALSE) {
            reader_unlock();
            break;
        }
        my_cnt = recognize_cnt;
    }
    MPI_Irecv(..., MPI_ANY_SOURCE, ..., comm, ..., &req);
    do {
      err = MPI_Wait(&req, &status);
      if (err == MPI_WARN_PROC_FAIL_STOP) {
          /* Failure case */
          reader_unlock();
          writer_lock();
          /* We could use the following call to check if
           * some other thread handled the warning */
          MPI_Comm_any_source_enabled(comm, &enabled);
          if( my_cnt != recognize_cnt) {
              /* another thread has already re-enabled wildcards */
              writer_unlock();
              reader_lock();
              continue;
          }
          MPI_Comm_reenable_any_source(comm, &failed_group);
          /* Decide if we should keep waiting or not */
          if (ok_to_continue(failed_group) == FALSE) {
            /* must cancel the request if we do not intend to wait */
            MPI_Cancel(&req);
            writer_unlock();
            reader_lock();
            break;
          }
          /* Otherwise keep waiting */
          ++recognize_cnt;
          writer_unlock();
          reader_lock();
          continue;
      }
    } while(err != MPI_SUCCESS);
    reader_unlock();
    /* Process the received message */
}
------------------------------

Does that address the threading problem?

We still need to hold the lock through completion (just like the
blocking case), but we do not restrict the number of receives that can
be posted, only that a single thread can re-enable ANY_SOURCE. And it
can only do so when all of the other threads are waiting on the
reader_lock() or writer_lock(), so there are no outstanding
MPI_Irecv(ANY_SOURCE) calls.

What do you all think?

-- Josh

On Thu, Oct 6, 2011 at 9:30 AM, Josh Hursey <jjhursey at open-mpi.org> wrote:
> Yeah that's a tough one...
>
> Thread 0           Thread 1
> --------------------------------------
>                   Irecv(AS, TagB)[B]
> Irecv(AS,TagA)[A]
> /******** Process N fails ***********/
> Wait()
> ->Warn
> ->Reenable()
> ->Wait()
>                   Irecv(AS, TagC)[C]
>                   Waitall()
>                   --> ???
> --------------------------------------
>
> So [A] gets a warning, and Thread 0 fixes the communicator. What
> should happen to [B] and [C]?
>  - Should [B] and [C] both wait normally, since the warning was handled?
>  - Should [B] return a warning (since it was logically before the
> Reenable()), and [C] wait normally?
>  - Should [B] and [C] both return a warning since they were not in the
> thread context in which the Reenable was called?
>
> The first solution seems to be what we have at the moment. The second
> seems to imply that Reenable() needs to iterate over all pending
> requests to mark them appropriately as 'to warn'. I don't think the
> third option is viable since I'm not sure the concept of 'thread
> context' is well founded in the standard - except for the 'main'
> thread.
>
> One solution is to say that the user is responsible for handling this
> in their application through some complex locking scheme. I'm not 100%
> convinced that this is a sound statement at the moment.
>
> Another is what Darius suggested, that we Reenable each AS receive
> request. We would still need a Reenable for the communicator so
> blocking Recv(AS) could complete successfully. So we may have to
> differentiate Reenable_blocking_AS(comm) and
> Reenable_nonblocking_AS(request). Seems like this could get messy.
>
>
>
> I've been thinking about two alternative designs that might be
> something to consider (not abandoning the warning solution, but just a
> thought experiment to try to find something more elegant).
>
> Alternative A:
> --------------
> Do nothing. ANY_SOURCE receives will wait for successful completion
> without being interrupted by emerging failures.
>
> The user is then responsible for canceling (or locally completing) the
> AS receive operation when they determine that it cannot complete. They
> would need to determine this out of band, possibly by polling in a
> separate thread. We would need to think through an example
> demonstrating how the user could do this.
>
> With this we also lose our mechanism for general notification.
> Previously we could use an MPI_Probe(ANY_SOURCE) to receive
> notification of new failure without modifying the matching semantics.
> With this we would need to introduce something new.
>
> I like this only because it makes the MPI specification/implementation
> simpler/easier. I don't like that it pushes the burden on the end
> user.
>
>
> Alternative B:
> --------------
> Per message callbacks. The user registers a special ANY_SOURCE warning
> callback. If there are failures in the communicator, the callback is
> fired and the user has to decide on a per-receive basis whether to
> cancel or continue waiting.
>
> The callback could be applied to both blocking and nonblocking receive
> operations, and have a signature of something like:
>  MPI_ANY_SOURCE_FAULT_CALLBACK(tag, comm, failed_grp)
> The user would return 0 to continue waiting, and -1 to cancel the
> operation and complete it with an error.
>
> The problem with this is that the callback may be difficult
> to associate with a specific receive and line in the code. So it may
> be difficult to figure out if it is ok to keep waiting from just the
> matching bits and the group of failed processes. We also need to
> specify what MPI calls can be used in the callback - I would say it is
> restricted to something like MPI_Op, only local operations.
>
>
>
> On Wed, Oct 5, 2011 at 3:10 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>>
>> I suspect that we'd get thread issues with the warning approach:
>>
>>    Two threads post AS receives.
>>    Thread A does a Wait and gets the warning, then does a reenable_AS
>>    Thread B does a Wait and hangs because all of its senders are dead
>>
>> We didn't have this problem before because thread B's receives would have completed with an error and so it would have gotten an error when it tried to wait on it.
>>
>> We handled this case by using reader-writer locks to prevent another thread from reenabling anysources between when the user checks that potential senders are alive and when it posts the receive.  Now we'd need to hold the lock between when we check for senders and when the request actually completes.
>>
>> Does that mean we need to explicitly reenable each AS receive?
>>
>> -d
>>
>>
>> On Oct 5, 2011, at 1:34 PM, Josh Hursey wrote:
>>
>>> Currently we state:
>>> ----------------------
>>> 17.6 Point-to-Point Communication
>>> ...
>>> When a process detects a new process failure, the ability to perform
>>> wildcard receives (i.e., receives where MPI_ANY_SOURCE has been
>>> specified for the source parameter) will be disabled on all
>>> communicators that contain the failed process. When wildcard receives
>>> are disabled on a communicator, all pending wildcard receive
>>> operations on that communicator are completed and an error with class
>>> MPI_ERR_PROC_FAIL_STOP will be returned for those operations. Any new
>>> wildcard receive operations posted to a communicator with disabled
>>> wildcard receives will be immediately completed and return an error
>>> code of the class MPI_ERR_PROC_FAIL_STOP.
>>> Wildcard receives can be re-enabled with the
>>> MPI_COMM_REENABLE_ANY_SOURCE function described below.
>>> ----------------------
>>>
>>> The problem is that by completing the pending wildcard receives with
>>> an error, we may cause unintended matching of concurrent receives. For
>>> example:
>>> -------------------
>>> Proc 0                Proc 1
>>> Irecv(ANY,TAGX); [A]
>>> Irecv(1, TAGX); [B]
>>> Waitall()
>>> /******* Proc 2 fails *********/
>>>                      Send(0, TAGX) [C]
>>>                      Send(0, TAGX) [D]
>>> -------------------
>>> The intention is that Irecv[A] will match Send[C] and Irecv[B] will
>>> match Send[D]. But if another process fails (proc 2 in this example),
>>> then Irecv[A] will complete in error. Then Irecv[B] will match
>>> Send[C], and Send[D] will remain unmatched.
>>>
>>>
>>> We need a mechanism that provides notification to the application
>>> waiting on an ANY_SOURCE receive that some process failed in this
>>> communicator while still providing the necessary matching guarantees.
>>> The user can then choose if they want to cancel the operation or
>>> continue waiting.
>>>
>>>
>>> On the teleconference today, we were thinking through a warning based
>>> concept. Instead of completing the pending nonblocking ANY_SOURCE
>>> receives, we would return a special warning code (say
>>> MPI_WARN_PROC_FAIL_STOP). The message would -not- be completed, but
>>> return to the user to decide if they wish to continue waiting or
>>> cancel the offending receive operation.
>>>
>>> For blocking ANY_SOURCE receives, the user would be returned an error
>>> (MPI_ERR_PROC_FAIL_STOP), and the operation would be completed in
>>> error. Since we do not have a request handle, there is no way to keep
>>> this receive active while the situation is resolved. So we complete
>>> the receive with an error, in a sense canceling the receive operation.
>>>
>>> For nonblocking ANY_SOURCE receives, the user would be returned a
>>> warning (MPI_WARN_PROC_FAIL_STOP), and the request would remain
>>> active. The application would then be able to either re-enable
>>> ANY_SOURCE (via the API call) or cancel the offending receive
>>> operation. The user is in control over the matching guarantees since
>>> the request remains active, but cannot complete without action from
>>> the user.
>>>
>>>
>>> In the example above, Proc 0 would return MPI_ERR_IN_STATUS from
>>> MPI_WAITALL. The status of the active requests would be
>>> MPI_ERR_PENDING (per section 3.7.3), except for the offending
>>> ANY_SOURCE operations which will have the error field set to
>>> MPI_WARN_PROC_FAIL_STOP. So Irecv[A] would be MPI_WARN_PROC_FAIL_STOP, and
>>> Irecv[B] would be MPI_ERR_PENDING. The user can then associate
>>> the error with a specific request. Once a request is identified the
>>> user can either cancel it, or re-enable ANY_SOURCE on the communicator
>>> and continue waiting.
>>>
>>>
>>> So I think this fixes the matching issue with the interface, but I'm
>>> not sure if it addresses the concerns of the hardware matching folks.
>>> Additionally, this proposal adds a new state to the request concept
>>> 'active but not completable without action by the user' and we need to
>>> think if this requires semantic qualification elsewhere in the
>>> standard.
>>>
>>>
>>> What do folks think about this solution?
>>>
>>> -- Josh
>>>
>>> --
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>>> _______________________________________________
>>> mpi3-ft mailing list
>>> mpi3-ft at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>
>>
>>
>>
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey